anuga-community / anuga_core

ANUGA for the simulation of the shallow water equations
https://anuga.anu.edu.au

Issue in running Anuga on higher number of processes #10

Closed samcom12 closed 1 year ago

samcom12 commented 2 years ago

Hi, I have tried installing ANUGA with several compiler and MPI combinations: GCC with OpenMPI, GCC with MPICH, Intel compilers with Intel MPI, and Intel oneAPI compilers with Intel oneAPI MPI.

The simulation works well up to 16 nodes with 48 MPI processes per node, i.e. 768 processes. However, when scaling to more nodes, e.g. 32 nodes with 48 MPI processes per node, it fails with messages like those below under every combination above.


 File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 786, in exec_module
  File "<frozen importlib._bootstrap_external>", line 922, in get_code
  File "<frozen importlib._bootstrap_external>", line 979, in get_data
OSError: [Errno 24] Too many open files: '/home/apps/spack/opt/spack/linux-centos7-cascadelake/gcc-10.3.0/py-scipy-1.7.1-hmjpuclzfylrfolqqwodbnjnl76y6nbw/lib/python3.9/site-packages/scipy/interpolate/_fitpack_impl.py'
Abort(135378319) on node 5906 (rank 5906 in comm 0): Fatal error in PMPI_Mprobe: Other MPI error, error stack:
PMPI_Mprobe(117)..............:  MPI_Mprobe(source=0, tag=1, comm=MPI_COMM_WORLD, message=0x7fff4600cf10, status=0x7fff4600cf20)
PMPI_Mprobe(100)..............:
MPID_Mprobe(188)..............:
progress_test(100)............:
MPIDI_OFI_handle_cq_error(951): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(673297807) on node 5909 (rank 5909 in comm 0): Fatal error in PMPI_Mprobe: Other MPI error, error stack:
PMPI_Mprobe(117)..............:  MPI_Mprobe(source=0, tag=1, comm=MPI_COMM_WORLD, message=0x7fff8405e110, status=0x7fff8405e120)
PMPI_Mprobe(100)..............:
MPID_Mprobe(185)..............:
MPIDI_improbe_safe(135).......:
MPIDI_improbe_unsafe(79)......:
MPIDI_OFI_do_iprobe(69).......:
MPIDI_OFI_handle_cq_error(951): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(269596047) on node 5917 (rank 5917 in comm 0): Fatal error in PMPI_Mprobe: Other MPI error, error stack:
PMPI_Mprobe(117)..............:  MPI_Mprobe(source=0, tag=1, comm=MPI_COMM_WORLD, message=0x7ffd5a330030, status=0x7ffd5a330040)
PMPI_Mprobe(100)..............:
MPID_Mprobe(188)..............:
progress_test(100)............:
MPIDI_OFI_handle_cq_error(951): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(940684687) on node 5921 (rank 5921 in comm 0): Fatal error in PMPI_Mprobe: Other MPI error, error stack:
PMPI_Mprobe(117)..............:  MPI_Mprobe(source=0, tag=1, comm=MPI_COMM_WORLD, message=0x7ffcd06d0610, status=0x7ffcd06d0620)
PMPI_Mprobe(100)..............:
MPID_Mprobe(188)..............:
progress_test(100)............:
MPIDI_OFI_handle_cq_error(951): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(806466959) on node 5925 (rank 5925 in comm 0): Fatal error in PMPI_Mprobe: Other MPI error, error stack:
PMPI_Mprobe(117)..............:  MPI_Mprobe(source=0, tag=1, comm=MPI_COMM_WORLD, message=0x7ffc0c9d5610, status=0x7ffc0c9d5620)

I have run examples for MPI4PY with all combinations and it runs well on a large number of nodes like 32.
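One hedged diagnostic for the [Errno 24] failure above is to check the per-process open-file limit from Python (a sketch using the standard resource module, POSIX only; on HPC clusters the limit is usually raised in the batch script with `ulimit -n` instead):

```python
import resource

# Inspect the per-process open-file limit behind "[Errno 24] Too many open files".
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%s hard=%s" % (soft, hard))

# The soft limit can be adjusted up to the hard limit without root privileges.
new_soft = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```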

stoiver commented 2 years ago

@samcom12

Strange problem. Do your error messages give any more hint as to where in the anuga code the error occurs?

I notice that the error seems to occur when calling a scipy procedure:

scipy/interpolate/_fitpack_impl.py

Could you provide your anuga script? That might give some idea of what is happening.

samcom12 commented 2 years ago

Hi @stoiver Thanks for your prompt response.

I have tried with a sample parallel example as well; it runs well for 16 nodes and fails for 64 nodes. Can you please share a test case which we can try out on larger MPI process counts? You can refer to the attached log files FYR.

mahanadidelta.13302.txt mahanadidelta.13304.txt slurm_job.txt

stoiver commented 2 years ago

Hi @samcom12

I have updated the example you were using, run_parallel_sw_rectangular_tsunami.py in anuga_core/examples/parallel, so that it should run with more processors. I just changed the refinement of the triangulation to increase as the number of processes is increased. You will have to pull the latest version from anuga_community (branch main).

samcom12 commented 2 years ago

Hi @stoiver

I tried with your changes in the main branch. Still, it hangs up in the distribution of domains.

Please see the log slurm-out.txt FYR.

stoiver commented 2 years ago

Hi @samcom12,

The error is saying that at least one of the partitions created by metis has no triangles. Your mesh has 6400 triangles, which is too small to subdivide onto 2304 processes. You need to try an example with more triangles. As a back-of-the-envelope recommendation, I would suggest subdividing so that you have at least 1000 triangles on each process. That provides an upper limit on the number of processes for a particular mesh. If your example is based on the old version of run_parallel_sw_rectangular_tsunami.py, then you can increase the number of triangles by decreasing the size of dx and dy in the code.
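The 1000-triangles-per-process rule of thumb can be expressed as a quick pre-flight check (a sketch; the threshold is the heuristic from this comment, not an ANUGA API):

```python
def max_processes(num_triangles, min_triangles_per_process=1000):
    """Rule-of-thumb upper bound on MPI processes for a given mesh size."""
    return max(1, num_triangles // min_triangles_per_process)

# The 6400-triangle example mesh supports only a handful of processes,
# nowhere near the 2304 requested in the failing run.
print(max_processes(6400))          # -> 6
print(max_processes(6400) < 2304)   # -> True: the run was oversubscribed
```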

The new run_parallel_sw_rectangular_cross.py and run_parallel_sw_rectangular_tsunami.py have been updated to increase the number of triangles in a mesh in concert with an increase in the number of processes being requested.

By the way, I would suggest using the DE0 (the default) or the DE1 algorithm. The 'tsunami' algorithm is an old algorithm and is not recommended. DE0 is more diffusive than DE1, but runs about twice as fast as DE1. For a tsunami simulation I would suggest that DE0 is sufficient.

samcom12 commented 2 years ago

Hi @stoiver,

Thanks for your valuable comments. With our dataset, running on 64 nodes, it keeps executing indefinitely in the interpolate function, whereas the earlier Python 2 implementation of ANUGA works well for the same case. You can refer to the logs attached.

slurm-13395.txt

Profiling output- output

samcom12 commented 2 years ago

Hi @stoiver ,

Can you look at the similar thread at https://groups.google.com/g/mpi4py/c/liqIEMDAYL8?pli=1

The above thread talks about the comm.recv function:

You have to use comm.recv(None, source, tag).

If you use comm.recv(N, ...), where N is an integer, it means a totally different thing (an upper-bound to the number of bytes for the incoming pickled message).

This "feature" is really error prone; it is there as a workaround for old times where MPI implementations did not have matched probes (introduced in MPI-3), and receiving pickled messages was not thread safe (something I fixed recently if you run with pre-MPI-3 implementations).

slurm_out.txt
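The distinction matters because mpi4py's lowercase recv transfers a pickled object, so an integer first argument is interpreted as an upper bound on the pickle's byte size, not an element count. A quick, MPI-free sketch of why a guessed bound is fragile:

```python
import pickle

# A modest Python object can pickle to far more bytes than its "length" suggests.
message = {"triangles": list(range(1000)),
           "quantities": ["stage", "xmomentum", "ymomentum"]}
nbytes = len(pickle.dumps(message))
print(nbytes)

# comm.recv(N, source, tag) would treat N as a bound on this byte count;
# comm.recv(None, source, tag) lets mpi4py size the buffer itself (via matched
# probes on MPI-3), which is why the thread recommends passing None.
```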

dalcinl commented 2 years ago

@samcom12 You submitted an mpi4py issue referencing this one. I'm a bit confused. What am I supposed to look at? Is there any actual bug in mpi4py I should take care of? Or am I being asked for advice? Most of the times users have issues with mpi4py, the root of the problem is the backend MPI implementation, and there is very little I can do in these cases. Sometimes, the issues are related to mpi4py being used incorrectly, mostly because of bad docs and/or legacy (mis)features.

samcom12 commented 2 years ago

@dalcinl Previously, ANUGA was using pypar for parallelization. Now it uses mpi4py, via an abstraction file that maps pypar calls to mpi4py calls:

anuga/utilities/parallel_abstraction.py

Can you look at the above file and check whether everything is per mpi4py syntax?

I have found a similar issue at https://groups.google.com/g/mpi4py/c/liqIEMDAYL8?pli=1

Otherwise, I have tried with mpi4py examples which work well for a large number of mpi ranks as well.

stoiver commented 2 years ago

@samcom12 just attending a workshop this week, so not a lot of time to think about your problem.

But can you remind me where we stand?

It seems that you have a code that falls over using an interpolation procedure (in parallel). Is this before or after the mesh has been partitioned and distributed?

I wonder if there is a problem with opening the file tide_data/output/paradip_tide_15_09_2021.tms multiple times?

It would be great to see the python script you are using.

I can't see that we are calling comm.recv with a buffer. We are using comm.Recv with a buffer, but that should work as the buffer should be a numpy array and so the size should be obtained automatically.

Cheers Steve

stoiver commented 2 years ago

@samcom12 thanks for the copy of the script. Looks ok.

Couple of things.

(1) could you run one of the new parallel example problems, for instance run_parallel_sw_rectangular_cross.py and document the results.

(2) For your script could you add in some barrier calls after the distribute call and after the setup of the boundary conditions.

(3) Indeed could you try running without the setup of the Time boundary. Just use the reflective BC.

Could you report back on the outcomes.

Cheers Steve

samcom12 commented 2 years ago

Hi @stoiver,

I tried running the run_parallel_sw_rectangular_cross.py test case. With 3072 MPI ranks, it gets some mpi4py-related errors.

Please see attached logs FYR.

slurm-7618.out.txt run.sh.txt parallel_sw_rect_7618.log.txt anuga_log.txt

stoiver commented 2 years ago

@samcom12 thanks for the extra information.

Really don't know why we are getting the error. At least your results have narrowed it down to a particular send/recv combination, though the particular collection of objects being communicated is quite complex. I think to track down the problem I will have to break the complex communication call down into separate communication calls.

Unfortunately that will take a bit of time.

But in the meantime it would be great if you could run the script again and see if it fails at the same point, i.e. process 1045.

Cheers Steve

samcom12 commented 2 years ago

Thanks, @stoiver, for your instant reply. I ran it 4 times before posting here; it either fails at rank 1044 or 1045.

(screenshot of the failing ranks attached)

samcom12 commented 2 years ago

When I try it with Intel MPI and the Intel compiler, it just executes indefinitely. With MPICH and GCC, it encounters the above error and exits. I will look up the respective error codes and their meaning in both implementations.

samcom12 commented 2 years ago

Hi @stoiver,

Attaching some more logs FYR: slurm-7646.txt slurm-7645.txt slurm-7644.txt slurm-7643.txt slurm-7642.txt slurm-7641.txt slurm-7636.txt slurm-7634.txt slurm-7633.txt slurm-7632.txt slurm-7631.txt slurm-7630.txt slurm-7629.txt

samcom12 commented 2 years ago

Hello @stoiver,

We can make changes according to the MPI implementation being used. Please see this snippet of the code.

import ctypes

try:
    # --- If mpi4py is going to be used, this needs to be imported
    from mpi4py import MPI
    # --- Choose the MPI_Comm ctype depending on MPICH (int) or OpenMPI (void*)
    if MPI._sizeof(MPI.Comm) == ctypes.sizeof(ctypes.c_int):
        _MPI_Comm_type = ctypes.c_int
    else:
        _MPI_Comm_type = ctypes.c_void_p
except ImportError:
    MPI = None
    _MPI_Comm_type = ctypes.c_void_p
stoiver commented 2 years ago

@samcom12 Are you suggesting the above code should be added to parallel_abstraction.py?

samcom12 commented 2 years ago

@stoiver Yes, this code just handles the communicator type for different MPI implementations, but it doesn't solve our original problem. I have gone through some other codes that use mpi4py, like:

https://github.com/pypr/pysph https://github.com/ECP-WarpX/WarpX

And here in both codes, they have a load balancer and better exception handling.

Also, I have gone through the mpi4py thread on handling different MPI runtimes at https://github.com/mpi4py/mpi4py/issues/133

Can you please share the developer doc for Anuga with mpi4py so that I can also try some experiments from our end?

samcom12 commented 2 years ago

Hi @stoiver, I have tried with the recent commit. It's still failing for larger MPI rank counts.

stoiver commented 2 years ago

@samcom12, just a thought. We can run parallel jobs by first running a sequential script that creates the original domain, partitions it, and dumps pickled files for each partition. Then fire up a parallel job which loads each pickled file and creates the parallel domain. This is done via the sequential distribution code.

stoiver commented 2 years ago

@samcom12, I just added an example script, anuga_core/examples/parallel/run_sequential_dump_parallel_load.py which uses pickled files to communicate the partitioned domains. Indeed you can break the script into two, first create the large sequential domain and partition it, and another script to load in the partitions and run the evolution.

stoiver commented 2 years ago

@samcom12 It would be possible to run the sequential component of anuga_core/examples/parallel/run_sequential_dump_parallel_load.py on a large memory node and the parallel evolve part of the script on smaller memory nodes.

samcom12 commented 2 years ago

Thanks, @stoiver. Will try it out and let you know the results.

stoiver commented 2 years ago

@samcom12 Just reviewing your original problem where the error seemed to have been caused by an OSError associated with too many open files. That could be quite subtle as it was referring to a scipy interpolation file. I wonder if that has something to do with where you have installed scipy. It should be installed where it is efficiently available to processes running on any node.

samcom12 commented 2 years ago

Thanks, @stoiver. I have installed scipy on both a shared home mounted on Lustre and a home dir mounted on NFS. Today I tried a new setup with the latest changes to the anuga repo. It runs well for 32 nodes and 64 nodes now. But it doesn't scale at all: for 32 nodes (32x48 MPI ranks) it took 115 minutes, and with 64 nodes (64x48 MPI ranks) it takes 111 minutes. Good that earlier it was not running at all, but now it at least runs.

stoiver commented 2 years ago

@samcom12, good to hear that the code is working.

Can you give some extra details on your model? How many triangles are in your mesh? Can you give a breakdown of how long the creation and distribution of the original sequential domain takes, and then how long the evolve loop takes?

By the way, do you have timings from the old Python 2.7 ANUGA code?

In the old ANUGA python 2.7 code (pre mpi4py) we used MPI Isend and Irecv which had good scalability. In the current code we are using the mpi4py Sendrecv which may be the problem.

By the way, thanks for continuing to help with improving the current version of the code. It is very useful in particular as I don't have easy access to a large computer at the moment.

samcom12 commented 2 years ago

Hi @stoiver, Greetings of the Day!!

Did you get time to look into the ANUGA runs on 32 and 64 nodes? Let me know if there are any other parameters we can try.

stoiver commented 2 years ago

Hi Samir,

Seems that your problem is not scaling well. Unfortunately I don't have easy access to a large machine at the moment. My allocation expired last year and I haven't got around to reapplying yet. So might need your help to run simulations.

One question: are you doing anything involving communication within the evolve loop, i.e. running lots of inlet operators, etc.? If so, it would be instructive to run a short simulation without any operators which use communication. A rate operator should be OK to add water over the whole domain.

Cheers Steve


samcom12 commented 2 years ago

Hi @stoiver,

In the EVOLVE step, we are using some rain operators. At each yield step, we are using different rainfall data files.

Is there a way to do it in an efficient way? I have seen linear scaling without the use of rainfall data.

Cheers, Samir Shaikh

stoiver commented 2 years ago

@samcom12, great to hear that you are getting linear scaling without the rainfall data.

I can imagine ways in which opening data files each yield step could cause problems.

Could you provide a simple version of your evolve loop, to see what you are doing and to see if we can find a scalable way to implement it. One idea that immediately comes to mind is to open your data file on one processor (say myid == 0) and then use MPI to communicate the info to the other processors.
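The read-on-one-rank-then-broadcast idea might look like the sketch below (my illustration, not ANUGA code; the FakeComm stub is only there so the pattern runs without MPI — with mpi4py installed you would use MPI.COMM_WORLD directly):

```python
import csv
import io

try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
except ImportError:
    # Single-process stand-in so the pattern can be demonstrated without MPI.
    class FakeComm:
        def Get_rank(self):
            return 0
        def bcast(self, obj, root=0):
            return obj
    comm = FakeComm()

def load_rain_rates(comm, csv_text):
    """Read the rain data once on rank 0, then broadcast the parsed values."""
    rates = None
    if comm.Get_rank() == 0:
        # Only rank 0 touches the data source (here an in-memory CSV for the demo),
        # avoiding thousands of simultaneous opens of the same file.
        rates = [float(row[0]) for row in csv.reader(io.StringIO(csv_text))]
    # Every other rank receives the parsed list from rank 0.
    return comm.bcast(rates, root=0)

rain = load_rain_rates(comm, "1.5\n0.0\n2.25\n")
print(rain)  # [1.5, 0.0, 2.25]
```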

samcom12 commented 2 years ago

Thanks, @stoiver for a quick reply.

Below is a snippet of the EVOLVE loop we used.

# ------------------------------------------------------------------------------
# Evolution
# ------------------------------------------------------------------------------
if myid == 0 and verbose: print('EVOLVE')

barrier()
import time

t0 = time.time()
rain_set_zero = True
for t in domain.evolve(yieldstep=yieldstep, finaltime=finaltime):
    if myid == 0: domain.write_time()
    rplStr = str(t).replace(".", "_")
    stamp = Current_Date.strftime(TIMEFORMATCU)

    # Candidate rainfall files, tried in priority order:
    # (label, path, csv line offset, scaling factor, clears the rain_set_zero flag)
    rain_sources = [
        ("imd daily",
         "rainfall_data/imd/daily/pt_data_bhubaneshwar/%s/imd_%s_%s.csv" % (stamp, stamp, rplStr),
         1, imd_rainfall_factor_pt_bhub, True),
        ("imd daily 25",
         "rainfall_data/imd/daily/rgdata_rain_25/%s/imd_%s_%s.csv" % (stamp, stamp, rplStr),
         3, imd_rainfall_factor_rgdata_rain25, True),
        ("imd wrf",
         "rainfall_data/imd/wrf/%s/imd_%s_%s.csv" % (stamp, stamp, rplStr),
         1, imd_rainfall_factor_wrf, False),
        ("imd gfs",
         "rainfall_data/imd/gfs/%s/imd_%s_%s.csv" % (stamp, stamp, rplStr),
         1, imd_rainfall_factor_gfs, False),
        ("NOAA GPM",
         "rainfall_data/gpm/%s/gpm_%s_%s.csv" % (stamp, stamp, rplStr),
         1, gpm_rainfall_factor, False),
        ("NOAA GFS",
         "rainfall_data/gfs/%s/gfs_%s_%s.csv" % (stamp, stamp, rplStr),
         1, gfs_rainfall_factor, False),
    ]

    if not rain_set_zero and len(np.where(daily_time == t)[0]) == 1:
        rain_set_zero = True
    if rain_set_zero:
        for label, rain_path, offset, factor, clears_flag in rain_sources:
            if not os.path.exists(rain_path):
                continue
            if myid == 0:
                print("Setting up %s rainfall file %s !!" % (label, rain_path))
            # Look up the rain rate at the three vertices of every local triangle.
            triangle_index_rain = []
            triangle_index_elevation_main_rain = []
            for tri_index in range(len(domain)):
                vertices = domain.get_triangles(tri_index)
                triangle_index_rain.append(tri_index)
                triangle_index_elevation = [
                    np.double(linecache.getline(rain_path, vertices[i] + offset)) * factor
                    for i in range(3)]
                triangle_index_elevation_main_rain.append(triangle_index_elevation)
            domain.set_quantity('Rain',
                                numeric=triangle_index_elevation_main_rain,
                                use_cache=cache,
                                verbose=True,
                                alpha=0.1, indices=triangle_index_rain,
                                location='vertices')  # alpha: refer user manual page 98
            linecache.clearcache()
            rain_opertor.set_rate(rate=Q)
            if clears_flag:
                rain_set_zero = False
            break
        else:
            # No rainfall file found for this yield step.
            print("The rainfall IMD/GPM/GFS files do not exist; setting rain to zero !!")
            domain.set_quantity('Rain', 0.00)
            rain_opertor.set_rate(rate=Q)
    else:
        if myid == 0: print("Using previously set Daily Rainfall!!")
stoiver commented 2 years ago

Hi @samcom12,

Trying to get my head around your evolve loop. It seems that you are updating your rainfall rate quantity every yield step. Do you want to change the rain data file every yield step? If not, you should pull that out of the evolve loop and set up the value of the rain quantity before the evolve loop.

Could you explain the format and content of the CSV rain-rate files? They seem to hold the rain rate at each of the vertices of the mesh. How do you deal with the change of vertex indexing when the domain is partitioned? There is a low-level array associated with partitioned domains, domain.node_l2g, which should provide a translation between local vertex indices and their global indices. It might even be sensible to create the rain-rate quantity before distributing.
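As a minimal sketch of what that translation would look like (assuming `domain.node_l2g` is a local-to-global vertex index array, stubbed here as a plain numpy array), each process could pick out its sub-domain's vertex values from a global per-vertex rain array read once from the original-ordering CSV:

```python
import numpy as np

# Sketch (assuming domain.node_l2g is a local-to-global vertex index
# array): select the local sub-domain's vertex rain values from a
# global per-vertex array in original-domain ordering.
global_rain = np.array([0.1, 0.2, 0.3, 0.4, 0.5])  # indexed by global vertex id
node_l2g = np.array([4, 0, 2])                     # stand-in for domain.node_l2g
local_rain = global_rain[node_l2g]                 # values for local vertices
print(local_rain)  # → [0.5 0.1 0.3]
```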

Cheers @stoiver

Girishchandra-Yendargaye commented 2 years ago

Hello @stoiver ,

Yes, you have pointed that out correctly. We receive a rainfall forecast every 3 hours, so to reflect the real scenario (and to maintain accuracy) we pass the data in this way at each step instead of passing it as cumulative data once before distribute.

Yes, the rainfall CSV holds the rain data for each vertex. vertices = domain.get_triangles gives correct indices; we have checked this by passing rainfall only for a certain area on our side. Kindly suggest if we are doing it wrongly, and please also suggest other efficient ways/methods (if any) to change the data every yieldstep, as the data is ready before the simulation runs.

Thank you, Girish

stoiver commented 2 years ago

Hi @Girishchandra-Yendargaye, what is the value of your yieldstep? Is it 3hr or shorter?

samcom12 commented 2 years ago

Hi @stoiver, we are using a yieldstep of 3 hr.

stoiver commented 2 years ago

@samcom12 and @Girishchandra-Yendargaye, a little more info about the structure of the rain input files would be useful. Are they CSV files with just one column, rain intensity, each row corresponding to a vertex of the mesh? Does that ordering correspond to the original or the partitioned domain?

Using a for loop to loop through each triangle of the mesh and then searching through the input files would be very time consuming.

Couple of suggestions:

First, use centroid values instead of vertex values. The rain operator applies the intensity over each triangle based on the centroid value, so you might as well go straight to centroid values.
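If per-vertex values are already available, one way to get centroid values (a sketch with hypothetical array names, taking the centroid value as the mean of the triangle's three vertex values) is plain numpy fancy indexing:

```python
import numpy as np

# Sketch (names hypothetical): derive one rain value per triangle from
# per-vertex values. `triangles` holds the three vertex ids of each
# triangle; `vertex_rain` is indexed by vertex id.
vertex_rain = np.array([0.0, 3.0, 6.0, 9.0])        # mm/hr per vertex
triangles = np.array([[0, 1, 2], [1, 2, 3]])        # vertex ids per triangle
centroid_rain = vertex_rain[triangles].mean(axis=1) # one value per triangle
print(centroid_rain)  # → [3. 6.]
```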

I assume your original rain intensity comes from a raster file of some sort, and you have then interpolated that info onto vertex values. I would recommend, during this pre-processing stage, to instead save centroid values into a numpy array (and store it as a binary via numpy.save if necessary). A file associated with each process could then be created. Reading in the data would then be much faster.
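A minimal sketch of that per-process binary workflow (filename pattern and rank/step numbers hypothetical): dump the centroid values once during pre-processing, then each process loads only its own file inside the evolve loop with a single fast binary read.

```python
import os
import tempfile
import numpy as np

# Sketch: during pre-processing, save centroid rain values for one
# forecast step and one MPI rank as a binary .npy file (filename
# pattern hypothetical).
rank, step = 3, 0
centroid_rain = np.linspace(0.0, 4.0, 8)  # mm/hr at local centroids
path = os.path.join(tempfile.gettempdir(),
                    f"rain_step{step:03d}_rank{rank:04d}.npy")
np.save(path, centroid_rain)

# Inside the evolve loop, each process reads back only its own file:
loaded = np.load(path)
```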

Another problem at the moment is the parsing of the CSV file simultaneously by all the processes. I think this could be the main problem. Maybe the file is locked by each process sequentially, which would cause a large bottleneck. My feeling is that reading a full file once will be reasonably efficient, but parsing the same file multiple times from multiple processes would be very slow.
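As an illustration of the single-read approach (a sketch, not the project's code), the whole one-column file can be parsed in one pass with numpy.loadtxt rather than one linecache.getline call per vertex; per-line reads also keep a file handle open on every rank, which is consistent with the "Too many open files" error reported at high process counts.

```python
import io
import numpy as np

# Sketch: parse the whole rain file in one pass. io.StringIO stands in
# for a real open file here so the example is self-contained.
csv_text = "0.5\n1.25\n2.0\n"             # one rain intensity per vertex
rain = np.loadtxt(io.StringIO(csv_text))  # single read, single parse
# rain is a 1-D array holding [0.5, 1.25, 2.0]
```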

So my rough suggestion would be: in the evolve loop, with a yieldstep of 3 hr, either:

Girishchandra-Yendargaye commented 2 years ago

@stoiver Thank you for looking into it. Here are the answers to your queries. Are they CSV files with just one column, rain intensity, each row corresponding to a vertex of the mesh? Answer: Yes, correct, only one column for rain intensity, and the ordering is that of the original domain.

Using a for loop to loop through each triangle of the mesh and then searching through the input files would be very time consuming. Answer: Yes, correct, but not by much as of now. Also, we are reading values using linecache, i.e. linecache.getline, so we hope it is not locking the file.

Regarding your suggestions,

  1. First use centroid values instead of vertex values ... OK, we will check and change it accordingly.
  2. I assume your original rain intensity comes from a raster file of some sort, and you have then interpolated that info onto vertex values ... Yes, correct.
  3. I would recommend during this pre-processing stage, to instead save centroid values into a numpy array (and store it as a binary via numpy.save if necessary) ... OK.
  4. A file associated with each process could then be created. Reading in the data would then be much faster ... OK.
  5. read in the original raster data which covers the complete original domain extent, then use the set_quantity method (with argument raster) to interpolate to the centroid values (location=centroids). This would assume your original raster data is UTM based (as opposed to lat long) ... we have multiple options for rainfall (GPM, GFS), so whichever is available at that time will be picked up.
  6. Let me know if your raster file is lat/long based and I should be able to put together a procedure fairly easily ... Yes, it is lat/long based.
  7. or pre-process your rain intensity to the centroid values of the sub-domains (either stored numpy arrays or dumped to files) and read them in at each yieldstep ... we do have a separate pre-processing step which generates the CSV file, so we will try this as well and let you know.