geodesymiami / insarmaps


ingest scripts in parallel? #62

Closed: falkamelung closed this issue 2 years ago

falkamelung commented 2 years ago

If we need options for multiple cores, we probably should use --num-processors or --num-workers.
We currently have these options in the control files for running the multiprocessing and dask modules, respectively:

miaplpy.multiprocessing.numProcessor = 20
mintpy.compute.numWorker            = 32
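
For illustration, a minimal sketch of how a --num-workers option could be wired up with argparse on one of the ingest scripts (the argument name, default, and help text here are assumptions, not the project's actual interface):

import argparse
import multiprocessing

def build_parser():
    # hypothetical CLI option, mirroring the --num-workers naming suggested above
    parser = argparse.ArgumentParser(description="Ingest HDF-EOS5 data into insarmaps")
    parser.add_argument("--num-workers", dest="num_workers", type=int, default=1,
                        help="number of worker processes used for the ingest")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    # cap at the available cores so an oversized request does not oversubscribe the node
    workers = min(args.num_workers, multiprocessing.cpu_count())
    print(f"Using {workers} worker process(es)")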

From: "Mirzaee, Sara" sara.mirzaee@rsmas.miami.edu Subject: Re: [EXTERNAL] [EXTERNAL] Insarmaps Date: July 27, 2022 at 11:31:17 PM EDT To: "Amelung, Falk C" famelung@rsmas.miami.edu

No, I use multiprocessing differently; it is implemented here: https://github.com/insarlab/MiaplPy/blob/main/miaplpy/phase_linking.py

I process each patch on one core and save the result to a numpy-format file, which is very fast. Then I concatenate them all at the end.
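
For readers unfamiliar with that pattern, here is a self-contained sketch (illustrative only, not the actual MiaplPy code): each worker handles one patch, writes its result to a .npy file, and everything is concatenated at the end.

import os
from multiprocessing import Pool

import numpy as np

def process_patch(patch_id):
    # stand-in for the per-patch computation (e.g. phase linking on one patch)
    result = np.random.rand(100, 100)
    out_file = f"patch_{patch_id:04d}.npy"
    np.save(out_file, result)        # numpy format is fast to write and read back
    return out_file

if __name__ == "__main__":
    num_patches, num_workers = 16, 4             # assumed values for illustration
    with Pool(num_workers) as pool:              # one patch per core at a time
        files = pool.map(process_patch, range(num_patches))
    # concatenate all patch results once every worker has finished
    full = np.concatenate([np.load(f) for f in sorted(files)], axis=0)
    for f in files:
        os.remove(f)                             # clean up the intermediate files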

falkamelung commented 2 years ago

An example of a file with many chunks where the data ingest takes very long is: /data/HDF5EOS/MaunaLoaHighResCskDT91/mintpy/CSK_SM_091_0000_20201007_20210706

stackTom commented 2 years ago

I've implemented this in json_mbtiles2insarmaps.py and hdfeos5_2json_mbtiles.py. Try it with --num_workers X.
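
A hypothetical invocation (the input file and output directory are placeholders, and the script's positional arguments may differ):

hdfeos5_2json_mbtiles.py --num_workers 8 <input HDF-EOS5 file> <output directory>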

If it's working well, let me know, and I can close this issue.

falkamelung commented 2 years ago

Cool! It got much faster! Thank you!

tippecanoe is not running in parallel. From what I see, it has an option for that. Maybe that would work as well?

It would actually be good to print the tippecanoe command to the screen before it is run (something like "Now running tippecanoe ..." followed by the command).
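
A small sketch of printing the command before running it (the flags shown, including tippecanoe's documented -P/--read-parallel option, are assumptions about how the ingest script might call tippecanoe, not its actual invocation):

import shlex
import subprocess

def run_tippecanoe(json_files, mbtiles_path):
    # -P / --read-parallel asks tippecanoe to read line-delimited GeoJSON input with
    # multiple threads; whether it helps here depends on how the JSON files are laid out
    cmd = ["tippecanoe", "-P", "-o", mbtiles_path] + list(json_files)
    print("Now running tippecanoe: " + " ".join(shlex.quote(c) for c in cmd))
    subprocess.run(cmd, check=True)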

I see that things got changed with the size of the chunks (there seem to be more chunks than before). Could you explain? Maybe we could say a few words on what those chunks are in the --help message?

stackTom commented 2 years ago

The chunk size was originally 20000 points. It was a good trade-off between speed and RAM usage that I found after trial and error.

Each process is now also told to create JSON for 20000 points per JSON file. However, the program doesn't know beforehand which points are NaN or not. So sometimes a process can only create a JSON file for fewer than 20000 points.

This is because without multiple processes, I just create a JSON file once we've found 20000 true (non-NaN) points. But with multiple processes, the way I've designed it is that we spawn the processes and every process gets a range of points to deal with (points 0-19999, 20000-39999, etc.), some of which might be NaN. Long story short: it's still the same number of points at the end; each file might just have fewer than 20k now. Making it so that each process writes exactly 20k points 1) won't offer any speedup and 2) will make the code much more complex and difficult to deal with. It might even make it slower.
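
A rough sketch of that range-per-worker scheme (names and structure are illustrative, not the actual hdfeos5_2json_mbtiles.py code): each slice of 20000 point indices drops its NaN points and writes whatever remains to its own chunk file, so a file may end up with fewer than 20000 points.

import json
import math

import numpy as np

CHUNK_SIZE = 20000

def write_chunk(points, chunk_id):
    sl = points[chunk_id * CHUNK_SIZE:(chunk_id + 1) * CHUNK_SIZE]
    valid = sl[~np.isnan(sl)]                    # some of the 20000 may be NaN
    with open(f"chunk_{chunk_id:05d}.json", "w") as f:
        json.dump(valid.tolist(), f)             # file may hold fewer than 20000 points
    return valid.size

if __name__ == "__main__":
    points = np.random.rand(65000)
    points[::7] = np.nan                         # pretend some points are NaN
    n_chunks = math.ceil(points.size / CHUNK_SIZE)
    # in the real script each chunk range would go to a separate worker process
    total = sum(write_chunk(points, i) for i in range(n_chunks))
    print(f"wrote {total} non-NaN points across {n_chunks} chunk files")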

stackTom commented 2 years ago

I don't really know what would go in the help message. "This program will create temporary json chunk files which, when concatenated together, comprise the whole dataset"?

falkamelung commented 2 years ago

Yes, that would be fine. And say that tippecanoe is used for that.

stackTom commented 2 years ago

Okay, I've made tippecanoe run in parallel, described what the chunk files are in the help message, and printed the tippecanoe command to the screen. Will close this issue as completed.

stackTom commented 2 years ago

Just made a minor change; please pull again if you've already done so.