Closed falkamelung closed 2 years ago
An example of a file with many chunks where the data ingest takes very long is: /data/HDF5EOS/MaunaLoaHighResCskDT91/mintpy/CSK_SM_091_0000_20201007_20210706
I've implemented this in json_mbtiles2insarmaps.py and hdfeos5_2json_mbtiles.py. Try it with --num_workers X.
If working well, let me know, and I can close this issue.
Cool! It got very much faster! Thank you!
tippecanoe is not running in parallel. From what I can see, it has an option for that. Maybe that would work as well?
It would also be good to print the tippecanoe command to the screen before it is run (something like "Now running tippecanoe ..." followed by the command).
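As a minimal sketch of both requests (printing the command before running it, and enabling tippecanoe's parallel input reading): tippecanoe documents a `-P`/`--read-parallel` flag for reading line-delimited GeoJSON input with multiple threads. The function names and the exact flag set below are illustrative, not the actual ones used in `json_mbtiles2insarmaps.py`.

```python
import subprocess

def build_tippecanoe_cmd(json_files, mbtiles_out):
    # -P (--read-parallel) tells tippecanoe to read line-delimited
    # GeoJSON input with multiple threads; other flags are placeholders.
    return ["tippecanoe", "-P", "-o", mbtiles_out] + list(json_files)

def run_tippecanoe(json_files, mbtiles_out):
    cmd = build_tippecanoe_cmd(json_files, mbtiles_out)
    # Echo the command before running so the user sees what is executed.
    print("Now running tippecanoe:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```

Note that `--read-parallel` only helps when the input is line-delimited JSON, which fits the one-feature-per-line chunk files discussed here.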
I see that the size of the chunks changed (there seem to be more chunks than before). Could you explain? Maybe we could say a few words in the --help message about what those chunks are.
The chunk size was originally 20000 points. It was a good trade-off between speed and RAM usage that I found after trial and error.
Each process is now also told to create json for 20000 points per json file. However, the program doesn't know beforehand which points are nan. So sometimes a process can only create a json file for fewer than 20000 points.
This is because without multiple processes, I just create a json file once we've found 20000 true (non-nan) points. With multiple processes, the way I've designed it, we spawn the processes and every process gets a range of points to deal with (points 0-19999, 20000-39999, etc.), some of which might be nan. Long story short: it's still the same number of points at the end, just each file might have fewer than 20k now. Making each process handle exactly 20k non-nan points would 1) not offer any speedup and 2) make the code much more complex and harder to maintain. It might even make it slower.
I don't really know what would go in the help message. "This program will create temporary json chunk files which, when concatenated together, comprise the whole dataset"?
Yes, that would be fine. And say that tippecanoe is used for that.
Okay, I've made tippecanoe run in parallel, described what the chunk files are in the help message, and printed the tippecanoe command to the screen. Will close this issue as completed.
Just made a minor change, please pull again if you've already done so.
If we need options for multiple cores we probably should use --num-processors or --num-workers. We currently have these options in the control files for running the multiprocessing and dask modules, respectively.

From: "Mirzaee, Sara" sara.mirzaee@rsmas.miami.edu
Subject: Re: [EXTERNAL] [EXTERNAL] Insarmaps
Date: July 27, 2022 at 11:31:17 PM EDT
To: "Amelung, Falk C" famelung@rsmas.miami.edu
No, I use multiprocessing differently; it is implemented here: https://github.com/insarlab/MiaplPy/blob/main/miaplpy/phase_linking.py
I process each patch on one core and save the result to a numpy-format file, which is very fast. Then I concatenate them all at the end.
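The patch-based pattern described here can be sketched roughly as follows. This is not the actual MiaplPy code: the computation inside `process_patch` is a placeholder, and the function and file names are invented for illustration. The idea is only that each patch runs on one core, writes its own `.npy` file, and everything is concatenated at the end.

```python
import os
from multiprocessing import Pool

import numpy as np

def process_patch(args):
    # Hypothetical stand-in for a per-patch step (e.g. phase linking):
    # one patch is processed on one core and saved as its own .npy file.
    idx, patch, outdir = args
    result = patch * 2  # placeholder for the real per-patch computation
    path = os.path.join(outdir, f"patch_{idx}.npy")
    np.save(path, result)
    return path

def run_patches(patches, outdir, workers=2):
    # Distribute patches over worker processes, one patch per task.
    with Pool(workers) as pool:
        paths = pool.map(process_patch,
                         [(i, p, outdir) for i, p in enumerate(patches)])
    # Concatenate all per-patch results into the final array.
    return np.concatenate([np.load(p) for p in paths])
```

Writing intermediate results as `.npy` files keeps the workers independent (no shared state, cheap binary I/O), which is why this scales well.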