AmericanRedCross / street-view-green-view

option to interrupt and resume for large numbers of points #61

Open danbjoseph opened 2 months ago

danbjoseph commented 2 months ago

We need a way to interrupt and then resume both `assign_images` and `assign_gvi_to_points`.


I am testing a somewhat large area in Indonesia and the projected time for it to complete is more than 60 hours. Are the features stored in the geopackage (or other geo file) being written/updated as the process runs, or does it happen once at the end? It would be fantastic to have a "resume" option so that we could interrupt the process and then restart it, with it ignoring the points for which we have already downloaded/fetched an image.

Assigning Images to Points:   0%|                      | 1463/791462 [09:21<67:34:00,  3.25points/s]
danbjoseph commented 2 months ago

via @dragonejt "we only output the file at the end of the entire script run. To avoid this, we would have to catch the KeyboardInterrupt and output the file then, but catching the KeyboardInterrupt is usually not recommended I think."
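For reference, catching the interrupt to flush partial output would look something like the sketch below; the function names are placeholders, and this is the pattern being discouraged rather than a recommendation.

```python
# A sketch of "catch the KeyboardInterrupt and output the file then".
# Names here are hypothetical, and, as noted above, this is not the
# recommended approach.
import geopandas as gpd


def process_points(points: gpd.GeoDataFrame) -> None:
    ...  # stand-in for the real point-by-point matching loop


def run_with_flush(points: gpd.GeoDataFrame, out_path: str) -> None:
    try:
        process_points(points)
    except KeyboardInterrupt:
        # Write whatever has been computed so far, then re-raise so the
        # process still exits as interrupted.
        points.to_file(out_path, driver="GPKG")
        raise
    points.to_file(out_path, driver="GPKG")
```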

Are there other options? What about writing the file every 100 points or something? I guess this is even more complicated if there is a delay between the API call that finds an image and the actual download of the image (see the question in #67): on resume we would need to check all image filenames in the geopackage for a matching image file on disk, and not just pick up at rows without an image filename value?
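The "write every 100 points" idea could look roughly like the sketch below; `match_image`, the `image_id` column, and the loop structure are all hypothetical stand-ins for whatever `assign_images` actually does.

```python
# Hypothetical sketch of checkpointing every N points. `match_image` and
# the "image_id" column are invented for illustration.
import geopandas as gpd
import pandas as pd

CHECKPOINT_EVERY = 100


def match_image(geometry) -> str | None:
    ...  # stand-in for the real Mapillary image lookup


def assign_images_resumable(points: gpd.GeoDataFrame, out_path: str) -> None:
    for i, idx in enumerate(points.index):
        # On resume, skip rows that already got an image in a previous run.
        if pd.notna(points.at[idx, "image_id"]):
            continue
        points.at[idx, "image_id"] = match_image(points.geometry[idx])
        # Flush partial results so an interrupt loses at most
        # CHECKPOINT_EVERY points of work.
        if (i + 1) % CHECKPOINT_EVERY == 0:
            points.to_file(out_path, driver="GPKG")
    points.to_file(out_path, driver="GPKG")
```

One caveat: `to_file` rewrites the whole GeoPackage on every checkpoint, so this gets slower as the file grows, which is part of why a streaming-friendly format is attractive.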

jayqi commented 2 months ago

I agree that catching KeyboardInterrupt sounds like a weird thing to do.

There are certainly other options, though they would involve some bigger changes. For example, rather than storing the output data as a GeoPackage file, we could use a file-based database like SQLite or DuckDB with a geospatial extension, where we can write outputs as we go. Or we could write out the point-to-file mappings in a non-geospatial file format that supports streaming writes, like one JSON file per image metadata row, or JSONL for a single file.
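As a sketch of the JSONL option (field names invented here): each matched point appends one line, and a rerun skips IDs already present.

```python
# Sketch of streaming point-to-image mappings to a JSONL file; the
# "point_id"/"image_id" fields are invented for illustration.
import json
from pathlib import Path


def load_done_ids(jsonl_path: Path) -> set:
    """Collect point IDs already written, so a rerun can skip them."""
    if not jsonl_path.exists():
        return set()
    with jsonl_path.open() as f:
        return {json.loads(line)["point_id"] for line in f if line.strip()}


def append_match(jsonl_path: Path, point_id: int, image_id: str) -> None:
    # Appending one line at a time means an interrupt can at worst lose
    # (or truncate) the final record.
    with jsonl_path.open("a") as f:
        f.write(json.dumps({"point_id": point_id, "image_id": image_id}) + "\n")
```

The geometries would not need to live in this file at all; they could be rejoined from the input points file when producing the final GeoPackage.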

jayqi commented 2 weeks ago

Okay, here's an idea to consider that might not require changing our data structure or what we store on disk: we separate the image identification and the image download into two steps.

First, we match each point to an image (without downloading any images). This will still take some time, but I expect it should be much faster. The matches can then be saved.

Then we go through and download each image. If an image is already available locally, we can skip it.
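If the match step saves image IDs, the download step then gets resume behavior almost for free; `download_image` and the filename scheme below are hypothetical.

```python
# Sketch of the proposed second pass. `download_image` and the
# "<image_id>.jpg" naming are invented for illustration.
from pathlib import Path


def download_image(image_id: str, dest: Path) -> None:
    ...  # stand-in for the real Mapillary download


def download_missing(image_ids: list[str], image_dir: Path) -> None:
    image_dir.mkdir(parents=True, exist_ok=True)
    for image_id in image_ids:
        dest = image_dir / f"{image_id}.jpg"
        # Anything already on disk from a previous run is skipped.
        if dest.exists():
            continue
        download_image(image_id, dest)
```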

danbjoseph commented 2 weeks ago

That sounds like it would help with Mapillary. What about if we are processing a local folder of images?

jayqi commented 2 weeks ago

Is processing a local folder of images slow right now? I don't think we're moving or copying the images, or anything like that.

danbjoseph commented 1 week ago

I guess it took less than a minute for 1090 points, 1286 images, 268 matches:

Assigning Images to Points: 100%|███████| 1090/1090 [00:57<00:00, 19.10points/s]

If it stays that fast, then something like 80,000 points ÷ 20 points/s = 4,000 s, or just over an hour, which is not fast but is reasonable.