Open danbjoseph opened 4 months ago
via @dragonejt "we only output the file at the end of the entire script run. To avoid this, we would have to catch the `KeyboardInterrupt` and output the file then, but catching the `KeyboardInterrupt` is usually not recommended I think."
are there other options? what about writing the file every 100 points or something? i guess this is even more complicated if there is a delay between the API call to find an image and the actual download of the image (see the question in #67) - on resume we would need to check every image filename in the geopackage for a matching image file on disk, and not just pick up at the rows without an image filename value?
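A minimal sketch of the "write every 100 points" idea, not the project's actual code. The names `process_with_checkpoints`, `process_point`, and `save` are hypothetical: `process_point` stands in for whatever matches one point to an image, and `save` stands in for whatever writes the current results to disk.

```python
from typing import Callable, Iterable, List


def process_with_checkpoints(
    points: Iterable[dict],
    process_point: Callable[[dict], dict],
    save: Callable[[List[dict]], None],
    every: int = 100,
) -> List[dict]:
    """Process points one by one, calling save() after every `every` points."""
    results: List[dict] = []
    for count, point in enumerate(points, start=1):
        results.append(process_point(point))
        if count % every == 0:
            save(results)  # periodic checkpoint
    save(results)  # final write, also covers a trailing partial batch
    return results
```

If `save` rewrites the whole output file each time, a smaller `every` loses less work on interrupt but costs more I/O, so the interval is a trade-off rather than a fixed number.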
I agree that catching `KeyboardInterrupt` sounds like a weird thing to do.
There are certainly other options, but they would involve some bigger changes. For example, rather than storing the output data as a GeoPackage file, we could instead use a file-based database like SQLite or DuckDB with a geospatial extension and write the outputs there as we go. Or, we could write out the point-to-file mappings in a non-geospatial file format that supports streaming writes, like one JSON file per image metadata row, or a single JSONL file.
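To make the JSONL option concrete, here is a rough sketch using only the standard library. The function names and the `point_id` key are assumptions for illustration, not existing project names. Each processed point is appended as one line, and on resume the file is read back to find which points are already done.

```python
import json
from pathlib import Path
from typing import Set


def append_row(path: Path, row: dict) -> None:
    """Append one point-to-image mapping as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")


def load_done_ids(path: Path) -> Set[int]:
    """Return the point ids already recorded, or an empty set if no file yet."""
    done: Set[int] = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as f:
        for line in f:
            try:
                done.add(json.loads(line)["point_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # skip a blank or half-written line
    return done
```

Because every row is written as soon as it is produced, an interrupt loses at most the row in progress, and a half-written last line is simply ignored on the next run.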
Okay, here's an idea to consider that might not require changing our data structure or what we store on disk: we separate the image identification and the image download into two steps.
First, we match each point to an image (without downloading any images). This will still take some time, but I expect it should be much faster. The matches can then be saved.
Then, we subsequently go through and download each image. If an image is already available locally, then we can skip that image.
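A sketch of what the second step could look like, assuming the first step has already produced and saved a mapping from point id to matched image id. `download_missing` and the `download` callable are hypothetical; `download` stands in for the real Mapillary fetch, and the `.jpg` naming is an assumption.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def download_missing(
    matches: Dict[int, Optional[str]],
    image_dir: Path,
    download: Callable[[str, Path], None],
) -> int:
    """Download every matched image not yet on disk; return how many were fetched."""
    image_dir.mkdir(parents=True, exist_ok=True)
    fetched = 0
    for point_id, image_id in matches.items():
        if image_id is None:
            continue  # no image was matched to this point
        target = image_dir / f"{image_id}.jpg"
        if target.exists():
            continue  # already downloaded on an earlier run
        download(image_id, target)
        fetched += 1
    return fetched
```

One caveat: an interrupted download could leave a partial file that this check would treat as complete. Writing to a temporary name and renaming once the download finishes would avoid that.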
That sounds like it would help with Mapillary. What about if we are processing a local folder of images?
Is processing a local folder of images slow right now? I don't think we're moving around/copying the images right now, or anything like that.
i guess it took less than a minute for: 1090 points, 1286 images, 268 matches
Assigning Images to Points: 100%|███████| 1090/1090 [00:57<00:00, 19.10points/s]
if it stays that fast then something like 80,000 points div 20 points/s div 60 s/min is just over an hour, which is not fast but is reasonable
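For reference, the same estimate as a quick calculation, using both the rounded rate and the measured 19.10 points/s from the progress bar. `estimated_minutes` is just a helper name for this example.

```python
def estimated_minutes(n_points: int, points_per_second: float) -> float:
    """Projected runtime in minutes at a constant processing rate."""
    return n_points / points_per_second / 60


print(round(estimated_minutes(80_000, 20), 1))     # 66.7
print(round(estimated_minutes(80_000, 19.10), 1))  # 69.8
```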
we need a way to interrupt and then resume both `assign_images` and `assign_gvi_to_points`
I am testing a somewhat large area in Indonesia and the projected time for it to complete is more than 60 hours. Are the features stored in the geopackage (or other geo file) being written/updated as the process runs, or does it happen once at the end? It would be fantastic to have a "resume" option so that we could interrupt the process, then restart it but with it ignoring the points for which we have already downloaded/fetched an image.
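A rough sketch of the filtering a "resume" option would need, assuming each point row carries an `image_filename` value once an image has been fetched. That column name and the function `points_to_process` are assumptions for illustration, not the tool's current behavior.

```python
from pathlib import Path
from typing import Iterable, List


def points_to_process(rows: Iterable[dict], image_dir: Path) -> List[dict]:
    """Return rows that still need an image: no filename recorded, or file missing."""
    todo = []
    for row in rows:
        filename = row.get("image_filename")
        if not filename or not (image_dir / filename).exists():
            todo.append(row)
    return todo
```

Note that a point which was checked but had no nearby image looks the same as an unprocessed point here, so it would be retried on every resume unless a separate "checked" flag is recorded.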