AmericanRedCross / street-view-green-view

fix: add LocalImages Image source to assign local images to points using EXIF data #42

Closed. dragonejt closed this pull request 3 months ago.

dragonejt commented 4 months ago

Changes

Rev 2

Rev 3

Testing

danbjoseph commented 4 months ago

Thanks for working on this! I'm super excited that this will soon be an option.

  1. Should there be some inline comments like in `create_points.py`?
  2. When running `python -m src.download_images --help`, it doesn't tell you that the available options for the `IMAGE_SOURCE` argument are `MAPILLARY` and `LOCAL`. Is there a way to include those options in the help output? (See the sketch after this list.)
  3. The same help command also says that the `IMAGES_PATH` default is `/Users/danbjoseph/GitHub/americanredcross/street-view-green-view/data/raw/mapillary`. Is it setting a default folder named "mapillary" regardless of whether the command is for local or Mapillary images? Instead of a default, should we just have it fail and notify the user that they need to set a destination path? If the user accidentally processes two different areas and uses the default folder for both, it could be a confusing mass of images.
  4. Something seems off with the image matching, see #43.
  5. Below is the output from my testing. Is printing `dataframe.head()` to the console confusing for the user?
    
    (.venv) ➜  street-view-green-view git:(9-optional-process-to-do-everything-locally) ✗ python -m src.download_images data/interim/ParkView_points.gpkg LOCAL data/images/ParkView/
    Getting 1090 Images: 100%|██████████████| 1090/1090 [01:21<00:00, 13.31images/s]
        osm_id  ... error
    0  way/6051455  ...  None
    1  way/6051455  ...  None
    2  way/6051455  ...  None
    3  way/6051521  ...  None
    4  way/6051521  ...  None

    [5 rows x 9 columns]

  6. What was the issue with my GeoJSON file, regarding what you said about it not being "the exact same format when imported into the GeoDataFrames"? Here is the interim file with the sampled points: [ParkView_points.gpkg.zip](https://github.com/AmericanRedCross/street-view-green-view/files/15276086/ParkView_points.gpkg.zip)
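
A minimal sketch of how items 2 and 3 might be handled, assuming the CLI is built with typer (the uppercase metavars in the `--help` output suggest typer or click); the parameter names here are illustrative, not the PR's actual code. Typing the source argument as an Enum makes `--help` list the valid choices, and omitting a default makes the destination path required:

    from enum import Enum
    from pathlib import Path

    import typer

    class ImageSource(str, Enum):
        MAPILLARY = "MAPILLARY"
        LOCAL = "LOCAL"

    def main(
        points_file: Path = typer.Argument(..., help="GeoPackage of sampled points."),
        # Enum-typed argument: --help renders the choices as [MAPILLARY|LOCAL].
        image_source: ImageSource = typer.Argument(..., help="Source of imagery."),
        # No default value: typer exits with an error if the destination is
        # omitted, instead of silently reusing data/raw/mapillary for every run.
        images_path: Path = typer.Argument(..., help="Destination folder for images."),
    ):
        ...

    if __name__ == "__main__":
        typer.run(main)
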
dragonejt commented 4 months ago

I'm working on a new revision addressing the above issues:

  1. I am working on adding docstrings for classes and functions in this revision.
  2. I have added the valid options to the argument's help message.
  3. Right now, yes, the default folder name is `data/raw/mapillary`. The new revision will have no default folder name and will error out if one is not provided.
  4. After some testing, I realized that the duplicates are caused by my use of Python multiprocessing and concurrency: when the current thread filters out duplicates, another thread can take an image from the filtered images before the current thread assigns the closest one, resulting in duplicates. I think I will have to avoid concurrency and switch back to a simple for loop (see the sketch after this list).
  5. `dataframe.head()` is the standard pandas way to get a snapshot of a dataframe. My run of the script showed all of the columns, but your run abbreviated the middle columns into `...`; pandas truncates wide frames to fit the terminal, so it likely depends on how wide your terminal window was. (`pd.set_option("display.max_columns", None)` would force all columns to print.)
  6. Ahh, I didn't realize I had to put the file through `create_points.py` first; I ran it directly through `download_images.py`.
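
A single-threaded sketch of the fix described in item 4; the types and function below are illustrative, not the PR's actual `local_images.py` code. The point is that the set of claimed images is only safe to consult and update when one iteration runs at a time:

    from dataclasses import dataclass

    @dataclass
    class Image:
        id: str
        lat: float
        lon: float

    def assign_images(points: list[tuple[float, float]], images: list[Image]) -> dict[int, Image]:
        """Assign each point the closest image not already claimed by an earlier point.

        Must run sequentially: with worker threads, another worker could claim
        an image between the candidate filter and the claimed.add() call,
        which is exactly the race that produced duplicate assignments.
        """
        claimed: set[str] = set()
        assigned: dict[int, Image] = {}
        for i, (lat, lon) in enumerate(points):
            # Consider only images no earlier iteration has claimed.
            candidates = [im for im in images if im.id not in claimed]
            if not candidates:
                break
            # Squared-degree distance is enough to pick a nearest neighbour.
            closest = min(candidates, key=lambda im: (im.lat - lat) ** 2 + (im.lon - lon) ** 2)
            claimed.add(closest.id)
            assigned[i] = closest
        return assigned
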
dragonejt commented 4 months ago

> @dragonejt I haven't had time to review local_images.py in detail yet, but I thought I'd get these comments out since they might save you some work in response to Dan's review.

> > After some testing, I realized that the duplicates are caused by my use of Python multiprocessing and concurrency: when the current thread filters out duplicates, another thread can take an image from the filtered images before the current thread assigns the closest one, resulting in duplicates. I think I will have to avoid concurrency and switch back to a simple for loop.

> I'm wondering if having logic to keep track of assigned images and skipping them is more complicated than necessary. We could alternatively just assign whichever image is closest to every point, and then deduplicate at the end, which may drop some points.
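
A minimal sketch of the alternative quoted above, with made-up data; the column names are illustrative. Every point gets its nearest image independently (which is safe to parallelize), and duplicate assignments are dropped afterwards, at the cost of losing those points:

    import pandas as pd

    # Toy assignments: img_a was matched to two different points.
    assignments = pd.DataFrame(
        {
            "osm_id": ["way/6051455", "way/6051455", "way/6051521"],
            "image_id": ["img_a", "img_a", "img_b"],
        }
    )

    # Keep one point per image; the other point loses its match entirely.
    deduped = assignments.drop_duplicates(subset="image_id", keep="first")
    print(deduped)
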

The main tradeoff between the two options is, as you said, simplicity versus the loss of some data points. If deduplicating by dropping points is acceptable, it raises further questions, such as how many points we can afford to drop. What happens if each point has a duplicate image, or if multiple points share a single image? Is it OK to end up with half our dataset gone?

Performance-wise, there should be a negligible difference between the two options. Switching back to concurrency would spare us a set of assigned images in memory, but the deduplication pass at the end is generally at least O(n), if not O(n^2). In any case, performance isn't top of mind for us right now.

Regardless, discussion of how we want to pursue deduplication going forward is definitely out of scope for this PR. We should open a new GitHub issue to decide how to proceed. I'm going to leave the current deduplication method as-is for this PR.

dragonejt commented 4 months ago

I'm considering renaming `max_distance` to `mini_dist` to be more consistent with `create_points.py`.