AmericanRedCross / street-view-green-view

fix: add LocalImages Image source to assign local images to points using EXIF data #42

Closed. dragonejt closed this pull request 3 months ago.

dragonejt commented 4 months ago

Changes

Rev 2

Rev 3

Testing

danbjoseph commented 4 months ago

Thanks for working on this! I'm super excited that this will soon be an option.

  1. Should there be some inline comments like in `create_points.py`?
  2. When running `python -m src.download_images --help`, it doesn't tell you that the available options for the `IMAGE_SOURCE` argument are `MAPILLARY` and `LOCAL`. Is there a way to include those options in the help output? (See the sketch after this list.)
  3. The same help command also says that the `IMAGES_PATH` default is `/Users/danbjoseph/GitHub/americanredcross/street-view-green-view/data/raw/mapillary`. Is it setting a default folder named "mapillary" regardless of whether the command is for local or Mapillary images? Instead of a default, should we just have it fail and notify the user that they need to set a destination path? If the user accidentally processes two different areas and uses the default folder for both, it could be a confusing mass of images.
  4. Something seems off with the image matching, see #43.
  5. Below is the output from my testing. Is printing `dataframe.head()` to the console confusing for the user?
    
    (.venv) ➜  street-view-green-view git:(9-optional-process-to-do-everything-locally) ✗ python -m src.download_images data/interim/ParkView_points.gpkg LOCAL data/images/ParkView/
    Getting 1090 Images: 100%|██████████████| 1090/1090 [01:21<00:00, 13.31images/s]
        osm_id  ... error
    0  way/6051455  ...  None
    1  way/6051455  ...  None
    2  way/6051455  ...  None
    3  way/6051521  ...  None
    4  way/6051521  ...  None

    [5 rows x 9 columns]

  6. What was the issue with my GeoJSON file, regarding what you said about it not being "the exact same format when imported into the GeoDataFrames"? Here is the interim file with the sampled points: [ParkView_points.gpkg.zip](https://github.com/AmericanRedCross/street-view-green-view/files/15276086/ParkView_points.gpkg.zip)
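
A minimal sketch of how items 2 and 3 might be handled, assuming the CLI is built with typer (the uppercase metavars in the `--help` output suggest typer or click); the parameter names here are illustrative, not the PR's actual code. Typing the source argument as an Enum makes `--help` list the valid choices, and omitting a default makes the destination path required:

    from enum import Enum
    from pathlib import Path

    import typer

    class ImageSource(str, Enum):
        MAPILLARY = "MAPILLARY"
        LOCAL = "LOCAL"

    def main(
        points_file: Path = typer.Argument(..., help="GeoPackage of sampled points."),
        # Enum-typed argument: --help renders the choices as [MAPILLARY|LOCAL].
        image_source: ImageSource = typer.Argument(..., help="Source of imagery."),
        # No default value: typer exits with an error if the destination is
        # omitted, instead of silently reusing data/raw/mapillary for every run.
        images_path: Path = typer.Argument(..., help="Destination folder for images."),
    ):
        ...

    if __name__ == "__main__":
        typer.run(main)
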
dragonejt commented 4 months ago

I'm working on a new revision addressing the above issues:

  1. I am working on adding docstrings for classes and functions in this revision.
  2. I have added the valid options to the argument's help message.
  3. Right now, yes, the default folder name is `data/raw/mapillary`. The new revision will have no default folder name and will error out if one is not provided.
  4. After some testing, I realized that the duplicates are caused by my use of Python multiprocessing and concurrency: when the current thread filters out duplicates, another thread can take an image from the filtered images before the current thread assigns the closest one, resulting in duplicates. I think I will have to avoid concurrency and switch back to a simple for loop (see the sketch after this list).
  5. `dataframe.head()` is the standard pandas way to get a snapshot of a dataframe. My run of the script showed all of the columns, but your run abbreviated the middle columns into `...`; pandas truncates wide frames to fit the terminal, so it likely depends on how wide your terminal window was. (`pd.set_option("display.max_columns", None)` would force all columns to print.)
  6. Ahh, I didn't realize I had to put the file through `create_points.py` first; I ran it directly through `download_images.py`.
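
A single-threaded sketch of the fix described in item 4; the types and function below are illustrative, not the PR's actual `local_images.py` code. The point is that the set of claimed images is only safe to consult and update when one iteration runs at a time:

    from dataclasses import dataclass

    @dataclass
    class Image:
        id: str
        lat: float
        lon: float

    def assign_images(points: list[tuple[float, float]], images: list[Image]) -> dict[int, Image]:
        """Assign each point the closest image not already claimed by an earlier point.

        Must run sequentially: with worker threads, another worker could claim
        an image between the candidate filter and the claimed.add() call,
        which is exactly the race that produced duplicate assignments.
        """
        claimed: set[str] = set()
        assigned: dict[int, Image] = {}
        for i, (lat, lon) in enumerate(points):
            # Consider only images no earlier iteration has claimed.
            candidates = [im for im in images if im.id not in claimed]
            if not candidates:
                break
            # Squared-degree distance is enough to pick a nearest neighbour.
            closest = min(candidates, key=lambda im: (im.lat - lat) ** 2 + (im.lon - lon) ** 2)
            claimed.add(closest.id)
            assigned[i] = closest
        return assigned
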
dragonejt commented 4 months ago

> @dragonejt I haven't had time to review local_images.py in detail yet, but I thought I'd get these comments out since they might save you some work in response to Dan's review.

> > After some testing, I realized that the duplicates are caused by my use of Python multiprocessing and concurrency: when the current thread filters out duplicates, another thread can take an image from the filtered images before the current thread assigns the closest one, resulting in duplicates. I think I will have to avoid concurrency and switch back to a simple for loop.

> I'm wondering if having logic to keep track of assigned images and skipping them is more complicated than necessary. We could alternatively just assign whichever image is closest to every point, and then deduplicate at the end, which may drop some points.
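
A minimal sketch of the alternative quoted above, with made-up data; the column names are illustrative. Every point gets its nearest image independently (which is safe to parallelize), and duplicate assignments are dropped afterwards, at the cost of losing those points:

    import pandas as pd

    # Toy assignments: img_a was matched to two different points.
    assignments = pd.DataFrame(
        {
            "osm_id": ["way/6051455", "way/6051455", "way/6051521"],
            "image_id": ["img_a", "img_a", "img_b"],
        }
    )

    # Keep one point per image; the other point loses its match entirely.
    deduped = assignments.drop_duplicates(subset="image_id", keep="first")
    print(deduped)
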

The main tradeoff between the two options is, as you said, simplicity versus the loss of some data points. If deduplicating by dropping points is acceptable, it raises further questions, such as how many points we can afford to drop. What happens if each point has a duplicate image, or if multiple points share a single image? Is it OK to end up with half our dataset gone?

Performance-wise, there should be a negligible difference between the two options. Switching back to concurrency would spare us a set of assigned images in memory, but the deduplication pass at the end is generally at least O(n), if not O(n^2). In any case, performance isn't top of mind for us right now.

Regardless, discussion of how we want to pursue deduplication going forward is definitely out of scope for this PR. We should open a new GitHub issue to decide how to proceed. I'm going to leave the current deduplication method as-is for this PR.

dragonejt commented 4 months ago

I'm considering renaming `max_distance` to `mini_dist` to be more consistent with `create_points.py`.