Closed · dragonejt closed this 3 months ago
Thanks for working on this! I'm super excited that this will soon be an option.

- `create_point.py`
- Running `python -m src.download_images --help` doesn't tell you that the available options for the `IMAGE_SOURCE` argument are `MAPILLARY` and `LOCAL`. Is there a way to include those options in the help command output?
- The `IMAGES_PATH` default is `/Users/danbjoseph/GitHub/americanredcross/street-view-green-view/data/raw/mapillary`. Is it setting a default folder named "mapillary" regardless of whether the command is for local or Mapillary images? Instead of a default, should we just have it fail and notify the user that they need to set a destination path? If the user accidentally processes 2 different areas and uses the default folder for both, it could be a confusing mass of images.
- Is printing `dataframe.head()` to the console confusing for the user?
```
(.venv) ➜ street-view-green-view git:(9-optional-process-to-do-everything-locally) ✗ python -m src.download_images data/interim/ParkView_points.gpkg LOCAL data/images/ParkView/
Getting 1090 Images: 100%|██████████████| 1090/1090 [01:21<00:00, 13.31images/s]
        osm_id  ... error
0  way/6051455  ...  None
1  way/6051455  ...  None
2  way/6051455  ...  None
3  way/6051521  ...  None
4  way/6051521  ...  None

[5 rows x 9 columns]
```
6. What was the issue with my geojson file? You mentioned "it wasn't the exact same format when imported into the GeoDataFrames". Here is the interim file with the sampled points: [ParkView_points.gpkg.zip](https://github.com/AmericanRedCross/street-view-green-view/files/15276086/ParkView_points.gpkg.zip)
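On the `--help` question: one way to surface the allowed values is to declare the argument with explicit choices, which most CLI libraries echo into the help text automatically. A minimal sketch with stdlib `argparse` (illustrative only; the project's real CLI may be built on a different library such as Typer, where an `Enum`-typed parameter achieves the same thing):

```python
import argparse

# Hypothetical sketch: restricting IMAGE_SOURCE to known values so that
# --help lists them in the usage line automatically.
parser = argparse.ArgumentParser(prog="download_images")
parser.add_argument("points_file", help="Path to the sampled points file (.gpkg)")
parser.add_argument(
    "image_source",
    choices=["MAPILLARY", "LOCAL"],  # choices are echoed in usage/help text
    help="Where to get images from",
)
parser.add_argument("images_path", help="Destination folder for the images")

help_text = parser.format_help()
print(help_text)  # usage line includes {MAPILLARY,LOCAL}
```

Invalid values are also rejected with an error message listing the valid options, which addresses the discoverability concern without extra documentation.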
I'm working on a new revision addressing the above issues, including the `data/raw/mapillary` default. The new revision will not have a default folder name and should error out if one is not provided.

`dataframe.head()` is kind of the standard pandas way to get a snapshot of your dataframe. My run of the script resulted in all of the columns being shown above, but somehow your run of the script abbreviated the middle columns into `...`. It may have to do with how wide your terminal window was when you ran the script.

I didn't run `create_points.py` first, and ran your points file directly with `download_images.py`.
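For reference, the `...` truncation is governed by pandas display options, which default to the detected terminal width. Assuming pandas is available, this shows the effect explicitly (the column names here are made up for the demo):

```python
import pandas as pd

# A deliberately wide frame: 9 columns, 3 rows.
df = pd.DataFrame({f"col{i}": range(3) for i in range(9)})

# With tight display limits, pandas elides the middle columns into "...".
pd.set_option("display.max_columns", 5)
pd.set_option("display.width", 40)
narrow = repr(df)

# Raising the limits shows every column regardless of terminal size.
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 1000)
wide = repr(df)

print(narrow)
print(wide)
```

Setting `display.max_columns` to `None` in the script (or printing with `df.to_string()`) would make the output consistent across users' terminals.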
@dragonejt I haven't had time to review `local_images.py` in detail yet, but I thought I'd get these comments out since they might save you some work in response to Dan's review.

After some testing, I realized that the duplicates are because of my use of Python multiprocessing and concurrency. When the current thread filters out duplicates, another thread can take an image from the filtered images before the current thread assigns the closest image, thus resulting in duplicates. I think I will have to avoid concurrency and switch back to a simple for loop to avoid duplicates.
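The race disappears with a single-threaded pass, because the check ("is this image taken?") and the update ("mark it taken") happen with no other worker in between. A simplified sketch of that idea; the data structures and names are illustrative, not the PR's actual code:

```python
import math

def closest_unassigned(points, images):
    """Assign each point the nearest image that no earlier point has taken.

    points: list of (x, y) tuples; images: dict of image_id -> (x, y).
    Returns dict of point index -> image_id (None if all images are taken).
    Sequential on purpose: the shared `assigned` set is only safe to
    check-then-update when a single thread does both steps.
    """
    assigned = set()
    result = {}
    for i, (px, py) in enumerate(points):
        best_id, best_dist = None, math.inf
        for image_id, (ix, iy) in images.items():
            if image_id in assigned:
                continue  # skip images already claimed by earlier points
            dist = math.hypot(px - ix, py - iy)
            if dist < best_dist:
                best_id, best_dist = image_id, dist
        if best_id is not None:
            assigned.add(best_id)
        result[i] = best_id
    return result
```

Under concurrency, two workers can both pass the `if image_id in assigned` check before either adds to the set, which is exactly the duplicate-assignment race described above.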
I'm wondering if having logic to keep track of assigned images and skipping them is more complicated than necessary. We could alternatively just assign whichever image is closest to every point, and then deduplicate at the end, which may drop some points.
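Assuming the per-point results land in a DataFrame with an image id and a distance column (column names here are hypothetical), the assign-then-deduplicate variant is short: keep, for each image, the point it is closest to, and drop the rest.

```python
import pandas as pd

# Hypothetical layout: one row per point, its closest image, and the distance.
assignments = pd.DataFrame(
    {
        "point_id": [0, 1, 2],
        "image_id": ["a", "a", "b"],  # image "a" was claimed by two points
        "distance_m": [5.0, 2.0, 7.0],
    }
)

# Per image, keep the row with the smallest distance; other points are dropped.
deduped = (
    assignments.sort_values("distance_m")
    .drop_duplicates(subset="image_id", keep="first")
    .sort_values("point_id")
)
print(deduped)
```

This also makes the cost of the tradeoff measurable: `len(assignments) - len(deduped)` is exactly how many points were lost to deduplication.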
The main tradeoff between the two options is, as you said, simplicity vs. loss of some datapoints. There are also more questions to answer if deduplicating by dropping points is acceptable, such as how many points are OK to drop. What happens if each point has a duplicate image, or if multiple points share a single image? Is it OK to end up with half our dataset gone?
Performance-wise, there should be a negligible difference between the two options. Although we could switch back to using concurrency and would no longer have a set of assigned images taking up memory, the deduplication pass would generally be at least an O(n) operation, if not O(n^2). Performance isn't top of mind for us right now anyway.
Regardless, discussion of how we want to pursue deduplication moving forward is definitely out of scope for this PR. We should open a new GH issue and see how we want to proceed. I'm going to leave the current deduplication method as-is for this PR.
I'm considering changing `max_distance` to `mini_dist` to be more consistent with `create_points.py`.
**Changes**

- Add the `pillow` library to get EXIF location data from images
- Create an `images` subpackage and add a new abstract base class (interface) `ImageSource`, which has the abstract method `get_image_from_coordinates`
- Move the `Mapillary` class from `download_images.py` to `images/mapillary.py`
- Create a new `LocalImages` class from `ImageSource`, which assigns local images to points using EXIF data
- Add an `image_source` CLI argument, which currently can be `MAPILLARY` or `LOCAL` to use the `Mapillary` or `LocalImages` `ImageSource`s respectively
- Set up logging in `utils.py`, so that when I set verbosity in a CLI command, logs in `mapillary.py` or `local_images.py` will use the same verbosity. Tell me if this is not the best way to achieve this.

**Rev 2**

**Rev 3**

- `source`
- `max_distance` between point and image location

**Testing**

Running

```
python -m src.download_images data/interim/Three_Rivers_Michigan_USA_points.gpkg LOCAL data/raw/images/
```

with a few of Dan's GoPro images from Dropbox results in the following `dataframe.head()`:

I used `Three_Rivers_Michigan_USA_points.gpkg` for correlation with Dan's images. This is why the script still says it is processing 4960 points, and why the distance residual is so far away. I wasn't able to use Dan's corresponding geojson file, as it wasn't the exact same format when imported into the GeoDataFrames.
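The `ImageSource` interface described under Changes might look roughly like this sketch. The class and method names follow the PR description, but the signature, return type, and the toy subclass body are guesses for illustration, not the actual code:

```python
from abc import ABC, abstractmethod

class ImageSource(ABC):
    """Interface for anything that can supply an image near a coordinate."""

    @abstractmethod
    def get_image_from_coordinates(self, latitude: float, longitude: float) -> dict:
        """Return metadata (e.g. image path, distance, error) for the
        image closest to the given point."""

class LocalImages(ImageSource):
    """Toy implementation: pretends every query matches one local file."""

    def get_image_from_coordinates(self, latitude: float, longitude: float) -> dict:
        # A real implementation would search EXIF GPS data of files on disk.
        return {"image_path": "images/example.jpg", "error": None}

source = LocalImages()
result = source.get_image_from_coordinates(42.0, -85.6)
```

Because `get_image_from_coordinates` is abstract, `ImageSource` itself cannot be instantiated, so the `MAPILLARY`/`LOCAL` CLI switch reduces to choosing which concrete subclass to construct.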