dragonejt commented 4 months ago

Changes

Add Mapillary image downloader script that downloads an image from coordinates (a shapely Point).
Add GeoPandas dataframe parser class that can extract a list of Points from the .gpkg GeoDataFrame file outputted by create_points.py
Replace requirements.txt with a dependencies section in pyproject.toml, update all installation commands to refer to pyproject.toml
Works towards #7

Testing

Running python -m src.mapillary "[MAPILLARY_CLIENT_TOKEN]" [POINTS_FILE] [IMAGE_PATH] succeeds on the Three_Rivers_Michigan_USA_points.gpkg generated by create_points.py and downloads images to the selected image path

Risks

The Mapillary API sometimes returns a query timeout exception (HTTP 400), which resolves if that query is tried again a few more times. Sample Query

{
"error": {
    "message": "Timeout",
    "type": "OAuthException",
    "code": -2,
    "error_subcode": 3404014,
    "is_transient": false,
    "error_user_title": "Query Timeout",
    "error_user_msg": "The query timed out, please retry the query",
    "fbtrace_id": "A6w92zlxmQCqBiboCh9K8te"
}
}

I have added 3 retries to each Mapillary API call, but in large lists of coordinates I am still seeing the query timeout exceptions.

danbjoseph commented 4 months ago

i added a few things in a docs section for this step, please edit as needed.
when getting setup i ran into some issues. however, i don't think it's related to any changes introduced in this PR. see #19
can you write the image file name to each point in a new field? and also record if there wasn't an image found or the API request failed?
is there any other metadata available from Mapillary for the image (or from the image search) that we should record?

dragonejt commented 4 months ago

I commented on bullet number 2 in the related issue.

So the mapillary downloader should also write a file called image_data.json to the same folder. That image_data.json contains a list of the following:

{
"latitude": 41.93223076003517, // latitude of requested point from create_points.py
"longitude": -85.65572922086078, // longitude of requested point from create_points.py
"image_id": "332637408194005" | null, // First image id from mapillary API for requested point, null if no images found in that area or API request failed (timeout)
"image_path": "data/raw/mapillary/332637408194005.jpeg" | null // Downloaded image path, null if no images found in that area or API request failed (timeout)
}

Is this what you wanted for bullet number 3?

This is the list of metadata we can get about an image, I'm not sure what metadata we would want to add.

danbjoseph commented 4 months ago

ah, i interrupted the process before it fetched every image, so it didn't write the json. for much larger point sets, will it be a problem that it is storing everything for the json in memory until the end and writing it all at once?

at the end of the whole process, i think it will help the user to troubleshoot and give them more flexibility in analysis if more info is in one place inside the geopackage. right now each point in the geopackage has fid and osm_id properties - the osm_id being the identifier of the road segment the point was sampled from. i'd like to add an image_id property. can we also write img_lat and img_lon properties? since the sampled point and the image. location will likely be different by some meters and i'm not yet sure in the analysis and visualization phase which coordinates we will want to use. i think we can ignore the other metadata from Mapillary at this time. thank you.

dragonejt commented 4 months ago

Yeah the image_data.json will not appear if it is interrupted before finishing. We could maybe try/catch any KeyboardInterrupt (Ctrl+C) to write the image_data.json before actually exiting.

Currently I extract the list of points (longitude/latitude) from the geopackage before sending them to the mapillary client, so by the time they get to the mapillary client they already don't have access to the original geopackage. Let me think of a way to send the entire geopackage over.

danbjoseph commented 4 months ago

@jayqi do you have any suggestions for how to track each point through this step so that the image details can be added back to the correct point feature in the geodata?

dragonejt commented 4 months ago

ah, i interrupted the process before it fetched every image, so it didn't write the json. for much larger point sets, will it be a problem that it is storing everything for the json in memory until the end and writing it all at once?

at the end of the whole process, i think it will help the user to troubleshoot and give them more flexibility in analysis if more info is in one place inside the geopackage. right now each point in the geopackage has fid and osm_id properties - the osm_id being the identifier of the road segment the point was sampled from. i'd like to add an image_id property. can we also write img_lat and img_lon properties? since the sampled point and the image. location will likely be different by some meters and i'm not yet sure in the analysis and visualization phase which coordinates we will want to use. i think we can ignore the other metadata from Mapillary at this time. thank you.

Actually, this may influence the separate GeoPandasParser vs put it all in the same function debate. Did you want the Mapillary Image downloader to write to a new (next-step) file, or update the same file?

danbjoseph commented 4 months ago

[edited based on @jayqi response below] @dragonejt ~it can~ write a new file ~or update the same file,~ but ~in either case~ it does need to carry through data from the previous file.

jayqi commented 4 months ago

I strongly recommend that we write a new file. The workflow is much easier to reason about and reproduce if we treat the data outputs at each step as immutable, and the overall workflow as a directed acyclic graph (DAG). Here's some discussion about why this is a useful framing.

dragonejt commented 4 months ago

[edited based on @jayqi response below] @dragonejt ~it can~ write a new file ~or update the same file,~ but ~in either case~ it does need to carry through data from the previous file.

So I did some experimentation with this, and GeoDataFrames only support having one column with geospatial data. This means that I cannot add the coordinates of the downloaded image as another column in the GeoDataFrame. Adding the image id and image path work fine as they are both strings. Do we want to ignore the coordinates of the image then, or encode it as strings/numbers?

The error I was getting is ValueError: GeoDataFrame contains multiple geometry columns but GeoDataFrame.to_file supports only a single geometry column. Use a GeoDataFrame.to_parquet or GeoDataFrame.to_feather, drop additional geometry columns or convert them to a supported format like a well-known text (WKT) usingGeoSeries.to_wkt().

danbjoseph commented 4 months ago

It also appears that multiple coordinates will have the same closest image from the Mapillary API, is this something that we are OK with?

Looks like you edited to remove this? We don't want to use an image more than once.

In our Mapillary query we should be restricting the search to within 10m of the sample point and the sample points are 20 m apart. However, they are 20m apart along the road line features (not a plane) and roads may curve or 2 roads may be closer than 20m. PXL_20240306_140530383

Regardless of the above, because of the complexities of recording points on the lumpy, curved surface of the Earth and projecting to a 2D representation - I don't think we could rely on the bounding box to eliminate the possibilities of duplicate images. I reprojected to EPSG:32616, WGS 84, UTM 16N (given the location of our test data),

So that I could do geo-spatial functions in meters instead of degrees... and in addition to overlaps of the 10m buffers of each point at intersections and where roads are otherwise close to each other, the 20m as measured in the createPoints step isn't 20m when calculated with this method (see the slight overlap between all adjacent points).

Can you calculate the distance from the sample point to the image point as you download the images. Then when you are done go through and for each image key that is matched to more than one sample point, throw out the image key for all but the sample point + image key with the shortest distance? The coordinates are in EPSG:4326, WGS84 used in GPS which is global and has a unit of degrees. For more definitive distance measurements we would need to transform to a projection such as EPSG:32618, WGS84 / UTM zone 18N (okay for DC) and which has a unit of meters. But that's a pain and we don't know which projection to use each time as we are making a global tool. It is possible to calculate distance in EPSG:4326, it is not super accurate over long distances but for deciding which pair of points is closest, I think it should be fine.

geopy can do this, see: https://geopy.readthedocs.io/en/latest/#module-geopy.distance
this guy might be doing it in pandas? https://www.tjansson.dk/2021/03/vectorized-gps-distance-speed-calculation-for-pandas/

For example, in the below mock-up (ignore the NULL values in the img_lat and img_lon columns - those would be decimal values) it would be after downloading all the images. Then for all the rows that have img_file=a.png it would look at the dist value (the calculated number of meters between the sample point and the image point) and delete the img attributes for rows 1 and 3 (or somehow mark them as a duplicate). Then do the same for img_file=c.png keeping the img details for row 5 (the smaller distance). And so on...?

For the image point, can you encode it as a string/numbers? A WKT format string would look like POINT(0 0) and would be one column, or it could be separate string values in img_lat and img_lon columns.

dragonejt commented 4 months ago

Looks like you edited to remove this? We don't want to use an image more than once.

For that, I tested again and there was an issue with my code, now it only uses one image per set of coordinates, and returns nothing if there isn't an image within that bounding box. I can take a look at finding the closest image, since currently it returns all images within the bounding box, and doesn't sort them by closeness to the original point.

dragonejt commented 4 months ago

Hey Dan, so the current way I'm avoiding to have the same picture for multiple points is to store a set of all downloaded image IDs while I'm downloading, and filtering out the existing images before searching for the closest image. I think this is a simpler method and avoids the two passes, However, with this method, while that image may be the closest unique image for a point, that point may not be the closest point to that image. Is this method OK, or do you want to ensure that an image is tied to its closest point (probably requiring a second pass)?

danbjoseph commented 4 months ago

Evan, if you've got that method working, let's stick with it and continue on towards a completed MVP. We can log an issue to remember to take another look at the method later. Thanks!

dragonejt commented 3 months ago

Implementation changes have been committed, will do documentation changes sometime later

danbjoseph commented 3 months ago

first glance looks good!

i was worried because it seemed like we were missing a lot of images (the white dots on top of the blue represent sample points with no associated image)

but then i filtered for 360 images on Mapillary and it looks like the coverage lines up. there's a gap in the 360 coverage in the north half of the town.

danbjoseph commented 3 months ago

i think using a .env.example file makes it less likely someone accidentally commits their token, have added that change

dragonejt commented 3 months ago

I'll work on documentation changes today

AmericanRedCross / street-view-green-view

feat: Add Mapillary Image downloader, add GeoPandas Parser #18

Changes

Testing

Risks