HakaiInstitute / cde

https://explore.cioos.ca

Handling profiles with more than one coordinate #262

Open n-a-t-e opened 2 years ago

n-a-t-e commented 2 years ago

Right now we plot a single point on the map for every profile we get from an ERDDAP dataset. There should usually be a single coordinate per profile, or, if the latitude/longitude were set from a GPS, the multiple points will usually be close enough together that it's not an issue.

The problem is that we get the lat/lon range for the profile and then put the lat_min and lon_min together to make a coordinate, but it's possible that this coordinate doesn't exist in the profile!
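
A minimal sketch of why the combined minimum can be a coordinate that is not in the profile. The three points are made up for illustration and do not come from any dataset; `min_corner` is a hypothetical helper name.

```python
# Hypothetical (latitude, longitude) points for one profile; not real data.
points = [(48.0, -123.0), (49.0, -125.0), (50.0, -124.0)]


def min_corner(points):
    """Combine the minimum latitude with the minimum longitude."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (min(lats), min(lons))


corner = min_corner(points)
# The corner mixes the latitude of one point with the longitude of another,
# so it does not have to be one of the measured positions.
corner_is_real_point = corner in points
```

Here the minimum latitude comes from the first point and the minimum longitude from the second, so the plotted corner matches neither.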

This is an extreme example where there's an error in the data and the profile has 4 points that are very far apart: [image: Screen Shot 2022-05-09 at 8 30 48 AM]

This results in a point in CDE near Lake Cowichan, which comes from the combined latitude_min and longitude_min of these four points.

We have 348 profiles in 14 datasets where the latitude or longitude range is greater than 0.1 degrees, which isn't really that many, and most of these are probably due to errors in the source data.

This bug would affect downloading: if a user tries to download one of these 348 profiles, they could get an empty download or a download with data missing.

n-a-t-e commented 2 years ago

Here are two solutions I can think of:

n-a-t-e commented 2 years ago

After discussing with @JessyBarrette and @pramod-thupaki, it sounds like dropping these bad data and encouraging the data providers to fix their data is the best way forward.

n-a-t-e commented 2 years ago

Instead of doing orderByMin and orderByMax for lat/lon, we could use a distinct() query, which is also much faster: https://data.cioospacific.ca/erddap/tabledap/IOS_CTD_Profiles.htmlTable?profile,latitude,longitude&distinct()

Then we can filter out profiles with more than 1 coordinate
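
A rough sketch of that filtering step, assuming the distinct() result has been fetched as ERDDAP `.csv` text (header row followed by a units row). The sample rows and the `split_profiles` name are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of a `profile,latitude,longitude&distinct()` response.
SAMPLE_CSV = """profile,latitude,longitude
,degrees_north,degrees_east
A,48.5,-123.5
B,49.0,-124.0
B,49.3,-126.0
C,50.1,-125.2
"""


def split_profiles(csv_text):
    """Return (single_coordinate_profiles, multi_coordinate_profiles)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    next(reader)  # skip the units row that follows the header
    coords = defaultdict(set)
    for row in reader:
        point = (float(row["latitude"]), float(row["longitude"]))
        coords[row["profile"]].add(point)
    single = sorted(p for p, pts in coords.items() if len(pts) == 1)
    multi = sorted(p for p, pts in coords.items() if len(pts) > 1)
    return single, multi


single, multi = split_profiles(SAMPLE_CSV)
```

Profiles in `multi` would be the ones to drop or flag for the data provider.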

n-a-t-e commented 2 years ago

This CSV shows the profiles that are affected:

profiles_with_multiple_coordinates.csv

JessyBarrette commented 2 years ago

If it is faster, we may as well. Though I would add the timeseries_id variable.

On Mon, May 9, 2022 at 4:44 p.m., Nate @.***> wrote:

profiles_with_multiple_coordinates.csv https://github.com/HakaiInstitute/cde/files/8655208/profiles_with_multiple_coordinates.csv


n-a-t-e commented 2 years ago

If it is faster. May as well. Though I would add the timeseries_id variable

Yes, the example is of a profile, so I used the profile variable. If it were a timeseries I would use the timeseries variable, or both if it were a TimeSeriesProfile.

n-a-t-e commented 2 years ago

If the distance between points is very small, e.g. for buoy swing in a dataset where preciseLat isn't used, we can probably leave these datasets as they are. @pramod-thupaki suggested a 300 m cutoff (if the distance between points is > 300 m, we remove that profile).

This will work most of the time and I think it's fine for now.
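
One way the cutoff could be checked, as a sketch only: compute the largest great-circle distance between any two coordinates of a profile and compare it to 300 m. The haversine formula and the mean Earth radius are my choices here, and the function names are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def max_spread_m(points):
    """Largest pairwise distance; 0.0 when there are fewer than two points."""
    return max((haversine_m(a, b) for a, b in combinations(points, 2)), default=0.0)


def keep_profile(points, cutoff_m=300):
    """Keep the profile only if its coordinates stay within the cutoff."""
    return max_spread_m(points) <= cutoff_m
```

The pairwise loop is quadratic in the number of distinct coordinates, which should be fine for the small counts a distinct() query returns per profile.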

This is the PRIMED buoy. It has a swing area of a bit under 300 m. I made up the swing points in blue for now, but the real ones do look similar to this. These "swing points" don't appear in CDE. The problem is that if someone zooms in really far and clicks download, they might expect to get all the data for the PRIMED buoy, but because the download area is a lat/long box related to the current zoom level, they might get just a fraction of what's available.

To make matters worse, if we are using the "first unique lat/long of the profile" to set the point in CDE, the red dot would more realistically be located where one of those blue dots is, not in the center.

[image: Screen Shot 2022-05-10 at 1 04 34 PM]

n-a-t-e commented 2 years ago

I am seeing now in CDE that downloading data from the PRIMED buoy fails when zoomed in, and it likely gives a different amount of data depending on the zoom level. This is probably a pretty widespread issue for "profiles" with >1 coordinate that don't give us a nominal location.

pramod-thupaki commented 2 years ago

BTW, the 300 m cutoff is pretty arbitrary. We need to find a better way of addressing buoys that have a large watch circle and where the data includes lat/lon as timeseries. One solution could be to make nominal lat/lon a required field, but that is unlikely to work and we would lose a lot of data while these fields are missing. Another solution could be to handle the representation and download as we would for trajectories, which would have visualization/UI implications.

n-a-t-e commented 2 years ago

Also consider this situation, where a user creates a download box that they think includes all of the PRIMED buoy data, but would end up with only a fraction of it. (Blue dots are the swing area, not displayed to the user.)

[image: Screen Shot 2022-05-10 at 1 27 45 PM]

JessyBarrette commented 2 years ago

I think this is an issue that would be present within ERDDAP. I don't think it's worth trying to fix it. That same issue would be reflected within the Explorer and download.


n-a-t-e commented 2 years ago

I think I was afraid of using profile ID in the download because it wouldn't scale well in the front end, e.g. sending a list of 10k profile IDs to download could get slow. But it might work in the backend, where we take the shape the user has created, see which profiles' ranges touch that area, and then download from ERDDAP using those profile IDs instead of lat/long.
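
A sketch of the "see which profiles' ranges touch that area" step, assuming the user's shape is simplified to a lat/long box and that no box crosses the antimeridian. `Box`, `boxes_touch` and `profiles_in_selection` are hypothetical names, and the profile ranges in the usage lines are made up.

```python
from typing import NamedTuple


class Box(NamedTuple):
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float


def boxes_touch(a, b):
    """True if the two boxes overlap or share an edge."""
    return (
        a.lat_min <= b.lat_max
        and b.lat_min <= a.lat_max
        and a.lon_min <= b.lon_max
        and b.lon_min <= a.lon_max
    )


def profiles_in_selection(selection, profile_ranges):
    """Return IDs of profiles whose lat/long range touches the selection."""
    return [pid for pid, box in profile_ranges.items() if boxes_touch(selection, box)]


# Hypothetical profile ranges and user selection.
ranges = {
    "P1": Box(48.0, 48.002, -123.002, -123.0),  # small swing area
    "P2": Box(50.0, 50.0, -130.0, -130.0),      # single coordinate, far away
}
selected = profiles_in_selection(Box(48.001, 48.5, -123.5, -123.001), ranges)
```

Because the test is on the whole range, a profile is selected even when the box covers only part of its swing area, which is the behaviour the download needs.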

OK, one issue with this solution is that most web servers have an 8k URL limit, so when we go to download from ERDDAP we would be limited to ~800 profiles per request. We could do multiple requests if we had to.

Station Papa alone has 6k profile IDs, so the downloader might take 10 requests to compile all this data (not that the user would know).
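
A sketch of how the IDs might be split into several requests that each stay under a URL length limit. Matching several IDs with a single regex constraint (`profile=~"(A|B)"`) is an assumption about how the request would be built, the IDs are assumed to contain no regex metacharacters, and `build_url` and `batch_ids` are hypothetical names. No request is actually sent.

```python
from urllib.parse import quote


def build_url(base_url, variables, ids):
    """Build a tabledap-style URL with a regex constraint on the profile ID."""
    pattern = "(" + "|".join(ids) + ")"
    constraint = "profile=~" + quote('"' + pattern + '"')
    return f"{base_url}?{','.join(variables)}&{constraint}"


def batch_ids(ids, base_url, variables, max_url_len=8000):
    """Group IDs so that each request URL stays within max_url_len characters."""
    batches, current = [], []
    for pid in ids:
        candidate = current + [pid]
        if current and len(build_url(base_url, variables, candidate)) > max_url_len:
            # Adding this ID would make the URL too long: start a new batch.
            batches.append(current)
            current = [pid]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

The length is measured on the percent-encoded URL, since that is what the web server sees. The results of the individual requests would then be concatenated before being handed to the user.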

n-a-t-e commented 2 years ago

I think this is an issue that would be present within ERDDAP. I don't think it's worth trying to fix it. That same issue would be reflected within the Explorer and download.

The difference is we've created a "nominal" single point of where this buoy is, and we've chosen it pretty arbitrarily or inaccurately so far (my bad).

Here's an example of how bad it is right now: the big circle is where we show the point in CDE (based wrongly on latitude_min and longitude_min), and the little dots are the actual data points. The swing area is ~250 m. Note that the big dot isn't even a point in the dataset.

[image: Screen Shot 2022-05-10 at 1 41 46 PM]

JessyBarrette commented 2 years ago

Is this a projection issue? As far as I know, we are getting access to the lat/long min/max values, which get converted to a specific value to be presented on the map. I'm not sure what that value is. The average of the min/max lat/long?

JessyBarrette commented 2 years ago

Is it a projection issue otherwise?

n-a-t-e commented 2 years ago

Is this a projection issue? As far as I know, we are getting access to the lat/long min/max values, which get converted to a specific value to be presented on the map. I'm not sure what that value is. The average of the min/max lat/long?

Not a projection issue: we are using the latitude_min and longitude_min as the point to plot in CDE, which might not be a point in the dataset. The "average" point would also not be a point in the dataset, so it would have the same issue downloading when zoomed in, but it would be better for sure.

n-a-t-e commented 2 years ago

Looking through that CSV I posted earlier with offending datasets:

n-a-t-e commented 2 years ago

Here's an example from the historic MEDS buoy data:

The buoy was deployed 7 times. The distance between the first and last deployments is 121.5 km. This info is from deployment metadata:

 C44137     East Scotian Slope     AE   19881130  19881208    41.32    61.35   4500       9        9       SCOTIAN SHELF       
 C44137     East Scotian Slope     AE   19890908  19930622    41.19    61.13   4500    1383     1306       SCOTIAN SHELF       
 C44137     East Scotian Slope     AE   19930703  19950628    41.23    61.42   4500     725      641       SCOTIAN SHELF       
 C44137     East Scotian Slope     AE   19950630  19951102    41.60    60.03   4500     125       87       SCOTIAN SHELF       
 C44137     East Scotian Slope     AE   19960922  19971015    41.65    59.92   4500     388      268       SCOTIAN SHELF       
 C44137     East Scotian Slope     AW   19980625  20030219    41.83    60.94   4500    1700      953       SCOTIAN SHELF       
 C44137     East Scotian Slope     6N   20030605  20220517    42.28    62.00   4000    6921     5247       SCOTIAN SHELF       

So where do we show the C44137 buoy in CDE?

The "fix" I am implementing now is to use this metadata to set latitude and longitude in the ERDDAP dataset, and then create new preciseLat/preciseLon columns in the source data. We will end up with a point for each deployment, so there will be 7 "C44137" points on the map, with names like C44137_19881130.

Without correcting the latitude/longitude based on deployment metadata, there would be 10 lat/long points in the source data.
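
Roughly how an observation could be matched to a deployment, as a sketch and not the actual processing code. It assumes the two date columns in the metadata are the deployment start and end, and the two columns after them are the nominal latitude and longitude, used as listed without interpreting the longitude sign. `deployment_for` is a hypothetical helper.

```python
from datetime import date

# (station, start, end, latitude, longitude) from the first three metadata rows.
DEPLOYMENTS = [
    ("C44137", date(1988, 11, 30), date(1988, 12, 8), 41.32, 61.35),
    ("C44137", date(1989, 9, 8), date(1993, 6, 22), 41.19, 61.13),
    ("C44137", date(1993, 7, 3), date(1995, 6, 28), 41.23, 61.42),
]


def deployment_for(station, obs_date):
    """Return (deployment_name, latitude, longitude), or None if no deployment matches."""
    for st, start, end, lat, lon in DEPLOYMENTS:
        if st == station and start <= obs_date <= end:
            # Name the point after the station and the deployment start date.
            return f"{st}_{start:%Y%m%d}", lat, lon
    return None
```

An observation dated between deployments (for example late June 1993) matches nothing and would need to be handled separately.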

n-a-t-e commented 2 years ago

I have updated the MEDS buoys historic dataset, DFO_MEDS_BUOYS (from CSVs), to use deployment metadata so that multiple lat/lon is no longer an issue for this dataset. Note that this dataset is also called 'Realtime', but I think I will change that now that all buoys use the newer system below. The newer system originally only supported a handful of buoys but now seems to have all of the buoys that are currently reporting, though no historic data. This fixes an issue some users had where they got a different amount of data depending on zoom level. See https://data.cioospacific.ca/erddap/tabledap/DFO_MEDS_BUOYS.html

ECCC_MSC_BUOYS uses ECCC's newer 'SWOB' system and uses the nominal lat/long for latitude and longitude, while providing crnt_buoy_lat, crnt_buoy_lon for measured values. So multiple lat/lon isn't an issue in this dataset. See https://data.cioospacific.ca/erddap/tabledap/ECCC_MSC_BUOYS.html