Open n-a-t-e opened 2 years ago
Here are two solutions I can think of:
Remove profiles with > 1 coordinate: this would lead to many of the DFO_MEDS_BUOYS buoys being removed (119).
Get distinct lat/longs for the profile and plot the first one found: if the "first point found" is in a weird spot, the data provider should notice and fix the issue. With this solution, if a user zoomed in very far they might not get all of the profile's data when they click on the point.
After discussing with @JessyBarrette and @pramod-thupaki, it sounds like dropping these bad data and encouraging the data providers to fix their data is the best way forward.
Instead of doing orderByMin and orderByMax for lat/lon, we could do this instead, which is also much faster:
https://data.cioospacific.ca/erddap/tabledap/IOS_CTD_Profiles.htmlTable?profile,latitude,longitude&distinct()
Then we can filter out profiles with more than 1 coordinate
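A minimal sketch of that filtering step, assuming the distinct() response has already been parsed into (profile, latitude, longitude) rows. The function name and the sample rows are made up for illustration and are not from the CDE codebase.

```python
from collections import defaultdict


def profiles_with_multiple_coordinates(rows):
    """Return {profile: sorted coordinates} for profiles with more than one distinct (lat, lon)."""
    coords = defaultdict(set)
    for profile, lat, lon in rows:
        coords[profile].add((float(lat), float(lon)))
    return {p: sorted(c) for p, c in coords.items() if len(c) > 1}


# Made-up rows shaped like a profile,latitude,longitude response
rows = [
    ("p1", "48.50", "-123.50"),
    ("p2", "49.00", "-124.00"),
    ("p2", "49.10", "-124.00"),
]
bad = profiles_with_multiple_coordinates(rows)
print(sorted(bad))
```

Everything not returned here has exactly one coordinate and can be plotted as-is.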
This CSV shows the profiles that are affected:
If it is faster. May as well. Though I would add the timeseries_id variable
On Mon, May 9, 2022 at 4:44 p.m., Nate @.***> wrote:
profiles_with_multiple_coordinates.csv https://github.com/HakaiInstitute/cde/files/8655208/profiles_with_multiple_coordinates.csv
If it is faster. May as well. Though I would add the timeseries_id variable
Yes, the example is of a profile, so I used the profile variable. If it were a timeseries I would use the timeseries variable, or both if it were a TimeSeriesProfile.
If the distance between points is very small, e.g. buoy swing in a dataset where preciseLat isn't used, we can probably leave these datasets alone. @pramod-thupaki suggested a 300 m cutoff (if the distance between points is > 300 m, we remove that profile).
This will work most of the time and I think it's fine for now.
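One way the cutoff could be checked, as a rough sketch: take the largest great-circle distance between any two of a profile's distinct coordinates and compare it to 300 m. The haversine formula and the mean Earth radius are my choices here, not something the thread specifies, and the function names are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def max_spread_m(points):
    """Largest pairwise distance among the points (0.0 for a single point)."""
    return max((haversine_m(a, b) for a, b in combinations(points, 2)), default=0.0)


def keep_profile(points, cutoff_m=300.0):
    """True if the profile's coordinates all lie within cutoff_m of each other."""
    return max_spread_m(set(points)) <= cutoff_m
```

The pairwise comparison is quadratic in the number of distinct coordinates, which should be fine for buoy swing but would need rethinking for profiles with thousands of GPS fixes.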
This is the PRIMED buoy. It has a bit under 300 m of swing area. I made up the swing points in blue for now, but the real ones do look similar to this. These "swing points" don't appear in CDE. The problem is that if someone zooms in really far and clicks download, they might expect to get all the data for the PRIMED buoy, but because the download area is a lat/long box tied to the current zoom level, they might get just a fraction of what's available.
To make matters worse, if we are using the "first unique lat/long of the profile" to set the point in CDE, the red dot would more realistically be located where one of those blue dots is, not in the center.
I am seeing now in CDE that downloading data from the PRIMED buoy fails when zoomed in, and it likely gives a different amount of data depending on the zoom level. This is probably a pretty widespread issue for "profiles" with > 1 coordinate that don't give us a nominal location.
BTW, the 300 m cutoff is pretty arbitrary. We need to find a better way of addressing buoys that have a large watch circle and where the data includes lat/lon as timeseries. One solution could be to make nominal lat/lon a required field, but that is unlikely to work and we would lose a lot of data while these fields are missing. Another solution could be to handle the representation and download as we would for trajectories, which would have visualization/UI implications.
Also consider this situation, where a user creates a download box that they think includes all of the PRIMED buoy data, but would end up with only a fraction of it. (Blue dots are swing area, not displayed to user.)
I think this is an issue that would be present within ERDDAP. I don't think it's worth trying to fix it. That same issue would be reflected within the Explorer and download.
On Tue, May 10, 2022 at 4:29 p.m., Nate @.***> wrote:
Also consider this situation, where user creates a download box that they think includes all of the PRIMED buoy data, but would end up with only a fraction of it. (Blue dots are swing area, not displayed to user) [image: Screen Shot 2022-05-10 at 1 27 45 PM] https://user-images.githubusercontent.com/26209011/167716219-a7555bff-7a22-41b6-901e-daa836591126.png
I think I was afraid of using profile ID in the download because it wouldn't scale well in the front end, e.g. sending a list of 10k profile IDs to download could get slow. But it might work in the backend, where we take the shape the user has created, see which profiles' ranges touch that area, and then download from ERDDAP using those profile IDs instead of lat/long.
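A rough sketch of the "which profiles' ranges touch that area" step, assuming the user's shape is reduced to a lat/long box and each profile's range is stored as min/max lat/lon. The `Box` type and function names are hypothetical, and boxes crossing the antimeridian are not handled.

```python
from typing import NamedTuple


class Box(NamedTuple):
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float


def boxes_touch(a, b):
    """True if two lat/lon boxes overlap or share an edge."""
    return (a.lat_min <= b.lat_max and b.lat_min <= a.lat_max
            and a.lon_min <= b.lon_max and b.lon_min <= a.lon_max)


def profiles_touching(selection, profile_ranges):
    """Return IDs of profiles whose coordinate range touches the selection box."""
    return [pid for pid, box in profile_ranges.items() if boxes_touch(selection, box)]


# Made-up ranges: a buoy with a small swing area, and a fixed station elsewhere
profile_ranges = {
    "swinging_buoy": Box(48.000, 48.002, -123.002, -123.000),
    "fixed_station": Box(50.0, 50.0, -125.0, -125.0),
}
selection = Box(48.0010, 48.0015, -123.0015, -123.0010)
print(profiles_touching(selection, profile_ranges))
```

Because the match is on the profile's whole range, a selection covering only part of the swing area still picks up the profile, and the download can then request it by ID.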
OK, one issue with this solution is that most web servers have an 8k URL limit, so when we go to download from ERDDAP we would be limited to ~800 profiles per request. We could do multiple requests if we had to.
Station Papa alone has 6k profile IDs, so the downloader might take 10 requests to compile all this data (not that the user would know).
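The multi-request idea could look something like this: greedily pack profile IDs into batches so each request stays under a character budget. The 8000-character limit comes from the comment above; the 200-character allowance for the rest of the URL, the `|` separator, and the function name are assumptions for illustration. It counts raw characters, so percent-encoding would need extra headroom.

```python
def batch_ids(profile_ids, max_chars=8000, overhead=200, sep="|"):
    """Split IDs into batches whose joined length fits within max_chars - overhead.

    An ID longer than the budget still gets a batch of its own.
    """
    budget = max_chars - overhead
    batches, current, used = [], [], 0
    for pid in profile_ids:
        cost = len(pid) + (len(sep) if current else 0)
        if current and used + cost > budget:
            # Current batch is full: start a new one with this ID
            batches.append(current)
            current, used = [], 0
            cost = len(pid)
        current.append(pid)
        used += cost
    if current:
        batches.append(current)
    return batches


# 6000 made-up IDs of 9 characters each
ids = [f"P{i:08d}" for i in range(6000)]
batches = batch_ids(ids)
print(len(batches), len(batches[0]))
```

Each batch would become one ERDDAP request, and the backend would concatenate the results before returning them to the user.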
I think this is an issue that would be present within ERDDAP. I don't think it's worth trying to fix it. That same issue would be reflected within the Explorer and download.
The difference is we've created a "nominal" single point of where this buoy is, and we've chosen it pretty arbitrarily or inaccurately so far (my bad).
Here's an example of how bad it is right now. The big circle is where we show the point in CDE (based wrongly on latitude_min and longitude_min), and the little dots are the actual data points. The swing area is ~250 m. Note that the big dot isn't even a point in the dataset.
Is this a projection issue? As far as I know, we are getting access to the lat/long min/max values, which get converted to a specific value to be presented on the map. I'm not sure what that value is. The average of the min/max lat/long?
Is it a projection issue otherwise?
Is this a projection issue? As far as I know, we are getting access to the lat/long min/max values, which get converted to a specific value to be presented on the map. I'm not sure what that value is. The average of the min/max lat/long?
Not a projection issue. We are using latitude_min and longitude_min as the point to plot in CDE, which might not be a point in the dataset. The "average" point would also not be a point in the dataset, so it would have the same issue downloading when zoomed in, but it would be better for sure.
Looking through that CSV I posted earlier with offending datasets:
Here's an example from the historic MEDS buoy data:
Buoy was deployed 7 times. Distance between first and last deployments is 121.5km. This info is from deployment metadata:
C44137 East Scotian Slope AE 19881130 19881208 41.32 61.35 4500 9 9 SCOTIAN SHELF
C44137 East Scotian Slope AE 19890908 19930622 41.19 61.13 4500 1383 1306 SCOTIAN SHELF
C44137 East Scotian Slope AE 19930703 19950628 41.23 61.42 4500 725 641 SCOTIAN SHELF
C44137 East Scotian Slope AE 19950630 19951102 41.60 60.03 4500 125 87 SCOTIAN SHELF
C44137 East Scotian Slope AE 19960922 19971015 41.65 59.92 4500 388 268 SCOTIAN SHELF
C44137 East Scotian Slope AW 19980625 20030219 41.83 60.94 4500 1700 953 SCOTIAN SHELF
C44137 East Scotian Slope 6N 20030605 20220517 42.28 62.00 4000 6921 5247 SCOTIAN SHELF
So where do we show the C44137 buoy in CDE?
The "fix" I am implementing now is to use this metadata to set latitude and longitude in the ERDDAP dataset, and then create new preciseLat/preciseLon columns in the source data. We will end up with a point for each deployment, so there will be 7 "C44137"s on the map, with names like C44137_19881130.
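To illustrate the per-deployment naming, here is a sketch that maps an observation date to its deployment. It assumes the fourth and fifth columns of the metadata are deployment start and end dates and the next two are latitude and longitude, which is my reading of the rows rather than something the file states. Coordinate values are copied as given, and `deployment_for` is a hypothetical helper.

```python
from datetime import date, datetime

# (station, start, end, latitude, longitude) taken from the deployment rows above
DEPLOYMENTS = [
    ("C44137", "19881130", "19881208", 41.32, 61.35),
    ("C44137", "19890908", "19930622", 41.19, 61.13),
    ("C44137", "19930703", "19950628", 41.23, 61.42),
    ("C44137", "19950630", "19951102", 41.60, 60.03),
    ("C44137", "19960922", "19971015", 41.65, 59.92),
    ("C44137", "19980625", "20030219", 41.83, 60.94),
    ("C44137", "20030605", "20220517", 42.28, 62.00),
]


def parse_day(text):
    """Parse a YYYYMMDD string into a date."""
    return datetime.strptime(text, "%Y%m%d").date()


def deployment_for(station, day, deployments=DEPLOYMENTS):
    """Return (deployment_id, lat, lon) for the deployment covering `day`, or None."""
    for stn, start, end, lat, lon in deployments:
        if stn == station and parse_day(start) <= day <= parse_day(end):
            return f"{stn}_{start}", lat, lon
    return None


print(deployment_for("C44137", date(1990, 1, 1)))
```

Observations that fall in a gap between deployments return None here; how the real dataset treats those is not covered in this thread.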
Without correcting the latitude/longitude based on deployment metadata, there would be 10 lat/long points in the source data.
I have updated the MEDS buoys historic dataset, DFO_MEDS_BUOYS (from CSVs), to use deployment metadata so that multiple lat/lon is no longer an issue for this dataset. Note that this dataset is also called 'Realtime', but I think I will change that now that all buoys use the newer system below. The newer system originally only supported a handful of buoys but now seems to have all of the buoys that are currently reporting, though no historic data. This fixes an issue some users had where they get a different amount of data depending on zoom level. See https://data.cioospacific.ca/erddap/tabledap/DFO_MEDS_BUOYS.html
ECCC_MSC_BUOYS uses ECCC's newer 'SWOB' system and uses the nominal lat/long for latitude and longitude, while providing crnt_buoy_lat, crnt_buoy_lon for measured values. So multiple lat/lon isn't an issue in this dataset. See https://data.cioospacific.ca/erddap/tabledap/ECCC_MSC_BUOYS.html
Right now we plot a single point on the map for every profile we get from an ERDDAP dataset. There should usually be a single coordinate per profile, or if the latitude/longitude is set from a GPS, the multiple points will usually be close enough together that it's not an issue.
The problem is that we get the lat/lon range for the profile, and then put the lat_min and lon_min together to make a coordinate, but it's possible that this coordinate doesn't exist in the profile!
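A small illustration of why combining the minimums can produce a coordinate that is not in the profile, with made-up points. The alternative shown, picking the recorded point nearest the mean position, is only one possible choice and uses a crude distance in degrees. The function names are hypothetical.

```python
def min_corner(points):
    """Combine the minimum latitude and minimum longitude, as the plotted point does now."""
    return (min(lat for lat, _ in points), min(lon for _, lon in points))


def nearest_recorded_point(points):
    """Return the recorded point closest to the mean position (squared distance in degrees)."""
    mean_lat = sum(lat for lat, _ in points) / len(points)
    mean_lon = sum(lon for _, lon in points) / len(points)
    return min(points, key=lambda p: (p[0] - mean_lat) ** 2 + (p[1] - mean_lon) ** 2)


# Made-up coordinates for one profile
points = [(48.0, -123.0), (48.4, -124.0), (49.0, -123.2)]

corner = min_corner(points)
print(corner, corner in points)
print(nearest_recorded_point(points))
```

The corner takes its latitude from one point and its longitude from another, so it matches a recorded point only when a single point happens to hold both minimums.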
This is an extreme example where there's an error in the data and the profile has 4 points that are very far apart:
This results in a point in CDE near Lake Cowichan, which comes from the combined latitude_min and longitude_min of these four points.
We have 348 profiles in 14 datasets where the latitude or longitude range is > 0.1 degrees, which isn't really that many, and most of these are probably due to errors in the source data.
This bug would affect downloading: if a user tries to download one of these 348 profiles, they could get an empty download, or a download with data missing.