gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Incorrectly flipping coordinates #479

Open timrobertson100 opened 3 years ago

timrobertson100 commented 3 years ago

These records in the sea here are having coordinates flipped by GBIF incorrectly. Example Puma which is flipping due to the country being Argentina while our geocode (and google maps) would put it in Paraguay.

I don't know what the best thing to do here is, but what we do now is not good. We could consider not flipping coordinates but flagging them and omitting from maps as we once did, not flipping coordinates when the target is an island, machine tagging datasets to omit flipping... other ideas?

jhnwllr commented 3 years ago

Happens when close to the border + country has weird mirrored EEZ or territory. Seems like there should be a way to buffer the country borders and find the most common weird mirrored areas...

users have found a few: https://github.com/gbif/portal-feedback/issues/3207 https://github.com/gbif/portal-feedback/issues/3264 https://github.com/gbif/portal-feedback/issues/3171

timrobertson100 commented 3 years ago

Thanks @jhnwllr. I suspect it also happens when the record is not close to the border and there are EEZ/territories in play, like the location of the Puma above. Border cases can be handled more easily, I think, by giving the benefit of the doubt and trusting the record.

MattBlissett commented 3 years ago

I think there are several places for investigation:

We should probably reduce the possibilities by allowing either a coordinate transformation (except for very large countries) or an ISO code equivalent match, and maybe push towards data improvement at source by not hiding issues in e.g. the FR/RE case.

MattBlissett commented 3 years ago

Although the dataset with the most swapped coordinates is in Russia: https://www.gbif.org/occurrence/map?dataset_key=df9dcbc9-f3d1-40e2-9011-02babb0d98f6&issue=PRESUMED_SWAPPED_COORDINATE and the second in the USA: https://www.gbif.org/occurrence/map?dataset_key=37eeb5de-fec3-404a-ba4f-62218ac6c860&issue=PRESUMED_SWAPPED_COORDINATE

abubelinha commented 1 year ago

New example of useless data due to incorrect GBIF interpretations. https://www.gbif.org/occurrence/617734368 (and many more from the same provider)

Here a chain of errors happened:

  1. Dataset contains mostly Spanish and Portuguese vegetation plots (many species observations per plot). But there are also some plots from other countries. The above example is part of a group of Italian plots that were incorrectly set with 'ES' countryCode. The rest of the original information was correct (coordinates, location, whatever).
  2. GBIF detected those coordinates were not in Spain. But instead of correcting the wrong country, GBIF "uncorrected" the correct coordinates.
  3. GBIF used the new "uncorrected" coordinates to assign gADM levels 0-1-2-3 to the occurrence. So now it almost looks like correct stuff because all the major geography-related and machine-readable fields are coherent. Except for the original verbatim location (I was so lucky to catch this: I don't pay attention to this field because I cannot use it for mapping occurrences).
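The chain above can be sketched in a few lines. This is only an illustration of the "transform until it matches the stated country" idea discussed in this thread, not the actual pipelines code: the function names, the order in which transformations are tried, and the `toy_box` lookup (a crude rectangle standing in for a real country geocoder) are all assumptions. The issue flag names are the ones that appear in this thread.

```python
from typing import Callable, Iterator, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


def candidate_transforms(lat: float, lon: float) -> Iterator[Tuple[str, Coord]]:
    """Yield (issue flag, transformed coordinate) pairs to try, in an assumed order."""
    yield "PRESUMED_NEGATED_LATITUDE", (-lat, lon)
    yield "PRESUMED_NEGATED_LONGITUDE", (lat, -lon)
    yield "PRESUMED_SWAPPED_COORDINATE", (lon, lat)


def flip_to_match(
    lat: float, lon: float, in_stated_country: Callable[[float, float], bool]
) -> Tuple[Coord, Optional[str]]:
    """Return the coordinate to publish and the issue flag raised, if any.

    The stated country is trusted over the coordinates, which is exactly
    what goes wrong when the country code is the erroneous field.
    """
    if in_stated_country(lat, lon):
        return (lat, lon), None
    for flag, (t_lat, t_lon) in candidate_transforms(lat, lon):
        # A swapped pair is only usable if the new latitude is valid.
        if -90.0 <= t_lat <= 90.0 and in_stated_country(t_lat, t_lon):
            return (t_lat, t_lon), flag
    return (lat, lon), "COUNTRY_COORDINATE_MISMATCH"


def toy_box(lat: float, lon: float) -> bool:
    """Hypothetical stand-in for a country lookup: a plain rectangle."""
    return 36.0 <= lat <= 44.0 and -10.0 <= lon <= 4.0
```

With this sketch, a record at latitude 40.0, longitude 9.0 whose stated country is represented by `toy_box` has its longitude negated, because the negated position happens to fall inside the rectangle. Nothing in the logic asks whether the country code, rather than the coordinate, is the wrong value.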

These automatisms are introducing lots of errors into the provided information. There was one erroneous field. Now there are 4 (and the original one still remains).

Please stop doing this ASAP: deactivate until it works properly. It is extremely dangerous, as it corrupts other regions' biodiversity data. The example is one species in a plot; there are many plots in a table and many tables in a publication. I am afraid many plant records might have been wrongly transposed from Italian to Iberian Peninsula.

Leave users the responsibility of checking coordinates outside countries. When values are changed to make data look coherent, errors become harder to catch.

I don't see a reason for this to be happening as GBIF detected it years ago. It is such a bad thing. I agree with @timrobertson100's proposal in the original post: just flag things, but don't change them, please.

Thanks! @abubelinha

timrobertson100 commented 1 year ago

Thanks for taking the time to provide detail @abubelinha - I can imagine discovering this case was frustrating.

For a bit of background - originally, GBIF did only flag records, and around N (TODO - I'll find the code update) we changed the behavior. If you look at this blog you can see just how problematic things were back in 2011. Because people were downloading data, ignoring the flags (or using formats without them) they were getting frustrated, and we received feedback along the lines of "well, if you can detect it, please just change it". I believe at the time it was probably a good decision, as an analysis of the data at that time indicated it did more good than harm.

Fast forward to today. New validation tools, real-time indexing where people view their data within minutes, a more mature data-sharing community, training etc., likely all add up to a situation where data is probably far cleaner than it once was (744k records affected). It is a good time to reevaluate, bearing in mind that there are better tools to help detect issues during data publishing.

I think we could approach this in one of at least three ways:

  1. improve the algorithm. e.g. only swapping where there is sufficient evidence to justify it such as with an additional check against a stated stateProvince or locality
  2. revert to only flagging the data (may then upset some users who don't pay attention to the issue flags in the download, or who have got used to the behaviour)
  3. nullify the coordinates if they are sufficiently distant from the stated country (buffering needed to accommodate approximations) or detected as suspicious
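To make the difference between options 2 and 3 concrete, here is a minimal sketch. It assumes the stated country can be approximated by a bounding box, measures distance in plain degrees, and uses a default buffer of 0.5°; all of these are illustrative simplifications (real country polygons, EEZs and the antimeridian are ignored), and `Box`, `degrees_outside`, `flag_only` and `nullify_if_distant` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a country geometry."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


def degrees_outside(lat: float, lon: float, box: Box) -> float:
    """Crude distance in degrees from a point to the box (0.0 if inside)."""
    d_lat = max(box.min_lat - lat, 0.0, lat - box.max_lat)
    d_lon = max(box.min_lon - lon, 0.0, lon - box.max_lon)
    return max(d_lat, d_lon)


def flag_only(lat: float, lon: float, box: Box) -> Tuple[Optional[Coord], List[str]]:
    """Option 2: never change the coordinate, only report the mismatch."""
    if degrees_outside(lat, lon, box) == 0.0:
        return (lat, lon), []
    return (lat, lon), ["COUNTRY_COORDINATE_MISMATCH"]


def nullify_if_distant(
    lat: float, lon: float, box: Box, buffer_deg: float = 0.5
) -> Tuple[Optional[Coord], List[str]]:
    """Option 3: keep coordinates within the buffer, drop them otherwise."""
    if degrees_outside(lat, lon, box) <= buffer_deg:
        # Near-border records get the benefit of the doubt.
        return (lat, lon), []
    return None, ["COUNTRY_COORDINATE_MISMATCH"]
```

In both variants the verbatim values are untouched; the only question is whether the interpreted record keeps a coordinate that disagrees with its stated country.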

In the meantime, I will write to the publisher of this dataset too.

CC @ahahn-gbif

abubelinha commented 1 year ago

Thanks for answering @timrobertson100 I like your suggested approaches, just in reversed order:

Approach 3 shouldn't bother anyone: just nullify incoherent data. I think this could be a fair, educative, quick and easy solution. It's much better not having coordinates than having them wrong (no matter whose fault it is). I would suggest nullifying them AND moving them to verbatimXXX fields (unless they are already there too) with some flag (e.g. INCOHERENT_DATA_MOVED_TO_VERBATIM), so users can still track what the original information was. Actually, if you nullify both country and coordinates when they don't match, that would probably stimulate providers to pay more attention to these issues and correct them next time (nobody likes their data to disappear from both map and country searches/downloads, but nobody can blame GBIF for removing them if they don't match).

Approach 2: don't change things, just flag them
I see your point here, but IMHO GBIF shouldn't be more afraid of upsetting users (who don't pay attention to GBIF flags), than of upsetting providers (when GBIF data cleaning becomes data corruption). Everybody should pay attention to issue flags.
Regarding feedback asking to "change data if you detect errors", I'd suggest different alternatives:

Approach 1: improve protocol and check stateProvince
Sounds interesting but looks difficult to me: stateProvince is not a standardized DwC term ... how would you check coordinates against that? (not to mention against locality). Also this would only be useful when that stateProvince field is provided (not in the above example, where locality was clearly an Italian place but only for a human reader, not for a computer). I am more in favor of GBIF letting users incorporate standardized GADM fields in the original dataset as suggested in https://github.com/gbif/portal-feedback/issues/4012. In that case, perhaps stateProvince, county or municipality could be used for this coordinate validation purpose whenever they are filled in with valid GADM values? I also suggest those GADM values should be useful for map searches even if coordinates are not provided (GADM are polygons, after all ... but that's another issue). However GBIF implements it, I think your 1st approach would take time to happen. And I think this is an urgent issue that should be solved.

For now, I think approach 3 is the easiest and most correct. Also the most educative one for data providers.


Regarding your last comment, bothering publishers is something I wouldn't do. They already can see issues flagged in their datasets. It's their own responsibility to look at those flags and try to correct data next time, if they can.

I wouldn't like my comments to cause GBIF sending any messages to providers. Please don't do that for me, as I can do that on my own (I've already contacted this publisher in the past: it was a project finished years ago so dataset corrections are unlikely to happen). But that's not the point.

My comment was not about the provider's error in countryCode. All providers (me too) have a number of records wrongly georeferenced in one way or another. Users (me too) can easily filter out those errors: that's what GBIF issue flags are for, and they are great stuff.

The real point is that going beyond flags can cause much worse issues when assuming "if coordinates don't match country, then coordinates are wrong" (that's not the publisher's but GBIF's fault). In the above example there was enough geographical evidence that the error was in the country and not in the coordinates. If GBIF's protocol is not smart enough to catch this, then it shouldn't be used.

Forgive me if I sound too harsh but it surprised me a lot to find the origin of this error was in GBIF itself. I would forget about notifying providers of particular wrong occurrence examples and concentrate on stopping these "wrong correction issues" from happening.

BTW, thanks a lot for providing the blog example. Are we authorized to reuse their contents? (including images). I am preparing a derived occurrence-based checklist dataset and those images in blog post will be so useful to illustrate why some gbif's flagged records must be removed from downloaded data.

Regards @abubelinha

timrobertson100 commented 1 year ago

Thanks again @abubelinha

> BTW, thanks a lot for providing the blog example. Are we authorized to reuse their contents? (including images).

Yes, but please bear in mind that was written more than a decade ago, and the GBIF systems and community content have progressed since then; for background, it may be helpful though. With a little notice, we can provide data or images showing the current state of the problem if that helps.

abubelinha commented 1 year ago

Thanks @timrobertson100

Just to clarify why I say this is an urgent issue: my above example is not an outlier. This is a "massive GBIF wrong longitude correction". I have just checked to confirm my suspicions.

47302 records (2.76%) of this big dataset are flagged PRESUMED_NEGATED_LONGITUDE and show a negative longitude instead of the one originally provided.

I have been personally checking a few (~20) random occurrences (by changing the offset parameter in above url) and I couldn't find a single one where GBIF flag and correction were right. In all cases, the original coordinate was the correct one, and portal is currently showing these data switched from East to West hemisphere.
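A spot check like this one can be scripted. The sketch below only builds request URLs for the public GBIF occurrence search API and picks random offsets; it is a rough illustration, assuming the `datasetKey`, `issue`, `limit` and `offset` query parameters behave as they do in the portal links in this thread. The dataset key is deliberately left as a placeholder, and `search_url` and `random_offsets` are hypothetical helper names.

```python
import random
from typing import List
from urllib.parse import urlencode

# Base URL of the occurrence search endpoint (assumed; check the API docs).
API = "https://api.gbif.org/v1/occurrence/search"


def search_url(dataset_key: str, issue: str, offset: int = 0, limit: int = 1) -> str:
    """Build a search URL returning `limit` flagged records starting at `offset`."""
    params = {
        "datasetKey": dataset_key,
        "issue": issue,
        "limit": limit,
        "offset": offset,
    }
    return f"{API}?{urlencode(params)}"


def random_offsets(total: int, sample_size: int, seed: int = 0) -> List[int]:
    """Pick distinct random offsets in [0, total) so the sample is reproducible."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total), sample_size))


# Example: 20 URLs to inspect by hand, one flagged record each.
urls = [
    search_url("YOUR-DATASET-KEY", "PRESUMED_NEGATED_LONGITUDE", offset=o)
    for o in random_offsets(total=47302, sample_size=20)
]
```

Each returned record can then be compared by hand against its verbatim coordinates to decide whether the flag and the correction were right.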

That's what I suspected because I knew these coordinates were not latitude/longitude in the original dataset: they have all been converted from MGRS grid references so the "presumed negated longitude" flags made no sense to me from the beginning.

I bet in most cases the problem was a wrongly provided country (ES instead of FR, IT, AD, PT). Probably a confusion with publishingCountry.

What puzzles me a lot is how GBIF's longitude sign conversion happened even when the flipped coordinates fall in deep sea. See this big bounding box subset (~25%) of the 47302 presumed mistakes.

In those cases, GBIF correction caused also a new issue, changing from COUNTRY_COORDINATE_MISMATCH to CONTINENT_COORDINATE_MISMATCH and at least 25% of GBIF corrections sent vascular plant occurrences from land to sea. That might well indicate something being wrong with those corrections.

ahahn-gbif commented 1 year ago

Thank you for this discussion. Obviously we do not want to introduce mis-corrections that worsen a publisher's original data.

For damage control, I tend to support option 3 (nullify coordinates in a case of country/coordinate mismatches). At the same time, we need more exploration to understand how many currently appropriately corrected georeferences this change is going to lose, if only to evaluate the investment into more nuanced handling in future, as sketched by @MattBlissett at the start of this thread. There may be more to consider in future, like taxon list/shape file combinations for marine areas, currently not available.

I agree that we can rely more and more on data publishers to get actively involved in data improvement at source. This does not apply to all, though. While we want to encourage that engagement, centrally taking over the correction of inconsistent data makes follow-up seem less necessary. On the other hand, changing the processing rules by dropping mismatched coordinates entirely will necessarily cause a, possibly significant, drop in available georeferenced records where the auto-correction so far has been doing more good than harm. @jhnwllr, could I consult with you on that when you are back, please?

The verbatim/original values would remain available anyway, at least in downloads of a full archive. Asking publishers to sign up for auto-correction may be a good idea as such, but will not solve the immediate problem at hand (mis-correction); it may raise awareness that attention to flags is needed, though. Ultimately, cases like the example will need to be handled by the publisher to correct; we do want to avoid misrepresenting the data meanwhile.

While we try to understand the impact, both on users and publishers, and on either side of auto-correction yes or no: would it be viable to take a "light" approach of opting datasets out of this handling (rather than in), and possibly do this through some externally applied tags, as soon as some issue is spotted? It is hard to quantify these cases overall. If misdiagnosed datasets could be taken out of the auto-correction near-immediately, we would not prevent such cases from happening, but could still deal with them as quickly as possible, while also building a case library of interpretation issues that we would, in the longer run, want to be able to handle better/adequately (?)

MattBlissett commented 1 year ago

I think the majority of the corrections are valid, but there are particular areas where there will be more mistakes. Argentina is one, as there are matching EEZs in the Atlantic that hit a transposed version of the country:

swapped_collision

but anywhere within a few degrees of the Equator or 0° / 180° longitudes will also be risky, e.g. France, Spain, Britain. We could set up exclusions -- prevent swapping if the country is in (Argentina, Brazil, Russia, USA, etc) or flipping sign if the coordinate is <5° or so.
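The exclusions described here could look roughly like the guard functions below. This is a sketch only: the country set lists just the four countries named above (the "etc" is left open), the 5° threshold is the approximate figure from the comment, and extending the longitude guard to the 180° meridian is a reading of the preceding sentence rather than something stated outright. The function names are hypothetical.

```python
# ISO 3166-1 alpha-2 codes for the countries named in the comment.
SWAP_EXCLUDED_COUNTRIES = frozenset({"AR", "BR", "RU", "US"})

# "<5 degrees or so": an approximate threshold, not a tuned value.
MIN_DEGREES_FROM_AXIS = 5.0


def may_swap(country_code: str) -> bool:
    """Allow a latitude/longitude swap unless the country is excluded."""
    return country_code.upper() not in SWAP_EXCLUDED_COUNTRIES


def may_negate_latitude(lat: float, threshold: float = MIN_DEGREES_FROM_AXIS) -> bool:
    """Allow a sign flip only when the point is clear of the Equator."""
    return abs(lat) >= threshold


def may_negate_longitude(lon: float, threshold: float = MIN_DEGREES_FROM_AXIS) -> bool:
    """Allow a sign flip only when the point is clear of both 0 and 180 degrees."""
    return threshold <= abs(lon) <= 180.0 - threshold
```

Guards like these reduce the risky cases but do not address a wrong country code: a longitude of 12.5° would still pass `may_negate_longitude`.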

But I think I would prefer to work towards Tim's second suggestion:

> revert to only flagging the data (may then upset some users who don't pay attention to the issue flags in the download, or who have got used to the behaviour)

Most other interpretation steps only flag, they don't correct. Taking Benin as an example, there are already almost 11,000 records with 'country coordinate mismatch', scattered all over the world -- users need to handle this, or accept the suggestion to remove these records from their search: https://www.gbif.org/occurrence/map?country=BJ&has_coordinate=true&issue=COUNTRY_COORDINATE_MISMATCH

There are 17,000 records we've (probably...) fixed: https://www.gbif.org/occurrence/map?country=BJ&has_coordinate=true&issue=PRESUMED_SWAPPED_COORDINATE&issue=PRESUMED_NEGATED_LONGITUDE&issue=PRESUMED_NEGATED_LATITUDE (less a few we swapped because the latitude was > 90).

I think the surprise or upset would be for publishers, who might have seen their data was fixed during interpretation and chosen not to correct it.

We are discussing 910,463 occurrences from 2117 datasets: https://www.gbif.org/occurrence/search?issue=PRESUMED_SWAPPED_COORDINATE&issue=PRESUMED_NEGATED_LONGITUDE&issue=PRESUMED_NEGATED_LATITUDE

I'm preparing some graphics like in the 2011 blog post.

MattBlissett commented 1 year ago

See verbatim occurrence coordinates for the unprocessed coordinates for each country.

Compare the 2011 view of the USA:

USA 2011 verbatim

With the 2023 view:

USA 2023 verbatim

timrobertson100 commented 1 year ago

These are incredibly helpful @MattBlissett. This API call lists how many records there are per country.

It looks like GBIF.org today probably adds pretty good value to US and countries that lie far from the 0,0 lines. In some of the Brasil cases, it looks useful (where the longitude is positive).

220k of the 226k records in Canada come from this dataset alone, which we should probably fix. We generate it from their API in agreement with them, so we probably have the means to fix it ourselves.

Perhaps we should review some of the largest datasets individually too, especially if they are orphaned?

ahahn-gbif commented 1 year ago

Thanks Matt! Very impressive indeed.

I am starting to wonder (and this is just me thinking out loud): is it worth considering an alternative display for dataset / publisher maps, showing the records as-are?

On integrated search results and global maps, the individual contributions are so mixed in with the rest that there is no trigger to action implied. Individual dataset pages are more exposed that way.

We have a similar effect with higher taxonomy, where an expectation of GBIF "adding this anyway" has held publishers back, in at least some known cases, from providing information on higher ranks with their data, sometimes leading to matching errors as a consequence.

ahahn-gbif commented 1 year ago

Thanks for the suggestion, Tim!

I guess I am a bit more nasty, in suggesting that dataset pages in GBIF.org could show the unadulterated records. In part because not every publisher uses the IPT; in part because this is the place where (I would expect) most publishers check early in the process, after publishing a dataset.

Improved options for checking beforehand, of course, will be great, and many publishers do make use of the validator tool, some with great diligence - we will certainly want to keep supporting and rewarding that. For others, maybe showing the actual status in the context where it would encourage action could raise awareness?

timrobertson100 commented 1 year ago

Update on this:

The Canadian dataset is now fixed and has 5830 records down from 226k yesterday.

Matt has identified an issue linked above, where we are not flagging records as suspicious when we swap coordinates as intended. That would at least have meant we were flagging it as suspicious that we (incorrectly) swapped them into Spain, and many users do ignore suspicious records.

(this is just to report progress, but the original issue of GBIF incorrectly swapping coordinates remains)

MattBlissett commented 1 year ago

> Matt has identified an issue linked above, where we are not flagging records as suspicious when we swap coordinates as intended. That would at least have meant we were flagging it as suspicious that we (incorrectly) swapped them into Spain, and many users do ignore suspicious records.

This is now fixed, and about 250 datasets are being reprocessed. Something like 1750 have already been reprocessed.

https://www.gbif.org/occurrence/search?offset=0&has_coordinate=true&has_geospatial_issue=false&issue=PRESUMED_SWAPPED_COORDINATE&issue=PRESUMED_NEGATED_LONGITUDE&issue=PRESUMED_NEGATED_LATITUDE gives 351k results now; it should gradually reduce to zero.