gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Improve handling of records declaring absence data #268

Closed timrobertson100 closed 4 years ago

timrobertson100 commented 4 years ago

Some datasets provide evidence of species absences. While this can be a difficult area to accommodate properly, as modeling effort and confidence are required, there is a lot we can do to improve the current situation, where consumers are given the burden of interpreting the data shared. In some cases, consumers will not even have enough information to detect this and will use absence records as presence records.

I propose we introduce the following:

  1. Introduce a search filter for occurrenceStatus in the occurrence search and download API and then expose it on the web site. We should review the data to determine if the current vocabulary is reasonable for the observed use in data. Where individualCount states 0 we should set occurrenceStatus = ABSENT if it is NULL and add a flag OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT to true. If occurrenceStatus is NULL we set it to PRESENT as a sensible default
  2. Add a flag for INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS setting it to true when the count is zero but the status declares it exists (could be several values) or when the count is >0 and it is declared as absent.
  3. Add flags for INDIVIDUAL_COUNT_UNPARSABLE and OCCURRENCE_STATUS_UNPARSABLE setting them appropriately when data cannot be parsed.

ahahn-gbif commented 4 years ago

Thanks - I also appreciate the conflict flags, as we are bound to have a few false positives caused by database default values and the like. A consideration on defaults (going back to earlier discussions): should true absence records be an opt-in for data users, i.e. filtered from view by default, and activated only on explicit request, similar to coordinates with known errors? I would expect that to be the most user-friendly option on the assumption that the majority of users would be looking for occurrences, not absences.
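To make the opt-in idea concrete, here is a minimal sketch of a view that hides absences unless they are explicitly requested. The function name `visible_records` and the dictionary record shape are hypothetical, and the thread does not settle whether this default was adopted.

```python
from typing import Dict, Iterable, List, Optional


def visible_records(records: Iterable[Dict[str, str]],
                    occurrence_status: Optional[str] = None) -> List[Dict[str, str]]:
    """Filter records by interpreted occurrenceStatus.

    With no explicit filter, only PRESENT records are returned, so absences
    are shown only when the caller asks for them (assumed behaviour).
    """
    wanted = occurrence_status or "PRESENT"
    return [r for r in records if r.get("occurrenceStatus") == wanted]
```

A caller would pass `occurrence_status="ABSENT"` to see absence records; every other call gets presences only.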

MattBlissett commented 4 years ago

We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

MattBlissett commented 4 years ago

It might be useful to have a present or hasPresence or similar filter, in the same way we have hasCoordinate, which summarizes individualCount and occurrenceStatus.

timrobertson100 commented 4 years ago

We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

This is important. If we are to refine this for occurrence data only, removing those terms that are targeting checklist use and possibly adding new ones, we need to

  1. Create a new enumeration in code
  2. Create a new vocabulary XML in rs.gbif.org
  3. Modify the occurrence and event core schemas to reference the new vocabulary

qgroom commented 4 years ago

We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

This is a horrible vocabulary for this term, because we should not be mixing up presence and absence with abundance. It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

Secondly, absence is only resolvable when there are some spatial and temporal limits. Shouldn't there be a check on eventDate, location and/or country to warn people there is an unbounded absence. Otherwise, the record sort of means it is absent everywhere and/or for all time.

MattBlissett commented 4 years ago

There was previous discussion about absence records in these issues:

Those are good points, Quentin. Is there a term for recording abundance? I can't see one. The vast majority of data gives present/absent, but there is some giving abundance.

These are the verbatim values we have for occurrenceStatus with frequency > 1000:

occurrenceStatus | count
\N | 1047300464
present | 190140103
Present | 88470094
Présent | 69978620
absent | 10433299
P | 1091847
Q | 774046
Ne Sait Pas | 321113
confirmed breeding | 284635
established | 256481
Presente | 223434
stocked | 215235
unknown | 79623
complet | 75906
presence | 69363
Rare 1-4 | 56083
Presence | 49337
probable breeding | 48533
incomplet | 43683
NA | 41735
possible breeding | 31631
Common 5-19 | 28557
Absent | 26417
Confirmed Present | 24652
Confirmed Breeding | 20403
Abundant 20-99 | 20175
doubtful | 19980
Possibly Breeding | 17800
Común | 17170
Probably Breeding | 16677
1 | 15406
irregular | 12661
Common | 12554
rare | 11010
Very abundant 100-499 | 7913
Occasional | 7602
Abundant | 7198
Rare (p < 1%) | 6525
collected | 6492
probably breeding | 6402
possibly breeding | 5863
Rare | 5803
Present (1% <= p < 5%) | 4940
Average Cover: 1-5% Maximum Cover: 1-5% | 4127
unclear breeding certaint | 3453
Very very abundant > 500 | 2813
Non observé | 2710
Песня, голос | 2378
Average Cover: 1-5% Maximum Cover: 6-25% | 2015
Common (5% <= p < 10%) | 1960
NT | 1847
Dominant (20% <= p) | 1274
Observed in Breeding Season | 1266
Abundant (10% <= p < 20%) | 1220
Average Cover: 76-95% Maximum Cover: 96-100% | 1135
Reported | 1084
Ausente | 1055
Damaged | 1051
Визуально | 1003

timrobertson100 commented 4 years ago

It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

Thanks for raising this. If you look at the data you'll also find attempts to convey things like invasive, threatened etc which would be better elsewhere too.

Secondly, absence is only resolvable when there are some spatial and temporal limits. Shouldn't there be a check on eventDate, location and/or country to warn people there is an unbounded absence. Otherwise, the record sort of means it is absent everywhere and/or for all time.

The suggestion to add a flag for UNBOUNDED_ABSENCE seems sensible and pragmatic. I'm mindful that modeling absence can become more complex (e.g. quantifying likelihood of observation) which shouldn't be a restriction to improving usage of presence data.
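One way such a check could look, as a sketch only: the thread leaves the exact rule open ("eventDate, location and/or country"), so the version below assumes an absence is unbounded when it lacks either a date or any spatial limit. The function names and the choice of terms inspected are assumptions.

```python
from typing import Mapping, Optional


def _has_value(record: Mapping[str, Optional[str]], term: str) -> bool:
    """True when the term is supplied and not blank."""
    value = record.get(term)
    return value is not None and value.strip() != ""


def is_unbounded_absence(record: Mapping[str, Optional[str]]) -> bool:
    """Return True for an ABSENT record lacking a temporal or spatial limit.

    Assumed rule: a date (eventDate) AND a place (country, or both
    coordinates) are needed for an absence to count as bounded.
    """
    if record.get("occurrenceStatus") != "ABSENT":
        return False
    has_time = _has_value(record, "eventDate")
    has_place = _has_value(record, "country") or (
        _has_value(record, "decimalLatitude")
        and _has_value(record, "decimalLongitude")
    )
    return not (has_time and has_place)
```

Presence records are never flagged here, in line with the point that absence handling should not restrict the use of presence data.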

Is there a term for recording abundance?

individualCount, organismQuantity and organismQuantityType?

MortenHofft commented 4 years ago

It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

@qgroom I understand present and absent, but what do doubtful and excluded mean for an individual occurrence?

albenson-usgs commented 4 years ago

Exciting! I hope it will be very clear to users that absence data ARE available. Sounds like it will be but just want to make sure. The P and Q are me (from a time before I was officially in charge of OBIS-USA), I'll make sure to get those corrected.

peterdesmet commented 4 years ago

Completely agree with what @timrobertson100 (how to parse it + flags) and @ahahn-gbif (exclude absences from views by default) suggest. Some notes:

  1. Some datasets provide organismQuantity and not individualCount. Will this be rolled into individualCount before assessment of individualCount = 0?
  2. Some datasets provide occurrenceStatus = absent (and variations), but not individualCount = 0. Will occurrenceStatus = ABSENT be set for those? Is a flag needed?
  3. To allow differentiation of INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS and OCCURRENCE_STATUS_UNPARSABLE you will probably need to process the most occurring occurrenceStatus values that exist in the wild to either ABSENT, PRESENT (or not able to parse)?

ahahn-gbif commented 4 years ago

+1 for 1. - we should indeed look at organismQuantity as well

  2. Some datasets provide occurrenceStatus = absent (and variations), but not individualCount = 0. Will occurrenceStatus = ABSENT be set for those? Is a flag needed?
    • If the individualCount is not 0, but NULL, occurrenceStatus = ABSENT is plausible.
    • We also have the opposite case, where individualCount = 0, but occurrenceStatus = PRESENT. In these cases, I would value occurrenceStatus over individualCount, e.g. assuming a database or import default value, and maintain occurrenceStatus = PRESENT, suggesting individualCount likely = NULL.
    • If the individualCount is not 0, but an actual (positive) value, and the occurrenceStatus = ABSENT, the flag would indeed make good sense - we will want to resolve that with publishers.

albenson-usgs commented 4 years ago
  • individualCount = 0, but occurrenceStatus = PRESENT. In these cases, I would value occurrenceStatus over individualCount, e.g. assuming a database or import default value, and maintain occurrenceStatus = PRESENT, suggesting individualCount likely = NULL

For the datasets I work with, this would not be a good assumption to make. Usually the individualCount is included first and the occurrenceStatus is created based on the individualCount or organismQuantity so if individualCount = 0 but occurrenceStatus = PRESENT it means something went wrong in the code to create occurrenceStatus.

ahahn-gbif commented 4 years ago

Usually the individualCount is included first and the occurrenceStatus is created based on the individualCount or organismQuantity

Thanks, good point, I hadn't considered that. In that case, I agree it should get the same conflict flag as @peterdesmet suggested under point 3

peterdesmet commented 4 years ago

I agree, I would also prioritize individualCount over occurrenceStatus. Trying to summarize:

individualCount | occurrenceStatus | inferred occurrenceStatus | flag
NULL | NULL | PRESENT |
NULL | present* | PRESENT |
NULL | absent* | ABSENT |
NULL | rubbish | PRESENT | OCCURRENCE_STATUS_UNPARSABLE
>0 | NULL | PRESENT | OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT
>0 | present* | PRESENT |
>0 | absent* | ABSENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS
>0 | rubbish | PRESENT | OCCURRENCE_STATUS_UNPARSABLE, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT
0 | NULL | ABSENT | OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT
0 | present* | PRESENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS
0 | absent* | ABSENT |
0 | rubbish | ABSENT | OCCURRENCE_STATUS_UNPARSABLE, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT
rubbish | NULL | PRESENT | INDIVIDUAL_COUNT_UNPARSABLE
rubbish | present* | PRESENT | INDIVIDUAL_COUNT_UNPARSABLE
rubbish | absent* | ABSENT | INDIVIDUAL_COUNT_UNPARSABLE
rubbish | rubbish | PRESENT | INDIVIDUAL_COUNT_UNPARSABLE, OCCURRENCE_STATUS_UNPARSABLE

*= or similar values
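The table can be read as a small decision procedure. The sketch below is one possible rendering in Python, not the pipelines implementation (which is Java). The function name `interpret`, the two-entry `STATUS_LOOKUP`, and the treatment of empty or negative counts as unparsable are all assumptions made for illustration.

```python
from typing import Optional, Set, Tuple

PRESENT, ABSENT = "PRESENT", "ABSENT"

# Hypothetical, deliberately tiny lookup standing in for "present*" / "absent*".
STATUS_LOOKUP = {"present": PRESENT, "absent": ABSENT}


def interpret(individual_count: Optional[str],
              occurrence_status: Optional[str]) -> Tuple[str, Set[str]]:
    """Return (interpreted occurrenceStatus, issue flags) for one record."""
    flags: Set[str] = set()

    # Parse individualCount; None means "not supplied or not usable".
    count: Optional[int] = None
    if individual_count is not None:
        text = individual_count.strip()
        if text.isdecimal():
            count = int(text)
        else:
            flags.add("INDIVIDUAL_COUNT_UNPARSABLE")

    # Parse occurrenceStatus against the lookup.
    status: Optional[str] = None
    if occurrence_status is not None:
        status = STATUS_LOOKUP.get(occurrence_status.strip().lower())
        if status is None:
            flags.add("OCCURRENCE_STATUS_UNPARSABLE")

    if status is not None:
        # A supplied, parsable status is kept; a conflict is only flagged.
        if count is not None and (count == 0) != (status == ABSENT):
            flags.add("INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS")
        return status, flags

    if count is not None:
        # No usable status: infer it from the count.
        flags.add("OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT")
        return (ABSENT if count == 0 else PRESENT), flags

    # Nothing usable at all: PRESENT is the default.
    return PRESENT, flags
```

Each row of the table maps to one call, for example `interpret("0", None)` for the row with individualCount 0 and a NULL occurrenceStatus.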

albenson-usgs commented 4 years ago

@peterdesmet I'm not understanding why this one would be flagged:

0 | absent* | ABSENT | OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT

I would think that one shouldn't get a flag since things are all in agreement?

peterdesmet commented 4 years ago

@albenson-usgs it's a choice 🤷‍♂️: behind the scenes I would always infer from individualCount if that is available and not rubbish, but you could opt not to indicate it as such (OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT) if everything is in agreement.

timrobertson100 commented 4 years ago

I'm afraid I'd disagree.

I would propose only inferring if required, otherwise what is the point of the field? This would be similar to how we handle others, e.g. decimalLatitude, decimalLongitude and country, where country is only inferred if it is null, or changed if it needs to be, with other information added to explain why.

Therefore I'd suggest:

individualCount | occurrenceStatus | interpreted occurrenceStatus | flag
0 | absent* | ABSENT |

peterdesmet commented 4 years ago

That's fine by me (have adapted in table above). But we still choose individualCount over occurrenceStatus if those disagree?

timrobertson100 commented 4 years ago

But we still choose individualCount over occurrenceStatus if those disagree?

Do you mean choose how to populate the interpreted occurrenceStatus field? I would suggest:

MortenHofft commented 4 years ago

But we still choose individualCount over occurrenceStatus if those disagree?

In support of Tim above (I think)

Normally we respect values that are there, but flag them as odd if they are in conflict with other values. E.g. coordinates: [in Paraguay] and country: Brazil would keep both country and coordinates but get an issue flag. If the country was missing it would be filled as Paraguay.

I would argue we do the same for individualCount, organismQuantity and occurrenceStatus. We infer occurrenceStatus if missing, but if it is provided, we do not mess with it (despite conflicts with other fields); instead we add issue flags.

peterdesmet commented 4 years ago

@MortenHofft that makes sense, but does this mean that individualCount: 0 + occurrenceStatus: present will be interpreted as a PRESENT occurrence (and shown on maps etc.)?

MortenHofft commented 4 years ago

but does this mean that individualCount: 0 + occurrenceStatus: present will be interpreted as a PRESENT occurrence (and shown on maps etc.)

Yes. Just like we show null island, despite it probably being faulty data. If we consider it particularly critical we can add an extra warning like we do on maps.

It isn't that I want to make it difficult for users, I just think we will be in more trouble if we start to rewrite data. In time I'd rather that

  • Publishers fix data with conflicts
  • We update/improve/fix issues with the data validator (for pre publishing reports)
  • We add default values (or custom overwrites) for dataset on a case-by-case basis
  • We allow negations on issue filters
  • Make quality filters/reports a more prominent feature/filter in the UI
  • Allow community annotation/flagging
  • We provide clearer guidance on how the fields are to be used

It is a lot more work though :)

Here and now I like what MattBlissett mentions. Adding something similar to has_coordinate + has_geospatial_issue that filters away those cases we consider critical - for the UI that might be the best option? But those are GBIF specific flags for easy filtering, without changing incoming data.

mdoering commented 4 years ago

We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

This is a horrible vocabulary for this term, because we should not be mixing up presence and absence with abundance. It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

I would argue the current occurrenceStatus vocabulary is more of an abundance vocabulary than a simple boolean. More like ACFOR: https://en.wikipedia.org/wiki/Abundance_(ecology), but including doubtful, absent & excluded.

I would prefer to create a new distribution status vocabulary to be used for species distribution checklists and shrink the existing occurrenceStatus one to be just present and absent like DwC suggests. It's probably also safer to change the distribution extension to point to a new vocabulary than changing the occurrence core to point to a new one.

albenson-usgs commented 4 years ago

Quick note just to say that occurrenceStatus is a required term for OBIS and only present or absent are accepted so this falls in line with what's been outlined here.

From the OBIS Manual: occurrenceStatus (required term) is a statement about the presence or absence of a taxon at a location. It is an important term, because it allows us to distinguish between presence and absence records. It is a required term and should be filled in with either present or absent.

peterdesmet commented 4 years ago

https://github.com/gbif/pipelines/issues/268#issuecomment-626581894:

We infer occurrenceStatus if missing, but if it is provided, we do not mess with it (despite conflicts with other fields); instead we add issue flags.

Ok, that is clearer (even though individualCount might have more reliable information, see https://github.com/gbif/pipelines/issues/268#issuecomment-624715027). I have updated my table at https://github.com/gbif/pipelines/issues/268#issuecomment-624755278 (in italic) to reflect this decision.

MattBlissett commented 4 years ago

I've made the relevant changes in the GBIF schema sandbox, I think exactly as @mdoering suggests.

Is that reasonable for everyone?

peterdesmet commented 4 years ago

Hi all, the limited occurrenceStatus vocab with just present and absent for occurrences is an improvement. From the animal tracking community, we have been thinking about using doubtful to indicate outliers in the occurrence data, so that (discussion) might come up again in the future.

timrobertson100 commented 4 years ago

Thanks @peterdesmet

The discussion on the use of doubtful for occurrenceStatus is on this issue.

I propose we don't overload occurrenceStatus with doubtful, as it raises the question of whether it is doubtfully present or doubtfully absent. A better option may be to model something about the confidence of the assertion in machine-based workflows (and indeed confidence of the identification for all records).

peterdesmet commented 4 years ago

@timrobertson100 true, although outliers can be seen as both doubtfully present and absent. But it indeed has something to do with confidence and might be better explained or processed elsewhere:

timrobertson100 commented 4 years ago

I believe the plan here is to document a short spec based on the discussion above, to review before implementing - is that correct please @muttcg ?

muttcg commented 4 years ago

I believe the plan here is to document a short spec based on the discussion above, to review before implementing - is that correct please @muttcg ?

Yes, that is correct: no coding yet. I am refactoring some parts, and we also need to migrate the codebase to the new Jackson gbif-api libs to be able to add new features. After that I will start to document the solution.

muttcg commented 4 years ago

@timrobertson100

  1. Use the OccurrenceStatus enum from org.gbif.api.vocabulary
  2. Add a new field to BasicRecord: occurrenceStatus
  3. Add a new vocabulary file and a loader for it
  4. Add a new interpretation, BasicInterpreter.interpretOccurrenceStatus
  5. Add the BasicInterpreter.interpretOccurrenceStatus step to BasicTransform
  6. Add the OccurrenceStatus field to es-occurrence-schema.json
  7. Add a new ES search and Hive visitor type

The only question I have is - Should we use a simple plain text file for the vocabulary or use new vocabulary project libs? Maybe we can use libs and file using the new vocabulary project, but don't include it in the editor.

I will create a dummy BasicInterpreter.interpretOccurrenceStatus implementation, write tests for it (TDD), create a PR to review the results of the tests, and after getting approval I will implement the main logic.

timrobertson100 commented 4 years ago

Thanks @muttcg

Tests capturing the scenarios @peterdesmet compiled make sense to me, paying attention to the various flags raised.

The only question I have is - Should we use a simple plain text file for the vocabulary or use new vocabulary project libs? Maybe we can use libs and file using the new vocabulary project, but don't include it in the editor.

The only thing to be aware of is that format will likely change, which may be a nuisance. Either would work though, as it is a simple vocabulary.

Edited to add: Now that the vocabulary server is deployed in production for the proof of concept vocabulary, restricted only to a few people, and since this is a super simple vocabulary, I propose we simply edit this vocabulary in the server. @asturcon - would you agree?

marcos-lg commented 4 years ago

I propose we simply edit this vocabulary in the server. @asturcon - would you agree?

Right, agree.

I checked the verbatim values that we currently have for occurrenceStatus and they will map to our vocabulary as shown in this spreadsheet. I took the values that are present in at least 3 datasets or 1000 records. And I only mapped the values where the distinction between present and absent is clear.

I'll import these values into a vocabulary.

MattBlissett commented 4 years ago

We can accept many more values as present, and we should, since the old vocabulary encouraged some of them. I did this a couple of months ago for the old dictionary: https://github.com/gbif/parsers/blob/master/src/main/resources/dictionaries/parse/occurrence_status.tsv
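As an illustration of the dictionary approach, the sketch below normalizes a verbatim value (case, whitespace, accents) before looking it up. The mapping is a small hypothetical subset built from values listed earlier in this thread; it is not the content of the spreadsheet or of occurrence_status.tsv, and the function names are made up for the example.

```python
import unicodedata
from typing import Optional

# Illustrative subset only; keys are already normalized (lower case, no accents).
_VERBATIM_TO_STATUS = {
    "present": "PRESENT",
    "presente": "PRESENT",
    "presence": "PRESENT",
    "confirmed present": "PRESENT",
    "absent": "ABSENT",
    "ausente": "ABSENT",
}


def _normalize(value: str) -> str:
    """Lower-case, strip accents, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def parse_occurrence_status(verbatim: Optional[str]) -> Optional[str]:
    """Return PRESENT or ABSENT, or None when the value is missing or unmapped."""
    if verbatim is None:
        return None
    return _VERBATIM_TO_STATUS.get(_normalize(verbatim))
```

Values such as "Rare 1-4" or "Q" return None in this sketch. Whether abundance-style values should be accepted as present is exactly the question raised above, so they are left unmapped here.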

muttcg commented 4 years ago

blocked by #325

timrobertson100 commented 4 years ago

In reviewing the code I spotted an error in the table.

individualCount | occurrenceStatus | inferred occurrenceStatus | flag
0 | present* | PRESENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT

Should read

individualCount | occurrenceStatus | inferred occurrenceStatus | flag
0 | present* | PRESENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS

We are not inferring presence from an individualCount of 0, but we do want to raise that there is conflict.

MattBlissett commented 4 years ago

I think there's a second, similar error in the table:

individualCount | occurrenceStatus | inferred occurrenceStatus | flag
>0 | absent* | ABSENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT

Should instead be:

individualCount | occurrenceStatus | inferred occurrenceStatus | flag
>0 | absent* | ABSENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS

peterdesmet commented 4 years ago

Thanks for noticing, I have updated the table

muttcg commented 4 years ago

API and interpretation in production
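For anyone wanting to try the filter, a request URL could be built roughly as below. This is a sketch: the endpoint and the `occurrenceStatus` and `limit` parameter names are assumptions based on the proposal at the top of this issue, so check the API documentation before relying on them. No request is sent here.

```python
from urllib.parse import urlencode

# Assumed endpoint for the occurrence search API.
BASE_URL = "https://api.gbif.org/v1/occurrence/search"


def absence_search_url(limit: int = 5) -> str:
    """Build (but do not send) a search URL asking for absence records only."""
    query = urlencode({"occurrenceStatus": "ABSENT", "limit": limit})
    return f"{BASE_URL}?{query}"
```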