Improve handling of records declaring absence data

timrobertson100 commented 4 years ago

Some datasets provide evidence of species absences. While this can be a difficult area to accommodate properly as modeling effort and confidence are required, there is a lot we can do to improve the current situation where consumers are given the burden of interpreting the data shared. In some cases, consumers will not even have enough information to detect this and will use absence records as presence records.

I propose we introduce the following:

Introduce a search filter for occurrenceStatus in the occurrence search and download API and then expose it on the web site. We should review the data to determine if the current vocabulary is reasonable for the observed use in data. Where individualCount states 0 we should set occurrenceStatus = ABSENT if it is NULL and set a flag OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT to true. If occurrenceStatus is otherwise NULL we set it to PRESENT as a sensible default.
Add a flag for INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS setting it to true when the count is zero but the status declares it exists (could be several values) or when the count is >0 and it is declared as absent.
Add flags for INDIVIDUAL_COUNT_UNPARSABLE and OCCURRENCE_STATUS_UNPARSABLE setting them appropriately when data cannot be parsed.

ahahn-gbif commented 4 years ago

Thanks - I also appreciate the conflict flags, as we are bound to have a few false positives caused by database default values and the like. A consideration on defaults (going back to earlier discussions): should true absence records be an opt-in for data users, i.e. filtered from view by default, and activated only on explicit request, similar to coordinates with known errors? I would expect that to be the most user-friendly option on the assumption that the majority of users would be looking for occurrences, not absences.

MattBlissett commented 4 years ago

We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

MattBlissett commented 4 years ago

It might be useful to have a present or hasPresence or similar filter, in the same way we have hasCoordinate, which summarizes individualCount and occurrenceStatus.

timrobertson100 commented 4 years ago

We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

This is important. If we are to refine this for occurrence data only, removing those terms that are targeting checklist use and possibly adding new ones, we need to:

Create a new enumeration in code
Create a new vocabulary XML in rs.gbif.org
Modify the occurrence and event core schemas to reference the new vocabulary

qgroom commented 4 years ago

We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

This is a horrible vocabulary for this term, because we should not be mixing up presence and absence with abundance. It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

Secondly, absence is only resolvable when there are some spatial and temporal limits. Shouldn't there be a check on eventDate, location and/or country to warn people that there is an unbounded absence? Otherwise, the record sort of means it is absent everywhere and/or for all time.

MattBlissett commented 4 years ago

There was previous discussion about absence records in these issues:

https://github.com/gbif/portal-feedback/issues/1851
https://github.com/gbif/portal16/issues/308 (most discussion, input from Donald and others)
https://github.com/gbif/portal-feedback/issues/1206 -- an observation that without this, AdHoc maps do not show the same data as processed maps.
https://github.com/gbif/portal-feedback/issues/1405
https://github.com/gbif/occurrence/issues/94

Those are good points, Quentin. Is there a term for recording abundance? I can't see one. The vast majority of data gives present/absent, but there is some giving abundance.

These are the verbatim values we have for occurrenceStatus with frequency > 1000:

occurrenceStatus	count
\N	1047300464
present	190140103
Present	88470094
Présent	69978620
absent	10433299
P	1091847
Q	774046
Ne Sait Pas	321113
confirmed breeding	284635
established	256481
Presente	223434
stocked	215235
unknown	79623
complet	75906
presence	69363
Rare 1-4	56083
Presence	49337
probable breeding	48533
incomplet	43683
NA	41735
possible breeding	31631
Common 5-19	28557
Absent	26417
Confirmed Present	24652
Confirmed Breeding	20403
Abundant 20-99	20175
doubtful	19980
Possibly Breeding	17800
Común	17170
Probably Breeding	16677
1	15406
irregular	12661
Common	12554
rare	11010
Very abundant 100-499	7913
Occasional	7602
Abundant	7198
Rare (p < 1%)	6525
collected	6492
probably breeding	6402
possibly breeding	5863
Rare	5803
Present (1% <= p < 5%)	4940
Average Cover: 1-5% Maximum Cover: 1-5%	4127
unclear breeding certaint	3453
Very very abundant > 500	2813
Non observé	2710
Песня, голос	2378
Average Cover: 1-5% Maximum Cover: 6-25%	2015
Common (5% <= p < 10%)	1960
NT	1847
Dominant (20% <= p)	1274
Observed in Breeding Season	1266
Abundant (10% <= p < 20%)	1220
Average Cover: 76-95% Maximum Cover: 96-100%	1135
Reported	1084
Ausente	1055
Damaged	1051
Визуально	1003
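Many of the values above differ only in case, accents or language. As a minimal sketch of how such variants could be folded onto PRESENT and ABSENT, assuming a small hand-written lookup: the `_LOOKUP` entries are an illustrative handful taken from the table, and `normalize` and `parse_occurrence_status` are hypothetical names, not the GBIF parser or its dictionary.

```python
import unicodedata
from typing import Optional

# Hypothetical lookup keyed by normalized verbatim value. Only a few
# unambiguous values from the table are included, for illustration.
_LOOKUP = {
    "present": "PRESENT",
    "presente": "PRESENT",
    "presence": "PRESENT",
    "confirmed present": "PRESENT",
    "absent": "ABSENT",
    "ausente": "ABSENT",
}


def normalize(verbatim: str) -> str:
    """Strip accents, fold case and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", verbatim)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def parse_occurrence_status(verbatim: Optional[str]) -> Optional[str]:
    """Return PRESENT or ABSENT, or None when missing or not recognized."""
    if verbatim is None or verbatim.strip() in ("", "\\N"):
        return None
    return _LOOKUP.get(normalize(verbatim))
```

With this lookup, "present", "Present" and "Présent" all resolve to PRESENT, while abundance-style values such as "Rare 1-4" stay unresolved and would need an explicit decision.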

timrobertson100 commented 4 years ago

It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

Thanks for raising this. If you look at the data you'll also find attempts to convey things like invasive, threatened etc which would be better elsewhere too.

Secondly, absence is only resolvable when there are some spatial and temporal limits. Shouldn't there be a check on eventDate, location and/or country to warn people that there is an unbounded absence? Otherwise, the record sort of means it is absent everywhere and/or for all time.

The suggestion to add a flag for UNBOUNDED_ABSENCE seems sensible and pragmatic. I'm mindful that modeling absence can become more complex (e.g. quantifying likelihood of observation) which shouldn't be a restriction to improving usage of presence data.

Is there a term for recording abundance?

individualCount, organismQuantity and organismQuantityType?
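The UNBOUNDED_ABSENCE check discussed in this comment could be sketched roughly as below. This is only an illustration under stated assumptions: the record is a plain dictionary of interpreted Darwin Core terms, an absence counts as bounded only when it has both a spatial term (coordinates, locality or country) and an eventDate, and the function names are hypothetical. The thread does not settle which combination of terms is required.

```python
from typing import Mapping, Optional, Set


def _filled(record: Mapping[str, Optional[str]], term: str) -> bool:
    """True when the term is present and not blank."""
    value = record.get(term)
    return value is not None and str(value).strip() != ""


def absence_flags(record: Mapping[str, Optional[str]]) -> Set[str]:
    """Flag absence records that lack spatial or temporal limits."""
    if record.get("occurrenceStatus") != "ABSENT":
        return set()

    has_coordinates = _filled(record, "decimalLatitude") and _filled(
        record, "decimalLongitude"
    )
    # Assumption: any one of these is enough to bound the absence in space.
    bounded_in_space = (
        has_coordinates or _filled(record, "locality") or _filled(record, "country")
    )
    bounded_in_time = _filled(record, "eventDate")

    if bounded_in_space and bounded_in_time:
        return set()
    return {"UNBOUNDED_ABSENCE"}
```

Requiring both limits is the stricter reading of "eventDate, location and/or country"; relaxing it to either one would only change the final condition.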

MortenHofft commented 4 years ago

It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

@qgroom I understand present and absent, but what does doubtful and excluded mean for an individual occurrence?

albenson-usgs commented 4 years ago

Exciting! I hope it will be very clear to users that absence data ARE available. Sounds like it will be but just want to make sure. The P and Q are me (from a time before I was officially in charge of OBIS-USA), I'll make sure to get those corrected.

peterdesmet commented 4 years ago

Completely agree with what @timrobertson100 (how to parse it + flags) and @ahahn-gbif (exclude absences from views by default) suggest. Some notes:

1. Some datasets provide organismQuantity and not individualCount. Will this be rolled into individualCount before assessment of individualCount = 0?
2. Some datasets provide occurrenceStatus = absent (and variations), but not individualCount = 0. Will occurrenceStatus = ABSENT be set for those? Is a flag needed?
3. To allow differentiation of INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS and OCCURRENCE_STATUS_UNPARSABLE you will probably need to map the most frequently occurring occurrenceStatus values that exist in the wild to either ABSENT or PRESENT (or mark them as not parsable)?

ahahn-gbif commented 4 years ago

+1 for 1. - we should indeed look at organismQuantity as well

Some datasets provide occurrenceStatus = absent (and variations), but not individualCount = 0. Will occurrenceStatus = ABSENT be set for those? Is a flag needed?

If the individualCount is not 0, but NULL, occurrenceStatus = ABSENT is plausible.

We also have the opposite case, where individualCount = 0, but occurrenceStatus = PRESENT. In these cases, I would value occurrenceStatus over individualCount, e.g. assuming a database or import default value, and maintain occurrenceStatus = PRESENT, suggesting individualCount likely = NULL

If the individualCount is not 0, but an actual (positive) value, and the occurrenceStatus = ABSENT, the flag would indeed make good sense - we will want to resolve that with publishers

albenson-usgs commented 4 years ago

individualCount = 0, but occurrenceStatus = PRESENT. In these cases, I would value occurrenceStatus over individualCount, e.g. assuming a database or import default value, and maintain occurrenceStatus = PRESENT, suggesting individualCount likely = NULL

For the datasets I work with, this would not be a good assumption to make. Usually the individualCount is included first and the occurrenceStatus is created based on the individualCount or organismQuantity so if individualCount = 0 but occurrenceStatus = PRESENT it means something went wrong in the code to create occurrenceStatus.

ahahn-gbif commented 4 years ago

Usually the individualCount is included first and the occurrenceStatus is created based on the individualCount or organismQuantity

Thanks, good point, I hadn't considered that. In that case, I agree it should get the same conflict flag as @peterdesmet suggested under point 3

peterdesmet commented 4 years ago

I agree, I would also prioritize individualCount over occurrenceStatus. Trying to summarize:

individualCount	occurrenceStatus	inferred occurrenceStatus	flag
NULL	NULL	PRESENT
NULL	present*	PRESENT
NULL	absent*	ABSENT
NULL	rubbish	PRESENT	OCCURRENCE_STATUS_UNPARSABLE
>0	NULL	PRESENT	OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT
>0	present*	PRESENT
>0	absent*	ABSENT	INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS
>0	rubbish	PRESENT	OCCURRENCE_STATUS_UNPARSABLE, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT
0	NULL	ABSENT	OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT
0	present*	PRESENT	INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS
0	absent*	ABSENT
0	rubbish	ABSENT	OCCURRENCE_STATUS_UNPARSABLE, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT
rubbish	NULL	PRESENT	INDIVIDUAL_COUNT_UNPARSABLE
rubbish	present*	PRESENT	INDIVIDUAL_COUNT_UNPARSABLE
rubbish	absent*	ABSENT	INDIVIDUAL_COUNT_UNPARSABLE
rubbish	rubbish	PRESENT	INDIVIDUAL_COUNT_UNPARSABLE, OCCURRENCE_STATUS_UNPARSABLE

* = or similar values
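The table above can be read as a small decision function. The following is a sketch only, not the pipelines implementation: the function name, the tiny `_STATUS` lookup and the treatment of negative or non-integer counts as "rubbish" are assumptions made for illustration. It keeps a supplied status when it parses, infers from individualCount only when the status is missing or unparsable, and otherwise falls back to PRESENT.

```python
from typing import Optional, Set, Tuple

PRESENT, ABSENT = "PRESENT", "ABSENT"

# Hypothetical, deliberately tiny status parser; a real one would accept
# many more verbatim values.
_STATUS = {"present": PRESENT, "absent": ABSENT}


def _blank(value: Optional[str]) -> bool:
    return value is None or value.strip() == ""


def interpret_occurrence_status(
    individual_count: Optional[str], occurrence_status: Optional[str]
) -> Tuple[str, Set[str]]:
    """Return (interpreted occurrenceStatus, flags) following the table."""
    flags: Set[str] = set()

    # individualCount: None below means missing or unparsable.
    count: Optional[int] = None
    if not _blank(individual_count):
        try:
            count = int(individual_count.strip())
        except ValueError:
            count = None
        # Assumption: negative counts are treated as unparsable.
        if count is None or count < 0:
            count = None
            flags.add("INDIVIDUAL_COUNT_UNPARSABLE")

    # occurrenceStatus: None below means missing or unparsable.
    status: Optional[str] = None
    if not _blank(occurrence_status):
        status = _STATUS.get(occurrence_status.strip().casefold())
        if status is None:
            flags.add("OCCURRENCE_STATUS_UNPARSABLE")

    if status is not None:
        # A supplied, parsable status is kept; a conflict is only flagged.
        if count is not None and (count == 0) != (status == ABSENT):
            flags.add("INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS")
        return status, flags

    if count is not None:
        flags.add("OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT")
        return (ABSENT if count == 0 else PRESENT), flags

    # Nothing usable: fall back to the default.
    return PRESENT, flags
```

For example, `interpret_occurrence_status("0", "present")` keeps PRESENT and raises only the conflict flag, which matches the corresponding row of the table.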

albenson-usgs commented 4 years ago

@peterdesmet I'm not understanding why this one would be flagged:

0 | absent* | ABSENT | OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT

I would think that one shouldn't get a flag since things are all in agreement?

peterdesmet commented 4 years ago

@albenson-usgs it's a choice 🤷‍♂️: behind the scenes I would always infer from individualCount if that is available and not rubbish, but you could opt not to indicate it as such (OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT) if everything is in agreement.

timrobertson100 commented 4 years ago

I'm afraid I'd disagree.

I would propose only inferring if required, otherwise what is the point of the field? This would be similar to how we handle others, e.g. decimalLatitude, decimalLongitude and country, where country is only inferred if it is null or needs to be changed, in which case other information is added explaining why.

Therefore I'd suggest:

individualCount	occurrenceStatus	interpreted occurrenceStatus	flag
0	absent*	ABSENT

peterdesmet commented 4 years ago

That's fine by me (have adapted in table above). But we still choose individualCount over occurrenceStatus if those disagree?

timrobertson100 commented 4 years ago

But we still choose individualCount over occurrenceStatus if those disagree?

Do you mean choose how to populate the interpreted occurrenceStatus field? I would suggest:

interpret the value supplied
infer ABSENCE if it is NULL but other fields (individualCount or organismQuantity) imply that, or assume PRESENT (adding a flag to state as much)
flag the record if there is conflicting information (e.g. individualCount=666 occurrenceStatus=ABSENT)

MortenHofft commented 4 years ago

But we still choose individualCount over occurrenceStatus if those disagree?

In support of Tim above (I think)

Normally we respect values that are there, but flag them as odd if they are in conflict with other values. E.g. coordinates: [in Paraguay] and country: Brazil would keep both country and coordinates but get an issue flag. If the country was missing it would be filled as Paraguay.

I would argue we do the same for individualCount, organismQuantity and occurrenceStatus. We infer occurrenceStatus if missing, but if it is provided, we do not mess with it (despite conflicts with other fields); instead we add issue flags.

peterdesmet commented 4 years ago

@MortenHofft that makes sense, but does this mean that individualCount: 0 + occurrenceStatus: present will be interpreted as a PRESENT occurrence (and shown on maps etc.)?

MortenHofft commented 4 years ago

but does this mean that individualCount: 0 + occurrenceStatus: present will be interpreted as a PRESENT occurrence (and shown on maps etc.)

Yes. Just like we show null island, despite it probably being faulty data. If we consider it particularly critical we can add an extra warning like we do on maps.

~~It isn't that I want to make it difficult for users, I just think we will be in more trouble if we start to rewrite data. In time I'd rather that~~

Publishers fix data with conflicts

We update/improve/fix issues with the data validator (for pre publishing reports)

We add default values (or custom overwrites) for dataset on a case-by-case basis

We allow negations on issue filters

Make quality filters/reports a more prominent feature/filter in the UI

Allow community annotation/flagging

We provide clearer guidance on how the fields are to be used

~~It is a lot more work though :)~~

Here and now I like what MattBlissett mentions. Adding something similar to has_coordinate + has_geospatial_issue that filters away those cases we consider critical - for the UI that might be the best option? But those are GBIF specific flags for easy filtering, without changing incoming data.

mdoering commented 4 years ago

We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml

This is a horrible vocabulary for this term, because we should not be mixing up presence and absence with abundance. It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.

I would argue the current occurrenceStatus vocabulary is more of an abundance vocabulary than a simple boolean. More like ACFOR: https://en.wikipedia.org/wiki/Abundance_(ecology), but including doubtful, absent & excluded.

I would prefer to create a new distribution status vocabulary to be used for species distribution checklists and shrink the existing occurrenceStatus one to be just present and absent like DwC suggests. It's probably also safer to change the distribution extension to point to a new vocabulary than changing the occurrence core to point to a new one.

albenson-usgs commented 4 years ago

Quick note just to say that occurrenceStatus is a required term for OBIS and only present or absent are accepted so this falls in line with what's been outlined here.

From the OBIS Manual: occurrenceStatus (required term) is a statement about the presence or absence of a taxon at a location. It is an important term, because it allows us to distinguish between presence and absence records. It is a required term and should be filled in with either present or absent.

peterdesmet commented 4 years ago

https://github.com/gbif/pipelines/issues/268#issuecomment-626581894:

We infer occurrenceStatus if missing, but if it is provided, we do not mess with it (despite conflicts with other fields); instead we add issue flags.

Ok, that is clearer (even though individualCount might have more reliable information, see https://github.com/gbif/pipelines/issues/268#issuecomment-624715027). I have updated my table at https://github.com/gbif/pipelines/issues/268#issuecomment-624755278 (in italic) to reflect this decision.

MattBlissett commented 4 years ago

I've made the relevant changes in the GBIF schema sandbox, I think exactly as @mdoering suggests.

http://rs.gbif.org/sandbox/vocabulary/gbif/distribution_status_2020-05-13.xml , which is the old occurrence status vocabulary with a new name. It was only used by the distribution extension.
http://rs.gbif.org/sandbox/extension/gbif/1.0/distribution_2020-05-13.xml , which is a new version of the distribution extension, referring to the new location for the vocabulary. Note the term is still dwc:occurrenceStatus.
http://rs.gbif.org/sandbox/vocabulary/gbif/occurrence_status_2020-05-13 which is a new vocabulary for occurrenceStatus, with only the values present and absent
http://rs.gbif.org/sandbox/core/dwc_occurrence_2020-05-13.xml which is a new Occurrence core, using this occurrenceStatus vocabulary. (The previous occurrence core did not define a vocabulary for occurrenceStatus.)

Is that reasonable for everyone?

peterdesmet commented 4 years ago

Hi all, the limited occurrenceStatus vocab with just present and absent for occurrences is an improvement. From the animal tracking community, we have been thinking about using doubtful to indicate outliers in the occurrence data, so that (discussion) might come up again in the future.

timrobertson100 commented 4 years ago

Thanks @peterdesmet

The discussion on the use of doubtful for occurrenceStatus is on this issue.

I propose we don't overload occurrenceStatus with doubtful as it raises the question whether it is doubtfully present or doubtfully absent. A better option may be to model something about the confidence of the assertion in machine-based workflows (and indeed confidence of the identification for all records).

peterdesmet commented 4 years ago

@timrobertson100 true, although outliers can be seen as both doubtfully present and absent. But it indeed has something to do with confidence and might be better explained or processed elsewhere:

confidence about location (for GPS tracking outliers or large coordinateUncertaintyInMeters)
confidence of actually being an occurrence (ghost detections in acoustic telemetry) -> a true doubtfully present
confidence about identification (e.g. in citizen science records)

timrobertson100 commented 4 years ago

I believe the plan here is to document a short spec based on the discussion above, to review before implementing - is that correct please @muttcg ?

muttcg commented 4 years ago

I believe the plan here is to document a short spec based on the discussion above, to review before implementing - is that correct please @muttcg ?

Yes, that is correct; no coding yet. I am refactoring some parts, and we also need to migrate the codebase to the new Jackson gbif-api libs to be able to add new features. After that I will start to document the solution.

muttcg commented 4 years ago

@timrobertson100
1) Use OccurrenceStatus enum from org.gbif.api.vocabulary
2) Add new field to BasicRecord - occurrenceStatus
3) Add new vocabulary file and loader for it
4) Add new interpretation BasicInterpreter.interpretOccurrenceStatus
5) Add BasicInterpreter.interpretOccurrenceStatus step to BasicTransform
6) Add OccurrenceStatus field to es-occurrence-schema.json
7) Add new ES search and Hive visitor type

The only question I have is - Should we use a simple plain text file for the vocabulary or use new vocabulary project libs? Maybe we can use libs and file using the new vocabulary project, but don't include it in the editor.

I will create a dummy BasicInterpreter.interpretOccurrenceStatus implementation, write tests for it (TDD), create a PR to review the results of the tests, and after getting approval I will implement the main logic.

timrobertson100 commented 4 years ago

Thanks @muttcg

Tests capturing the scenarios @peterdesmet compiled make sense to me, paying attention to the various flags raised.

The only question I have is - Should we use a simple plain text file for the vocabulary or use new vocabulary project libs? Maybe we can use libs and file using the new vocabulary project, but don't include it in the editor.

The only thing to be aware of is that the format will likely change, which may be a nuisance. Either would work though, as it is a simple vocabulary.

Edited to add: Now that the vocabulary server is deployed in production for the proof of concept vocabulary, restricted only to a few people, and since this is a super simple vocabulary, I propose we simply edit this vocabulary in the server. @asturcon - would you agree?

marcos-lg commented 4 years ago

I propose we simply edit this vocabulary in the server. @asturcon - would you agree?

Right, agree.

I checked the verbatim values that we currently have for occurrenceStatus and they will map to our vocabulary as shown in this spreadsheet. I took the values that are present in at least 3 datasets or 1000 records. And I only mapped the values where the distinction between present and absent is clear.

I'll import these values into a vocabulary.

MattBlissett commented 4 years ago

We can accept many more values as present, and we should, since the old vocabulary encouraged some of them. I did this a couple of months ago for the old dictionary: https://github.com/gbif/parsers/blob/master/src/main/resources/dictionaries/parse/occurrence_status.tsv

muttcg commented 4 years ago

blocked by #325

timrobertson100 commented 4 years ago

In reviewing the code I spotted an error in the table.

individualCount	occurrenceStatus	inferred occurrenceStatus	flag
0	present*	PRESENT	INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT

Should read

individualCount	occurrenceStatus	inferred occurrenceStatus	flag
0	present*	PRESENT	INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS

We are not inferring presence from an individualCount of 0, but we do want to raise that there is conflict.

MattBlissett commented 4 years ago

I think there's a second, similar error in the table:

individualCount	occurrenceStatus	inferred occurrenceStatus	flag
>0	absent*	ABSENT	INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT

Should instead be:

individualCount	occurrenceStatus	inferred occurrenceStatus	flag
>0	absent*	ABSENT	INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS

peterdesmet commented 4 years ago

Thanks for noticing, I have updated the table

muttcg commented 4 years ago

API and interpretation in production

gbif / pipelines

Improve handling of records declaring absence data #268