Closed timrobertson100 closed 4 years ago
Thanks - I also appreciate the conflict flags, as we are bound to have a few false positives caused by database default values and the like. A consideration on defaults (going back to earlier discussions): should true absence records be an opt-in for data users, i.e. filtered from view by default, and activated only on explicit request, similar to coordinates with known errors? I would expect that to be the most user-friendly option on the assumption that the majority of users would be looking for occurrences, not absences.
We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml
It might be useful to have a `present` or `hasPresence` or similar filter, in the same way we have `hasCoordinate`, which summarizes `individualCount` and `occurrenceStatus`.
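A rough sketch of what such a summarizing filter could look like, assuming each record is a plain dictionary that already carries an interpreted `occurrenceStatus` of `PRESENT` or `ABSENT`. The names `has_presence` and `search` and the `include_absences` parameter are hypothetical and not part of any GBIF API. Absences are hidden unless explicitly requested, in line with the opt-in suggestion above.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def has_presence(record: Record) -> bool:
    """True unless the interpreted status says the taxon was absent."""
    # Assumption: a record without an interpreted status counts as a presence.
    return record.get("occurrenceStatus", "PRESENT") != "ABSENT"


def search(records: Iterable[Record], include_absences: bool = False) -> List[Record]:
    """Return presence records only, unless the caller opts in to absences."""
    return [r for r in records if include_absences or has_presence(r)]
```

A caller who wants absence data would pass `include_absences=True`; everyone else gets presences only.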
We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml
This is important. If we are to refine this for occurrence data only, removing those terms that are targeting checklist use and possibly adding new ones, we need to
We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml
This is a horrible vocabulary for this term, because we should not be mixing up presence and absence with abundance. It would be much easier for everyone if `occurrenceStatus` was just present, absent, doubtful and excluded.
Secondly, absence is only resolvable when there are some spatial and temporal limits. Shouldn't there be a check on eventDate, location and/or country to warn people that there is an unbounded absence? Otherwise, the record sort of means it is absent everywhere and/or for all time.
There was previous discussion about absence records in these issues:
Those are good points, Quentin. Is there a term for recording abundance? I can't see one. The vast majority of data gives present/absent, but there is some giving abundance.
These are the verbatim values we have for occurrenceStatus with frequency > 1000:
occurrenceStatus | count |
---|---|
\N | 1047300464 |
present | 190140103 |
Present | 88470094 |
Présent | 69978620 |
absent | 10433299 |
P | 1091847 |
Q | 774046 |
Ne Sait Pas | 321113 |
confirmed breeding | 284635 |
established | 256481 |
Presente | 223434 |
stocked | 215235 |
unknown | 79623 |
complet | 75906 |
presence | 69363 |
Rare 1-4 | 56083 |
Presence | 49337 |
probable breeding | 48533 |
incomplet | 43683 |
NA | 41735 |
possible breeding | 31631 |
Common 5-19 | 28557 |
Absent | 26417 |
Confirmed Present | 24652 |
Confirmed Breeding | 20403 |
Abundant 20-99 | 20175 |
doubtful | 19980 |
Possibly Breeding | 17800 |
Común | 17170 |
Probably Breeding | 16677 |
1 | 15406 |
irregular | 12661 |
Common | 12554 |
rare | 11010 |
Very abundant 100-499 | 7913 |
Occasional | 7602 |
Abundant | 7198 |
Rare (p < 1%) | 6525 |
collected | 6492 |
probably breeding | 6402 |
possibly breeding | 5863 |
Rare | 5803 |
Present (1% <= p < 5%) | 4940 |
Average Cover: 1-5% Maximum Cover: 1-5% | 4127 |
unclear breeding certaint | 3453 |
Very very abundant > 500 | 2813 |
Non observé | 2710 |
Песня, голос | 2378 |
Average Cover: 1-5% Maximum Cover: 6-25% | 2015 |
Common (5% <= p < 10%) | 1960 |
NT | 1847 |
Dominant (20% <= p) | 1274 |
Observed in Breeding Season | 1266 |
Abundant (10% <= p < 20%) | 1220 |
Average Cover: 76-95% Maximum Cover: 96-100% | 1135 |
Reported | 1084 |
Ausente | 1055 |
Damaged | 1051 |
Визуально | 1003 |
It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.
Thanks for raising this. If you look at the data you'll also find attempts to convey things like invasive, threatened etc which would be better elsewhere too.
Secondly, absence is only resolvable when there are some spatial and temporal limits. Shouldn't there be a check on eventDate, location and/or country to warn people there is an unbounded absence. Otherwise, the record sort of means it is absent everywhere and/or for all time.
The suggestion to add a flag for `UNBOUNDED_ABSENCE` seems sensible and pragmatic. I'm mindful that modeling absence can become more complex (e.g. quantifying likelihood of observation) which shouldn't be a restriction to improving usage of presence data.
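As a minimal sketch of how an `UNBOUNDED_ABSENCE` check might work: the function below flags an absence record that lacks either a temporal bound or a spatial bound. Requiring both bounds, the helper names, and the choice of `locality` and coordinates as spatial evidence are assumptions for illustration only; the thread just mentions eventDate, location and/or country.

```python
from typing import Dict, Optional, Set


def _blank(value: Optional[str]) -> bool:
    """True when a verbatim field is missing or only whitespace."""
    return value is None or value.strip() == ""


def absence_flags(record: Dict[str, Optional[str]]) -> Set[str]:
    """Return {'UNBOUNDED_ABSENCE'} for absences without time and space limits."""
    flags: Set[str] = set()
    if record.get("occurrenceStatus") != "ABSENT":
        return flags  # the check only applies to absence records

    has_time = not _blank(record.get("eventDate"))
    has_coordinates = not _blank(record.get("decimalLatitude")) and not _blank(
        record.get("decimalLongitude")
    )
    has_space = (
        has_coordinates
        or not _blank(record.get("country"))
        or not _blank(record.get("locality"))
    )
    # Assumption: an absence needs both a temporal and a spatial bound.
    if not (has_time and has_space):
        flags.add("UNBOUNDED_ABSENCE")
    return flags
```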
Is there a term for recording abundance?
It would be much easier for everyone if occurrenceStatus was just present, absent, doubtful and excluded.
@qgroom I understand `present` and `absent`, but what do `doubtful` and `excluded` mean for an individual occurrence?
Exciting! I hope it will be very clear to users that absence data ARE available. Sounds like it will be but just want to make sure. The P and Q are me (from a time before I was officially in charge of OBIS-USA), I'll make sure to get those corrected.
Completely agree with what @timrobertson100 (how to parse it + flags) and @ahahn-gbif (exclude absences from views by default) suggest. Some notes:
1. Some datasets provide `organismQuantity` and not `individualCount`. Will this be rolled into `individualCount` before assessment of `individualCount = 0`?
2. Some datasets provide `occurrenceStatus = absent` (and variations), but not `individualCount = 0`. Will `occurrenceStatus = ABSENT` be set for those? Is a flag needed?
3. For `INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS` and `OCCURRENCE_STATUS_UNPARSABLE` you will probably need to process the most occurring `occurrenceStatus` values that exist in the wild to either `ABSENT`, `PRESENT` (or not able to parse)?

+1 for 1. - we should indeed look at organismQuantity as well
- Some datasets provide occurrenceStatus = absent (and variations), but not individualCount = 0. Will occurrenceStatus = ABSENT be set for those? Is a flag needed?
- If the individualCount is not 0, but NULL, occurrenceStatus = ABSENT is plausible.

We also have the opposite case, where:
- individualCount = 0, but occurrenceStatus = PRESENT. In these cases, I would value occurrenceStatus over individualCount, e.g. assuming a database or import default value, and maintain occurrenceStatus = PRESENT, suggesting individualCount likely = NULL
- If the individualCount is not 0, but an actual (positive) value, and the occurrenceStatus = ABSENT, the flag would indeed make good sense - we will want to resolve that with publishers
- individualCount = 0, but occurrenceStatus = PRESENT. In these cases, I would value occurrenceStatus over individualCount, e.g. assuming a database or import default value, and maintain occurrenceStatus = PRESENT, suggesting individualCount likely = NULL
For the datasets I work with, this would not be a good assumption to make. Usually the individualCount is included first and the occurrenceStatus is created based on the individualCount or organismQuantity, so if individualCount = 0 but occurrenceStatus = PRESENT, it means something went wrong in the code that creates occurrenceStatus.
Usually the individualCount is included first and the occurrenceStatus is created based on the individualCount or organismQuantity
Thanks, good point, I hadn't considered that. In that case, I agree it should get the same conflict flag as @peterdesmet suggested under point 3
I agree, I would also prioritize `individualCount` over `occurrenceStatus`. Trying to summarize:
individualCount | occurrenceStatus | inferred occurrenceStatus | flag |
---|---|---|---|
NULL | NULL | PRESENT | |
NULL | present* | PRESENT | |
NULL | absent* | ABSENT | |
NULL | rubbish | PRESENT | OCCURRENCE_STATUS_UNPARSABLE |
>0 | NULL | PRESENT | OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |
>0 | present* | PRESENT | |
>0 | absent* | ABSENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS |
>0 | rubbish | PRESENT | OCCURRENCE_STATUS_UNPARSABLE, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |
0 | NULL | ABSENT | OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |
0 | present* | PRESENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS |
0 | absent* | ABSENT | |
0 | rubbish | ABSENT | OCCURRENCE_STATUS_UNPARSABLE, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |
rubbish | NULL | PRESENT | INDIVIDUAL_COUNT_UNPARSABLE |
rubbish | present* | PRESENT | INDIVIDUAL_COUNT_UNPARSABLE |
rubbish | absent* | ABSENT | INDIVIDUAL_COUNT_UNPARSABLE |
rubbish | rubbish | PRESENT | INDIVIDUAL_COUNT_UNPARSABLE, OCCURRENCE_STATUS_UNPARSABLE |
`*` = or similar values
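The table above can be sketched as a single function. This is an illustrative reading of the table only, not the pipelines implementation: the `interpret` and `parse_count` names, the two-entry `STATUS_LOOKUP`, and the treatment of negative or non-integer counts as unparsable are all assumptions.

```python
from typing import Optional, Set, Tuple

PRESENT, ABSENT = "PRESENT", "ABSENT"

# Hypothetical, deliberately tiny lookup; real data needs many more variants.
STATUS_LOOKUP = {"present": PRESENT, "absent": ABSENT}


def parse_count(raw: Optional[str]) -> Tuple[Optional[int], bool]:
    """Return (count, parsable); a missing value is parsable but has no count."""
    if raw is None or raw.strip() == "":
        return None, True
    try:
        count = int(raw.strip())
    except ValueError:
        return None, False
    # Assumption: negative counts are treated as unparsable.
    return (count, True) if count >= 0 else (None, False)


def interpret(individual_count: Optional[str],
              occurrence_status: Optional[str]) -> Tuple[str, Set[str]]:
    """Return (interpreted occurrenceStatus, flags) following the table."""
    flags: Set[str] = set()
    count, count_ok = parse_count(individual_count)
    if not count_ok:
        flags.add("INDIVIDUAL_COUNT_UNPARSABLE")

    status = None
    if occurrence_status is not None and occurrence_status.strip():
        status = STATUS_LOOKUP.get(occurrence_status.strip().lower())
        if status is None:
            flags.add("OCCURRENCE_STATUS_UNPARSABLE")

    if status is not None:
        # A supplied, parsable status is kept; disagreement is only flagged.
        if count is not None and (count == 0) != (status == ABSENT):
            flags.add("INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS")
        return status, flags

    if count is not None:
        # No usable status: infer it from the count and say so.
        flags.add("OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT")
        return (ABSENT if count == 0 else PRESENT), flags

    return PRESENT, flags  # nothing usable: default to PRESENT
```

For example, `interpret("0", "present")` keeps `PRESENT` and raises only the conflict flag, while `interpret("0", None)` yields `ABSENT` with the inferred flag.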
@peterdesmet I'm not understanding why this one would be flagged:
0 | absent* | ABSENT | OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT
I would think that one shouldn't get a flag since things are all in agreement?
@albenson-usgs it's a choice 🤷♂️: behind the scenes I would always infer from `individualCount` if that is available and not rubbish, but you could opt not to indicate it as such (`OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT`) if everything is in agreement.
I'm afraid I'd disagree.
I would propose only inferring if required, otherwise what is the point of the field? This would be similar to how we handle others, e.g. `decimalLatitude`, `decimalLongitude` and `country`, where country is only inferred if it is `null` or needs to be changed, in which case other information is added explaining why.
Therefore I'd suggest:
individualCount | occurrenceStatus | interpreted occurrenceStatus | flag |
---|---|---|---|
0 | absent* | ABSENT | |
That's fine by me (have adapted in table above). But we still choose `individualCount` over `occurrenceStatus` if those disagree?
But we still choose individualCount over occurrenceStatus if those disagree?
Do you mean choose how to populate the interpreted `occurrenceStatus` field? I would suggest:
But we still choose individualCount over occurrenceStatus if those disagree?
In support of Tim above (I think)
Normally we respect values that are there, but flag them as odd if they are in conflict with other values. E.g. `coordinates: [in Paraguay]` and `country: Brazil` would keep both country and coordinates but get an issue flag. If the country was missing it would be filled as `Paraguay`.
I would argue we do the same for individualCount, organismQuantity and occurrenceStatus. We infer occurrenceStatus if missing, but if it is provided, we do not mess with it (despite conflicts with other fields); instead we add issue flags.
@MortenHofft that makes sense, but does this mean that `individualCount: 0` + `occurrenceStatus: present` will be interpreted as a PRESENT occurrence (and shown on maps etc.)?
but does this mean that individualCount: 0 + occurrenceStatus: present will be interpreted as a PRESENT occurrence (and shown on maps etc.)
Yes. Just like we show null island, despite it probably being faulty data. If we consider it particularly critical we can add an extra warning like we do on maps.
It isn't that I want to make it difficult for users, I just think we will be in more trouble if we start to rewrite data. In time I'd rather that
It is a lot more work though :)
Here and now I like what MattBlissett mentions. Adding something similar to `has_coordinate` + `has_geospatial_issue` that filters away those cases we consider critical - for the UI that might be the best option? But those are GBIF specific flags for easy filtering, without changing incoming data.
We have an actual vocabulary for occurrenceStatus: http://rs.gbif.org/vocabulary/gbif/occurrence_status.xml
This is a horrible vocabulary for this term, because we should not be mixing up presence and absence with abundance. It would be much easier for everyone if `occurrenceStatus` was just present, absent, doubtful and excluded.
I would argue the current occurrenceStatus vocabulary is more of an abundance vocabulary than a simple boolean. More like ACFOR: https://en.wikipedia.org/wiki/Abundance_(ecology), but including doubtful, absent & excluded.
I would prefer to create a new distribution status vocabulary to be used for species distribution checklists and shrink the existing occurrenceStatus one to be just present and absent like DwC suggests. It's probably also safer to change the distribution extension to point to a new vocabulary than changing the occurrence core to point to a new one.
Quick note just to say that occurrenceStatus is a required term for OBIS and only present or absent are accepted so this falls in line with what's been outlined here.
From the OBIS Manual: occurrenceStatus (required term) is a statement about the presence or absence of a taxon at a location. It is an important term, because it allows us to distinguish between presence and absence records. It is a required term and should be filled in with either present or absent.
https://github.com/gbif/pipelines/issues/268#issuecomment-626581894:
We infer occurrenceStatus if missing, but if it is provided, we do not mess with it (despite conflicts with other fields); instead we add issue flags.
Ok, that is clearer (even though individualCount might have more reliable information, see https://github.com/gbif/pipelines/issues/268#issuecomment-624715027). I have updated my table at https://github.com/gbif/pipelines/issues/268#issuecomment-624755278 (in italic) to reflect this decision.
I've made the relevant changes in the GBIF schema sandbox, I think exactly as @mdoering suggests.
- `dwc:occurrenceStatus`
- `present` and `absent`
Is that reasonable for everyone?
Hi all, the limited occurrenceStatus vocab with just `present` and `absent` for occurrences is an improvement. From the animal tracking community, we have been thinking about using `doubtful` to indicate outliers in the occurrence data, so that (discussion) might come up again in the future.
Thanks @peterdesmet
The discussion on the use of `doubtful` for `occurrenceStatus` is on this issue.
I propose we don't overload `occurrenceStatus` with `doubtful` as it raises the question whether it is doubtfully present or doubtfully absent. A better option may be to model something about the confidence of the assertion in machine-based workflows (and indeed confidence of the identification for all records).
@timrobertson100 true, although outliers can be seen as both doubtfully present and absent. But indeed has something to do with confidence and might be better explained or processed elsewhere:
doubtfully present
I believe the plan here is to document a short spec based on the discussion above, to review before implementing - is that correct please @muttcg ?
I believe the plan here is to document a short spec based on the discussion above, to review before implementing - is that correct please @muttcg ?
Yes, that is correct. No coding yet: I am refactoring some parts, and we also need to migrate the codebase to the new Jackson gbif-api libs to be able to add new features. After that I will start to document the solution.
@timrobertson100
1) Use the OccurrenceStatus enum from org.gbif.api.vocabulary
2) Add a new field to BasicRecord: occurrenceStatus
3) Add a new vocabulary file and a loader for it
4) Add a new interpretation, BasicInterpreter.interpretOccurrenceStatus
5) Add the BasicInterpreter.interpretOccurrenceStatus step to BasicTransform
6) Add an OccurrenceStatus field to es-occurrence-schema.json
7) Add a new ES search and Hive visitor type
The only question I have is - Should we use a simple plain text file for the vocabulary or use new vocabulary project libs? Maybe we can use libs and file using the new vocabulary project, but don't include it in the editor.
I will create a dummy BasicInterpreter.interpretOccurrenceStatus implementation, write tests for it (TDD), create a PR to review the results of the tests, and after getting approval I will implement the main logic.
Thanks @muttcg
Tests capturing the scenarios @peterdesmet compiled make sense to me, paying attention to the various flags raised.
The only question I have is - Should we use a simple plain text file for the vocabulary or use new vocabulary project libs? Maybe we can use libs and file using the new vocabulary project, but don't include it in the editor.
The only thing to be aware of is that format will likely change, which may be a nuisance. Either would work though, as it is a simple vocabulary.
Edited to add: Now that the vocabulary server is deployed in production for the proof of concept vocabulary, restricted only to a few people, and since this is a super simple vocabulary, I propose we simply edit this vocabulary in the server. @asturcon - would you agree?
I propose we simply edit this vocabulary in the server. @asturcon - would you agree?
Right, agree.
I checked the verbatim values that we currently have for `occurrenceStatus` and they will map to our vocabulary as shown in this spreadsheet. I took the values that are present in at least 3 datasets or 1000 records. And I only mapped the values where the distinction between `present` and `absent` is clear.
I'll import these values into a vocabulary.
We can accept many more values as `present`, and we should, since the old vocabulary encouraged some of them. I did this a couple of months ago for the old dictionary: https://github.com/gbif/parsers/blob/master/src/main/resources/dictionaries/parse/occurrence_status.tsv
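To illustrate the kind of mapping discussed here, a small sketch that normalizes case, whitespace and accents before looking a verbatim value up. The `normalize` and `map_status` helpers and the `VERBATIM_TO_VOCAB` table are hypothetical; the entries are an illustrative subset of the verbatim values listed earlier in the thread, not the contents of the spreadsheet or of the dictionary file linked above.

```python
import unicodedata
from typing import Optional


def normalize(value: str) -> str:
    """Trim, case-fold and strip accents so 'Présent' matches 'present'."""
    decomposed = unicodedata.normalize("NFKD", value.strip().casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Illustrative subset only; values whose meaning is unclear are left unmapped.
VERBATIM_TO_VOCAB = {
    "present": "PRESENT",
    "presente": "PRESENT",
    "presence": "PRESENT",
    "confirmed present": "PRESENT",
    "absent": "ABSENT",
    "ausente": "ABSENT",
}


def map_status(verbatim: Optional[str]) -> Optional[str]:
    """Return the vocabulary concept, or None when the value is not mapped."""
    if verbatim is None:
        return None
    return VERBATIM_TO_VOCAB.get(normalize(verbatim))
```

Values such as `Rare 1-4` come back as `None` here, which a caller could translate into an unparsable flag.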
blocked by #325
In reviewing the code I spotted an error in the table.
individualCount | occurrenceStatus | inferred occurrenceStatus | flag |
---|---|---|---|
0 | present* | PRESENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |
Should read
individualCount | occurrenceStatus | inferred occurrenceStatus | flag |
---|---|---|---|
0 | present* | PRESENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS |
We are not inferring presence from an `individualCount` of 0, but we do want to raise that there is conflict.
I think there's a second, similar error in the table:
individualCount | occurrenceStatus | inferred occurrenceStatus | flag |
---|---|---|---|
>0 | absent* | ABSENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS, OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT |
Should instead be:
individualCount | occurrenceStatus | inferred occurrenceStatus | flag |
---|---|---|---|
>0 | absent* | ABSENT | INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS |
Thanks for noticing, I have updated the table
API and interpretation in production
Some datasets provide evidence of species absences. While this can be a difficult area to accommodate properly as modeling effort and confidence are required, there is a lot we can do to improve the current situation where consumers are given the burden of interpreting the data shared. In some cases, consumers will not even have enough information to detect this and will use absence records as presence records.
I propose we introduce the following:
- `occurrenceStatus` in the occurrence search and download API and then expose it on the web site. We should review the data to determine if the current vocabulary is reasonable for the observed use in data. Where `individualCount` states 0 we should set `occurrenceStatus = ABSENT` if it is NULL and add a flag `OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT` to true. If `occurrenceStatus` is NULL we set it to PRESENT as a sensible default.
- A flag `INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS`, setting it to `true` when the count is zero but the status declares it exists (could be several values) or when the count is >0 and it is declared as absent.
- Flags `INDIVIDUAL_COUNT_UNPARSABLE` and `OCCURRENCE_STATUS_UNPARSABLE`, setting them appropriately when data cannot be parsed.