AtlasOfLivingAustralia / la-pipelines

Living Atlas Pipelines extensions

Odd values in `stateConservation` field #480

Closed djtfmartin closed 3 years ago

djtfmartin commented 3 years ago

Some odd values are making it through in the stateConservation field.

[Screenshot: Screen Shot 2021-07-07 at 7.45.40 pm]

Need to check the values extracted from lists.ala.org.au.

peggynewman commented 3 years ago

Most of these are now gone. They disappeared last week, which caused the index checks to fail, and a manual check confirmed that most of the problem values were no longer present. We couldn't find them in the original DwCAs.

peggynewman commented 3 years ago

Don't get me wrong, some of the values are still terrible: [image]

javier-molina commented 3 years ago

Needs more investigation; the values come from the list tool, not the DwCA.

peggynewman commented 3 years ago

This is a record returned by selecting the "NSW Dept of Planning, Industry and Environment" value on the stateConservation facet:

https://biocache.ala.org.au/occurrences/e6407f6f-1c86-4f87-94d8-f96cd1b33103

Note that the stateConservation value in the record is:

State conservation New South Wales: Industry and Environment

which looks suspiciously similar to the value in the "Owner institution code" field.

djtfmartin commented 3 years ago

The odd values don't appear to come from the lists:

scala> spark.sqlContext.sql("CREATE TEMPORARY VIEW SPECIES_LIST USING avro OPTIONS (path \"hdfs://aws-spark-quoll-master.ala:9000/pipelines-species/species-lists/species-lists.avro\")")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sqlContext.sql("SELECT distinct status FROM SPECIES_LIST order by status").show(300, false) 
+-------------------------------------------------------------------------------------------------------------------------------------+
|status                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------+
|null                                                                                                                                 |
|                                                                                                                                     |
|#N/A                                                                                                                                 |
|(Introduced)                                                                                                                         |
|(not listed)                                                                                                                         |
|Conservation Dependent                                                                                                               |
|Conservation dependant                                                                                                               |
|Critically Endangered                                                                                                                |
|Critically endangered                                                                                                                |
|Critically endangered and Migratory birds protected under international agreement                                                    |
|Data Deficient                                                                                                                       |
|Endangered                                                                                                                           |
|Endangered and Migratory birds protected under international agreement                                                               |
|Extinct                                                                                                                              |
|Extinct in the N.T.                                                                                                                  |
|Extinct in the Wild                                                                                                                  |
|Extinct in the wild                                                                                                                  |
|INFRA                                                                                                                                |
|International                                                                                                                        |
|Least Concern                                                                                                                        |
|Least Concern/Unknown                                                                                                                |
|Migratory birds protected under international agreement                                                                              |
|Migratory birds protected under international agreement and Priority 4: Rare, Near Threatened and other species in need of monitoring|
|Migratory birds protected under international agreement, vulnerable at subspecies                                                    |
|Near Threatened                                                                                                                      |
|Not Evaluated                                                                                                                        |
|Other specially protected                                                                                                            |
|Presumed Extinct                                                                                                                     |
|Presumed extinct                                                                                                                     |
|Priority 1: Poorly-known species                                                                                                     |
|Priority 2: Poorly-known species                                                                                                     |
|Priority 3: Poorly-known species                                                                                                     |
|Priority 4: Rare, Near Threatened and other species in need of monitoring                                                            |
|Special Least Concern                                                                                                                |
|Vulnerable                                                                                                                           |
|Vulnerable and Migratory birds protected under international agreement                                                               |
|Vulnerable, Migratory birds protected under international agreement at species level                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------+

scala> spark.sqlContext.sql("SELECT distinct sourceStatus FROM SPECIES_LIST order by sourceStatus").show(300, false) 
+---------------------------------------------------+                           
|sourceStatus                                       |
+---------------------------------------------------+
|null                                               |
|                                                   |
|Conservation dependent                             |
|Critically Endangered                              |
|Critically endangered                              |
|Critically endangered wildlife                     |
|Data Deficient or Delisted                         |
|Endangered                                         |
|Endangered wildlife                                |
|Extinct                                            |
|Extinct in the wild wildlife                       |
|Extinct or Endangered                              |
|Extinct or Endangered (needs further qualification)|
|Extinct wildlife                                   |
|International wildlife                             |
|Least concern wildlife                             |
|Listed under FFG Act                               |
|Near Threatened                                    |
|Near threatened wildlife                           |
|Not listed (needs further qualification)           |
|Poorly Known                                       |
|Presumed Extinct                                   |
|Priority Five                                      |
|Priority Four                                      |
|Priority One                                       |
|Priority Three                                     |
|Priority Two                                       |
|Rare                                               |
|Rare (needs further qualification)                 |
|Rare and Restricted                                |
|Regionally Extinct                                 |
|Special least concern wildlife                     |
|Vulnerable                                         |
|Vulnerable (needs further qualification)           |
|Vulnerable wildlife                                |
|endangered                                         |
|vulnerable                                         |
+---------------------------------------------------+
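
The comparison above was done by eye. As a rough sketch of the same check in code, the snippet below tests whether the odd values reported earlier in this thread occur among the list statuses. The object and method names (`OddValueCheck`, `foundInLists`) are hypothetical, and the status set is a hand-copied subset of the two listings above rather than the full query result, so treat it as an illustration only.

```scala
// Sketch only: values are copied by hand from the listings above.
// `OddValueCheck` and `foundInLists` are hypothetical names, not pipeline code.
object OddValueCheck {
  // A subset of the distinct status / sourceStatus values shown above.
  val listStatuses: Set[String] = Set(
    "Critically Endangered", "Endangered", "Vulnerable", "Extinct",
    "Near Threatened", "Least Concern", "Presumed Extinct", "Rare"
  )

  // Odd stateConservation values reported earlier in this thread.
  val oddValues: Seq[String] = Seq(
    "NSW Dept of Planning, Industry and Environment",
    "Industry and Environment"
  )

  // Returns the odd values that also occur in the list statuses,
  // comparing case-insensitively.
  def foundInLists(odd: Seq[String], statuses: Set[String]): Seq[String] = {
    val lowered = statuses.map(_.toLowerCase)
    odd.filter(v => lowered.contains(v.toLowerCase))
  }
}
```

An empty result from `OddValueCheck.foundInLists(OddValueCheck.oddValues, OddValueCheck.listStatuses)` is consistent with the odd values not originating in the lists. In spark-shell the status set would presumably be built from the query results rather than typed by hand.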

djtfmartin commented 3 years ago

stateConservation values are being supplied by BioNET and WildNet, and this is where the odd values are coming from. Below is a listing from BioNET.

scala> spark.sqlContext.sql("CREATE TEMPORARY VIEW dr368 USING avro OPTIONS (path \"hdfs://....../pipelines-data/dr368/1/verbatim.avro\")")

scala> spark.sqlContext.sql("select coreTerms.`http://unknown.org/stateConservation`, count(*) from dr368 group by coreTerms.`http://unknown.org/stateConservation`").show(false)
+----------------------------------------------+--------+                       
|http://unknown.org/stateConservation          |count(1)|
+----------------------------------------------+--------+
|Endangered                                    |3936    |
|Not Listed                                    |27157   |
|Vulnerable                                    |7942    |
|null                                          |12124218|
|NSW National Parks and Wildlife Service       |3       |
|NSW Dept of Planning, Industry and Environment|582     |
|Endangered Population, Vulnerable             |25      |
|Critically Endangered                         |247     |
|Extinct                                       |5       |
|Endangered Population                         |2       |
+----------------------------------------------+--------+
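
One way to flag the odd verbatim values automatically is to compare each value against a set of accepted status terms. The sketch below does this over the counts from the dr368 listing above (the null row is omitted). The `knownTerms` set is an assumption made for illustration, not an agreed vocabulary, and `VerbatimStatusCheck`, `isRecognised`, and `suspicious` are hypothetical names.

```scala
// Sketch only: counts are copied from the dr368 listing above (null row omitted).
// `VerbatimStatusCheck` and its members are hypothetical names, not pipeline code.
object VerbatimStatusCheck {
  val counts: Map[String, Long] = Map(
    "Endangered" -> 3936L,
    "Not Listed" -> 27157L,
    "Vulnerable" -> 7942L,
    "NSW National Parks and Wildlife Service" -> 3L,
    "NSW Dept of Planning, Industry and Environment" -> 582L,
    "Endangered Population, Vulnerable" -> 25L,
    "Critically Endangered" -> 247L,
    "Extinct" -> 5L,
    "Endangered Population" -> 2L
  )

  // ASSUMPTION: terms treated here as legitimate status vocabulary.
  val knownTerms: Set[String] = Set(
    "Endangered", "Not Listed", "Vulnerable",
    "Critically Endangered", "Extinct", "Endangered Population"
  )

  // A value is accepted when every comma-separated part is a known term,
  // so combined values such as "Endangered Population, Vulnerable" pass.
  def isRecognised(value: String): Boolean =
    value.split(",").map(_.trim).forall(knownTerms.contains)

  // Unrecognised values with their record counts, largest count first.
  def suspicious(counts: Map[String, Long]): Seq[(String, Long)] =
    counts.toSeq.filterNot { case (v, _) => isRecognised(v) }.sortBy(-_._2)
}
```

With the counts above, `VerbatimStatusCheck.suspicious(VerbatimStatusCheck.counts)` returns the two organisation names, which together account for 585 records.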

javier-molina commented 3 years ago

@peggynewman I think you can take this from here. Please let us know otherwise.

javier-molina commented 3 years ago

Issue moved to AtlasOfLivingAustralia/data-management #700 via ZenHub