Closed djtfmartin closed 3 years ago
Most of these values are now gone. They disappeared last week, which caused the index checks to fail; a manual check confirmed that most of the problem values were gone. We couldn't find them in the original DwCAs.
Don't get me wrong, some of the values are still terrible:
Needs more investigation; the values come from the list tool, not the DwCA.
This is a record returned by selecting the "NSW Dept of Planning, Industry and Environment" value on the stateConservation facet:
https://biocache.ala.org.au/occurrences/e6407f6f-1c86-4f87-94d8-f96cd1b33103
Note that the stateConservation value in the record is:

| State conservation | New South Wales: Industry and Environment |
|---|---|
which looks suspiciously similar to the value in Owner institution code.
The odd values don't appear to come from the lists:
scala> spark.sqlContext.sql("CREATE TEMPORARY VIEW SPECIES_LIST USING avro OPTIONS (path \"hdfs://aws-spark-quoll-master.ala:9000/pipelines-species/species-lists/species-lists.avro\")")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sqlContext.sql("SELECT distinct status FROM SPECIES_LIST order by status").show(300, false)
+-------------------------------------------------------------------------------------------------------------------------------------+
|status |
+-------------------------------------------------------------------------------------------------------------------------------------+
|null |
| |
|#N/A |
|(Introduced) |
|(not listed) |
|Conservation Dependent |
|Conservation dependant |
|Critically Endangered |
|Critically endangered |
|Critically endangered and Migratory birds protected under international agreement |
|Data Deficient |
|Endangered |
|Endangered and Migratory birds protected under international agreement |
|Extinct |
|Extinct in the N.T. |
|Extinct in the Wild |
|Extinct in the wild |
|INFRA |
|International |
|Least Concern |
|Least Concern/Unknown |
|Migratory birds protected under international agreement |
|Migratory birds protected under international agreement and Priority 4: Rare, Near Threatened and other species in need of monitoring|
|Migratory birds protected under international agreement, vulnerable at subspecies |
|Near Threatened |
|Not Evaluated |
|Other specially protected |
|Presumed Extinct |
|Presumed extinct |
|Priority 1: Poorly-known species |
|Priority 2: Poorly-known species |
|Priority 3: Poorly-known species |
|Priority 4: Rare, Near Threatened and other species in need of monitoring |
|Special Least Concern |
|Vulnerable |
|Vulnerable and Migratory birds protected under international agreement |
|Vulnerable, Migratory birds protected under international agreement at species level |
+-------------------------------------------------------------------------------------------------------------------------------------+
scala> spark.sqlContext.sql("SELECT distinct sourceStatus FROM SPECIES_LIST order by sourceStatus").show(300, false)
+---------------------------------------------------+
|sourceStatus |
+---------------------------------------------------+
|null |
| |
|Conservation dependent |
|Critically Endangered |
|Critically endangered |
|Critically endangered wildlife |
|Data Deficient or Delisted |
|Endangered |
|Endangered wildlife |
|Extinct |
|Extinct in the wild wildlife |
|Extinct or Endangered |
|Extinct or Endangered (needs further qualification)|
|Extinct wildlife |
|International wildlife |
|Least concern wildlife |
|Listed under FFG Act |
|Near Threatened |
|Near threatened wildlife |
|Not listed (needs further qualification) |
|Poorly Known |
|Presumed Extinct |
|Priority Five |
|Priority Four |
|Priority One |
|Priority Three |
|Priority Two |
|Rare |
|Rare (needs further qualification) |
|Rare and Restricted |
|Regionally Extinct |
|Special least concern wildlife |
|Vulnerable |
|Vulnerable (needs further qualification) |
|Vulnerable wildlife |
|endangered |
|vulnerable |
+---------------------------------------------------+
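Both listings contain values that differ only by case (for example "Critically Endangered" / "Critically endangered" and "Endangered" / "endangered"). A minimal sketch of how such variants could be surfaced, in plain Scala so it can be pasted into the same shell. The `StatusVariants` object and its `normaliseKey` helper are hypothetical names for illustration, not part of the pipelines code, and the sample values are a subset copied from the two listings above.

```scala
import java.util.Locale

object StatusVariants {
  // Hypothetical helper: trim, collapse whitespace and lower-case a value
  // so that spellings differing only by case share one key.
  def normaliseKey(value: String): String =
    value.trim.replaceAll("\\s+", " ").toLowerCase(Locale.ROOT)

  // Group distinct values by normalised key and keep only the keys
  // that have more than one spelling.
  def caseVariants(values: Seq[String]): Map[String, Seq[String]] =
    values.distinct
      .groupBy(normaliseKey)
      .filter { case (_, spellings) => spellings.size > 1 }
}

// A subset of the values from the status and sourceStatus listings.
val observed = Seq(
  "Critically Endangered", "Critically endangered",
  "Extinct in the Wild", "Extinct in the wild",
  "Presumed Extinct", "Presumed extinct",
  "Endangered", "endangered",
  "Vulnerable", "vulnerable",
  "Near Threatened"
)

// Print each normalised key with the spellings that map to it.
StatusVariants.caseVariants(observed).toSeq.sortBy(_._1).foreach {
  case (key, spellings) => println(s"$key -> ${spellings.mkString(" | ")}")
}
```

This only catches case and whitespace differences. Spelling variants such as "Conservation dependant" would still need a manual mapping.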
stateConservation values are being supplied by BioNET and WildNet, and this is where the odd values are coming from. Below is a listing from BioNET.
scala> spark.sqlContext.sql("CREATE TEMPORARY VIEW dr368 USING avro OPTIONS (path \"hdfs://....../pipelines-data/dr368/1/verbatim.avro\")")
scala> spark.sqlContext.sql("select coreTerms.`http://unknown.org/stateConservation`, count(*) from dr368 group by coreTerms.`http://unknown.org/stateConservation` ").show(false)
+----------------------------------------------+--------+
|http://unknown.org/stateConservation |count(1)|
+----------------------------------------------+--------+
|Endangered |3936 |
|Not Listed |27157 |
|Vulnerable |7942 |
|null |12124218|
|NSW National Parks and Wildlife Service |3 |
|NSW Dept of Planning, Industry and Environment|582 |
|Endangered Population, Vulnerable |25 |
|Critically Endangered |247 |
|Extinct |5 |
|Endangered Population |2 |
+----------------------------------------------+--------+
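A rough sketch of how the counts above could be screened for values that are not conservation statuses at all. The vocabulary in `expectedStatuses` is an assumption made up for this example (it is not an official ALA or BioNET list), and `StateConservationCheck` is a hypothetical name. The null row is left out of the sample data.

```scala
import java.util.Locale

object StateConservationCheck {
  // Assumption: an illustrative vocabulary, not an official list.
  val expectedStatuses: Set[String] = Set(
    "critically endangered", "endangered", "vulnerable", "extinct",
    "endangered population", "not listed"
  )

  // A value may carry several comma-separated statuses
  // (e.g. "Endangered Population, Vulnerable"); every part must be known.
  def isExpected(value: String): Boolean =
    value.split(",")
      .map(_.trim.toLowerCase(Locale.ROOT))
      .forall(expectedStatuses.contains)

  // Keep only the (value, count) pairs that fail the vocabulary check.
  def suspicious(counts: Seq[(String, Long)]): Seq[(String, Long)] =
    counts.filterNot { case (value, _) => isExpected(value) }
}

// Non-null rows copied from the dr368 listing above.
val dr368Counts: Seq[(String, Long)] = Seq(
  "Endangered" -> 3936L,
  "Not Listed" -> 27157L,
  "Vulnerable" -> 7942L,
  "NSW National Parks and Wildlife Service" -> 3L,
  "NSW Dept of Planning, Industry and Environment" -> 582L,
  "Endangered Population, Vulnerable" -> 25L,
  "Critically Endangered" -> 247L,
  "Extinct" -> 5L,
  "Endangered Population" -> 2L
)

StateConservationCheck.suspicious(dr368Counts).foreach {
  case (value, count) => println(s"$count\t$value")
}
```

With this vocabulary, only the two agency names are flagged, which matches the values that look like they belong in Owner institution code.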
@peggynewman I think you can take this from here. Please let us know otherwise.
Issue moved to AtlasOfLivingAustralia/data-management #700 via ZenHub
Some odd values are making it through in the stateConservation field. Need to check the values extracted from lists.ala.org.au.