Add data field for "taxa is present in Australia"

nickdos commented 1 year ago

This is probably a pipelines process that will compare the taxon for a given record with a list of all known taxa for Australia.

The known list of Australian taxa should be derived from the ALA Biocache data, using a filter for country:Australia (uses AUS EEC layer).

CSV download:

~https://biocache.ala.org.au/occurrences/facets/download?q=*%3A*&qualityProfile=ALA&facets=taxon_name~ ~https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=*%3A*&qualityProfile=ALA&facets=taxon_name~ ~https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=country:Australia&qualityProfile=AVH&facets=taxonConceptID&count=true&file=AU_all_taxa_tc_counts.csv~

https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=country:Australia&fq=taxonRankID:[6000 TO 7000]&qualityProfile=AVH&facets=scientificName,taxonConceptID&count=true&file=AU_all_taxa_counts
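The URL above contains a space and square brackets in the `fq` range, so it needs percent-encoding before a script can fetch it. Below is a minimal Python sketch of building the encoded URL from the same parameters and reading the downloaded CSV into a lookup keyed by taxon. The function names are hypothetical, and the CSV column names (`taxonConceptID`, `count`) are assumptions that should be checked against the header of the actual download.

```python
import csv
import io
from urllib.parse import quote, urlencode

# Endpoint taken from the download link in this issue.
BASE_URL = "https://biocache-ws.ala.org.au/ws/occurrences/facets/download"


def build_facet_download_url(base_url=BASE_URL):
    """Return the facet download URL with all parameters percent-encoded."""
    params = [
        ("q", "country:Australia"),
        ("fq", "taxonRankID:[6000 TO 7000]"),
        ("qualityProfile", "AVH"),
        ("facets", "scientificName,taxonConceptID"),
        ("count", "true"),
        ("file", "AU_all_taxa_counts"),
    ]
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return base_url + "?" + urlencode(params, quote_via=quote)


def parse_facet_csv(text, id_column="taxonConceptID", count_column="count"):
    """Map taxon ID -> record count from CSV text.

    The column names are assumptions, not confirmed against the real file.
    """
    counts = {}
    for row in csv.DictReader(io.StringIO(text)):
        taxon_id = (row.get(id_column) or "").strip()
        if taxon_id:  # skip rows with no taxon ID
            counts[taxon_id] = int(row[count_column])
    return counts
```

The keys of the returned dict can then serve as the set of known Australian taxa. Fetching the URL itself is left out of the sketch, since the download may be large and slow.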

Generating the list of taxa for that query via SOLR or biocache-service is difficult: the result set is huge and the API times out.

~~One option is to use SOLR with deep pagination using cursors. Another is to run the query on Pipelines via Spark and save the result in S3. This seems to be the safest and most reliable option.~~ Use the CSV download (above) to get data into Pipelines. The existing species-list pipeline would be a good starting point in the code. That pipeline calls the ALA list API to pull down key-value (KV) data and populates Avro files using the taxon as the primary key.

This data needs a field name, something like presentInCountry:Australia. There might be an existing term for this, so it needs some research.
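As a rough illustration of the comparison step, the Python sketch below tags each record whose taxon appears in the known Australian list. It is not the Pipelines implementation: the function name is hypothetical, records are modelled as plain dicts, matching on `taxonConceptID` is an assumption, and the field name and value are placeholders pending the research mentioned above.

```python
def tag_present_in_country(records, known_taxa,
                           field="presentInCountry", value="AUSTRALIA"):
    """Yield a copy of each record, adding `field` when its taxon is known.

    `records` is an iterable of dicts; `known_taxa` is a set of taxon IDs.
    Matching on the "taxonConceptID" key is an assumption for illustration.
    """
    for record in records:
        tagged = dict(record)  # copy so the input record is not mutated
        if record.get("taxonConceptID") in known_taxa:
            tagged[field] = value
        yield tagged


# Usage with made-up IDs:
known = {"urn:x:1"}
rows = [{"taxonConceptID": "urn:x:1"}, {"taxonConceptID": "urn:x:9"}]
tagged_rows = list(tag_present_in_country(rows, known))
```

Records whose taxon is not in the list are passed through without the field, so a search on the field returns only taxa recorded in Australia.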

nickdos commented 1 year ago

Got a first version working via this search https://nectar-arga-dev-4.ala.org.au/?q=speciesListUid:dr18679

| search | count |
| --- | --- |
| all records | 1,985,398 |
| exact name match | 1,365,052 |
| located in Australia | 639,790 |

nickdos commented 1 year ago

Got the custom field working

nickdos commented 1 year ago

Now deployed to dev site: https://nectar-arga-dev-4.ala.org.au/?q=presentInCountry:AUSTRALIA

ARGA-Genomes / arga-data

Add data field for "taxa is present in Australia" #31