Open nickdos opened 1 year ago
Got a first version working via this search https://nectar-arga-dev-4.ala.org.au/?q=speciesListUid:dr18679

| search | count |
|---|---|
| all records | 1,985,398 |
| exact name match | 1,365,052 |
| located in Australia | 639,790 |

Got the custom field working
Now deployed to dev site: https://nectar-arga-dev-4.ala.org.au/?q=presentInCountry:AUSTRALIA
This is probably a pipelines process that will compare the taxon for a given record with a list of all known taxa for Australia.
The known list of Australian taxa should be derived from the ALA Biocache data, using a filter for `country:Australia` (uses AUS EEC layer).

CSV download:
~https://biocache.ala.org.au/occurrences/facets/download?q=*%3A*&qualityProfile=ALA&facets=taxon_name~ ~https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=*%3A*&qualityProfile=ALA&facets=taxon_name~ ~https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=country:Australia&qualityProfile=AVH&facets=taxonConceptID&count=true&file=AU_all_taxa_tc_counts.csv~
https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=country:Australia&fq=taxonRankID:[6000 TO 7000]&qualityProfile=AVH&facets=scientificName,taxonConceptID&count=true&file=AU_all_taxa_counts
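
The URL above contains raw spaces and square brackets in the `fq` range, which some HTTP clients will reject or mangle. A minimal sketch of building the same request with proper encoding is below. The helper name `facet_download_url` and its defaults are made up for illustration; only the endpoint and parameter values are taken from the URL above.

```python
from urllib.parse import urlencode

# Endpoint copied from the download URL in this issue.
BASE = "https://biocache-ws.ala.org.au/ws/occurrences/facets/download"


def facet_download_url(q, fq, facets, quality_profile="AVH",
                       file_name="AU_all_taxa_counts"):
    """Build a facet download URL with all parameters percent-encoded.

    Hypothetical helper: parameter names mirror the URL in this issue,
    nothing else about the service is assumed.
    """
    params = [
        ("q", q),
        ("fq", fq),
        ("qualityProfile", quality_profile),
        ("facets", ",".join(facets)),
        ("count", "true"),
        ("file", file_name),
    ]
    return BASE + "?" + urlencode(params)


url = facet_download_url(
    q="country:Australia",
    fq="taxonRankID:[6000 TO 7000]",
    facets=["scientificName", "taxonConceptID"],
)
print(url)
```

The resulting string can be passed to `curl` or any HTTP client without further quoting.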
Generating a list of taxa for that query via SOLR or biocache-service is difficult: the result set is huge and the API times out before it completes.
~~One option is to use SOLR with deep pagination using cursors. Another is to run the query on Pipelines via Spark and save the result in S3. This seems to be the safest and most reliable option.~~ Use the CSV download (above) to get data into Pipelines. The existing `species-list` pipeline would be a good starting point in the code. This pipeline accesses the ALA list API to pull down KV data and populate Avro files using the taxon as a primary key.

It needs a field name for this data, something like `presentInCountry:Australia`. There might be an existing term for this, so needs some research.
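
For discussion, here is a rough Python sketch of the comparison step: load the known-taxa CSV into a set keyed on taxon ID, then tag each record whose taxon is in that set. This is not the pipelines implementation (which would be Java/Spark writing Avro). The CSV header `taxonConceptID`, the record shape, the function names and the sample IDs are all assumptions for illustration.

```python
import csv
import io


def load_known_taxa(csv_text, id_column="taxonConceptID"):
    """Return the set of taxon IDs found in the CSV text.

    Assumes the download has a header row containing `id_column`;
    the real header name should be checked against the actual file.
    """
    known = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        taxon_id = (row.get(id_column) or "").strip()
        if taxon_id:
            known.add(taxon_id)
    return known


def tag_record(record, known_taxa, field="presentInCountry", value="AUSTRALIA"):
    """Return a copy of the record, adding `field` if its taxon is known."""
    tagged = dict(record)
    if record.get("taxonConceptID") in known_taxa:
        tagged[field] = value
    return tagged


# Made-up sample data, only to show the shape of the inputs.
sample_csv = (
    "scientificName,taxonConceptID,count\n"
    "Example species A,urn:example:taxon:1,120\n"
    "Example species B,urn:example:taxon:2,7\n"
)
known = load_known_taxa(sample_csv)

in_au = tag_record({"taxonConceptID": "urn:example:taxon:1"}, known)
not_in_au = tag_record({"taxonConceptID": "urn:example:taxon:9"}, known)
```

Keying on the taxon ID rather than the name mirrors how the `species-list` pipeline uses the taxon as the primary key, and leaves records outside the list untouched rather than writing an explicit negative value.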