Open mjwestgate opened 1 month ago
Relates to support ticket: https://support.ehelp.edu.au/a/tickets/209037
The contents of the `location` field are the same as the `lat_long` field. This is signified by `sourceFields` in the `index/fields` service. This `location` field is not intended for use with the download service. The service is likely misleading because we include the dataType name (instead of class) and indicate that it is `stored=true`. There is also an intentional lack of other information on the record such as description, downloadDescription, info, class(s), dwcTerm.
The problem that needs fixing is with the biocache-service `index/fields` service. It is currently exposing fields that are intended for use in search only (not facets, not downloads) but that still report `stored=true` because that is required for other reasons.
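As a rough illustration of how a client might spot these fields today, here is a minimal sketch. It assumes each `index/fields` entry is a JSON object with the keys named in this thread (`sourceFields`, `description`, `downloadDescription`, `info`, `dwcTerm`), plus `name` and `classs`, which are guesses at the real key names. The sample entries are hypothetical.

```python
# Metadata keys the issue says are intentionally absent on search-only fields.
# The exact JSON key names are assumptions based on the names used in this thread.
DESCRIPTIVE_KEYS = ("description", "downloadDescription", "info", "classs", "dwcTerm")


def looks_search_only(field: dict) -> bool:
    """Heuristic: a copy field (has sourceFields) with no descriptive metadata."""
    has_source = bool(field.get("sourceFields"))
    has_metadata = any(field.get(key) for key in DESCRIPTIVE_KEYS)
    return has_source and not has_metadata


# Hypothetical entries, shaped like index/fields records for illustration only.
fields = [
    {"name": "location", "dataType": "location", "stored": True,
     "sourceFields": ["lat_long"]},
    {"name": "scientificName", "dataType": "string", "stored": True,
     "description": "Scientific name", "dwcTerm": "scientificName"},
]

suspect = [f["name"] for f in fields if looks_search_only(f)]
print(suspect)
```

This is only a heuristic; `stored=true` on its own does not distinguish the two kinds of field, which is the point of the issue.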
I think fields with the following dataTypes should be removed, as their usage requires knowledge of SOLR queries: geohash, packedQuad, quad, location.
The intention is to keep other search-only fields in the `index/fields` response.
There is no intention to include virtual search fields in `index/fields`.
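The proposed removal could be sketched as a simple filter on `dataType`. This is not the biocache-service implementation; the `name` key and the sample entries (including `exampleSearchOnly`) are hypothetical.

```python
# dataTypes named in this thread as requiring SOLR query knowledge.
COMPLEX_DATA_TYPES = {"geohash", "packedQuad", "quad", "location"}


def drop_complex_fields(fields: list[dict]) -> list[dict]:
    """Keep every entry whose dataType is not in the complex set."""
    return [f for f in fields if f.get("dataType") not in COMPLEX_DATA_TYPES]


# Hypothetical entries for illustration only.
fields = [
    {"name": "location", "dataType": "location", "stored": True},
    {"name": "exampleSearchOnly", "dataType": "string", "stored": False},
    {"name": "scientificName", "dataType": "string", "stored": True},
]

kept = [f["name"] for f in drop_complex_fields(fields)]
print(kept)
```

Note that the other search-only field stays in the result, in line with the intention stated above.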
OK thanks @adam-collins, that makes sense. It also tallies with our workflows; we only allow users to query fields that are listed in `index/fields`, so if they aren't in there, the query will get stopped by `galah` at an earlier stage.
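For anyone following along, the kind of client-side check described here might look roughly like the sketch below. This is not galah's actual code; the function name `check_fields` and the `name` key are made up for illustration.

```python
def check_fields(requested: list[str], index_fields: list[dict]) -> list[str]:
    """Return the requested names, or raise if any is not listed in index/fields."""
    known = {f["name"] for f in index_fields}
    unknown = [name for name in requested if name not in known]
    if unknown:
        raise ValueError(f"Unknown field(s): {', '.join(unknown)}")
    return list(requested)


# Hypothetical index/fields content after `location` has been removed.
index_fields = [{"name": "scientificName"}, {"name": "eventDate"}]

try:
    check_fields(["scientificName", "location"], index_fields)
except ValueError as err:
    print(err)
```

With a check like this in place, a request for `location` never reaches the download service once the field is dropped from `index/fields`.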
While we're doing that it might make sense to have a spring clean of other content too. The first three fields listed are `_nest_parent_`, `_nest_path_` and `_root_`, for example, which don't seem right either.
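One possible way to catch these during a cleanup is a name-based filter. The sketch below assumes that internal fields can be recognised by a leading and trailing underscore, which is an assumption drawn from the three names above and not a documented rule of the service.

```python
import re

# Assumption: internal fields are named like _something_.
INTERNAL_NAME = re.compile(r"^_.+_$")


def is_internal(name: str) -> bool:
    """True when the field name starts and ends with an underscore."""
    return bool(INTERNAL_NAME.match(name))


names = ["_nest_parent_", "_nest_path_", "_root_", "scientificName"]
public = [n for n in names if not is_internal(n)]
print(public)
```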
After the cleanup, `index/fields` will contain no internal-use fields and no fields with data types deemed too complicated to use. It will include:
To differentiate between the two:
`stored: true` can be downloaded and faceted
`stored: false` cannot be downloaded or faceted
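Under that rule, a client could derive what each field supports from the `stored` flag alone. A minimal sketch, assuming every field left in `index/fields` after the cleanup is searchable and that `stored` is a boolean on each entry:

```python
def capabilities(field: dict) -> dict:
    """Map the stored flag to the uses described above (post-cleanup assumption)."""
    stored = bool(field.get("stored"))
    return {"search": True, "facet": stored, "download": stored}


# Hypothetical entries for illustration only.
print(capabilities({"name": "scientificName", "stored": True}))
print(capabilities({"name": "exampleSearchOnly", "stored": False}))
```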
This is based on an issue identified using galah here. Basically, when we select a field in our occurrence download, for a query where no records have data in that field, the whole download fails. I've put @daxkellie's summary of the problem below.
To walk through the problem, the following query asks for counts of Acacia aneura grouped by `scientificName`:
https://api.ala.org.au/occurrences/occurrences/facets?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&facets=scientificName&fsort=count&flimit=10000
It returns this:
[{"fieldName":"scientificName","fieldResult":[{"label":"Acacia aneura","i18nCode":"scientificName.Acacia aneura","count":80,"fq":"scientificName:\"Acacia aneura\""},{"label":"Acacia aneura var. major","i18nCode":"scientificName.Acacia aneura var. major","count":6,"fq":"scientificName:\"Acacia aneura var. major\""},{"label":"Acacia aneura var. aneura","i18nCode":"scientificName.Acacia aneura var. aneura","count":1,"fq":"scientificName:\"Acacia aneura var. aneura\""}],"count":3}]
Which is great. By changing `facets` to `location`, we get no records, suggesting that this field is empty:
https://api.ala.org.au/occurrences/occurrences/facets?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&facets=location&fsort=count&flimit=10000
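To make the two facet requests easier to compare, here is a small sketch that rebuilds the URL from its parts using only the standard library. The helper name `facet_url` is made up; the endpoint and parameter values are copied from the URLs above. It only builds the string and does not send a request.

```python
from urllib.parse import urlencode

BASE = "https://api.ala.org.au/occurrences/occurrences/facets"

# The filter query from the URLs above, before percent-encoding.
FQ = '(year:"2002")AND(lsid:"https://id.biodiversity.org.au/node/apni/6707550")'


def facet_url(facet: str) -> str:
    """Build the facet-count URL for one field, keeping all other parameters fixed."""
    params = {
        "fq": FQ,
        "qualityProfile": "ALA",
        "facets": facet,
        "fsort": "count",
        "flimit": 10000,
    }
    return f"{BASE}?{urlencode(params)}"


print(facet_url("scientificName"))
print(facet_url("location"))
```

The only difference between the two requests is the value of `facets`.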
Again, fine. We then format the request as an occurrence download, requesting a number of fields including `location`:
"https://biocache-ws.ala.org.au/ws/occurrences/offline/download?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&fields=recordID%2CscientificName%2CvernacularName%2Ckingdom%2CeventDate%2CsamplingProtocol%2CindividualCount%2CrecordedBy%2Clocation&qa=none&facet=false&emailNotify=false&sourceTypeId=2004&reasonTypeId=4&email=martinjwestgate%40gmail.com&dwcHeaders=true"
This runs, stating we expect to receive 87 records:
{"status":"inQueue","totalRecords":87,"queueSize":1,"statusUrl":"https://biocache-ws.ala.org.au/ws/occurrences/offline/status/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1-1729125651839","cancelUrl":"https://biocache-ws.ala.org.au/ws/occurrences/offline/cancel/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1-1729125651839","searchUrl":"https://biocache.ala.org.au/occurrences/search?&q=*%3A*&fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&disableAllQualityFilters=true&fq=-basisOfRecord%3A%22FOSSIL_SPECIMEN%22+AND+-%28basisOfRecord%3A%22MATERIAL_SAMPLE%22+AND+contentTypes%3A%22Environmental+DNA%22%29&fq=-%28duplicate_status%3A%22ASSOCIATED%22+AND+duplicateType%3A%22DIFFERENT_DATASET%22%29&fq=-assertions%3ATAXON_MATCH_NONE+AND+-assertions%3AINVALID_SCIENTIFIC_NAME+AND+-assertions%3ATAXON_HOMONYM+AND+-assertions%3AUNKNOWN_KINGDOM+AND+-assertions%3ATAXON_SCOPE_MISMATCH&fq=-occurrenceStatus%3AABSENT&fq=-identificationVerificationStatus%3A%22needs_id%22&fq=-userAssertions%3A50001+AND+-userAssertions%3A50005&fq=-year%3A%5B*+TO+1700%5D&fq=-establishmentMeans%3A%22MANAGED%22+AND+-decimalLatitude%3A0+AND+-decimalLongitude%3A0+AND+-assertions%3A%22PRESUMED_SWAPPED_COORDINATE%22+AND+-assertions%3A%22COORDINATES_CENTRE_OF_STATEPROVINCE%22+AND+-assertions%3A%22COORDINATES_CENTRE_OF_COUNTRY%22+AND+-assertions%3A%22PRESUMED_NEGATED_LATITUDE%22+AND+-assertions%3A%22PRESUMED_NEGATED_LONGITUDE%22&fq=-outlierLayerCount%3A%5B3+TO+*%5D&fq=-spatiallyValid%3A%22false%22&fq=-coordinateUncertaintyInMeters%3A%5B10001+TO+*%5D"}
Finally, the resulting Zip file (https://biocache.ala.org.au/biocache-download/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1/1729125651839/data.zip) has no data in it. What we would expect instead would be for all the requested fields to be downloaded, but with only `NA`s in the `location` column.
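To be concrete about the expected behaviour, the sketch below writes every requested column and fills missing values with `NA`. It is only an illustration of the desired output, not how biocache-service builds downloads; the function name `write_download` and the sample record are hypothetical.

```python
import csv
import io


def write_download(records: list[dict], fields: list[str], na: str = "NA") -> str:
    """Write all requested columns; missing or None values become the NA marker."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, restval=na, extrasaction="ignore"
    )
    writer.writeheader()
    for record in records:
        # Drop None values so that restval fills them in as NA.
        writer.writerow({k: v for k, v in record.items() if v is not None})
    return buffer.getvalue()


# Hypothetical record with no value for `location`.
records = [{"recordID": "1", "scientificName": "Acacia aneura", "location": None}]
text = write_download(records, ["recordID", "scientificName", "location"])
print(text)
```

The row is still written even though one requested column has no data for any record.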