AtlasOfLivingAustralia / biocache-service

Occurrence & mapping webservices
https://biocache-ws.ala.org.au/ws/

Adding a `field` without data breaks larger occurrences download #930

Open mjwestgate opened 1 month ago

mjwestgate commented 1 month ago

This is based on an issue identified using galah here. Basically, when we select a field in our occurrence download, for a query where no records have data in that field, the whole download fails. I've put @daxkellie's summary of the problem below.

To walk through the problem, the following query asks for counts of Acacia aneura grouped by scientificName:

https://api.ala.org.au/occurrences/occurrences/facets?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&facets=scientificName&fsort=count&flimit=10000

It returns this: [{"fieldName":"scientificName","fieldResult":[{"label":"Acacia aneura","i18nCode":"scientificName.Acacia aneura","count":80,"fq":"scientificName:\"Acacia aneura\""},{"label":"Acacia aneura var. major","i18nCode":"scientificName.Acacia aneura var. major","count":6,"fq":"scientificName:\"Acacia aneura var. major\""},{"label":"Acacia aneura var. aneura","i18nCode":"scientificName.Acacia aneura var. aneura","count":1,"fq":"scientificName:\"Acacia aneura var. aneura\""}],"count":3}]

Which is great. By changing facets to location, we get no records, suggesting that this field is empty:

https://api.ala.org.au/occurrences/occurrences/facets?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&facets=location&fsort=count&flimit=10000
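For anyone wanting to reproduce the two facet checks above, here is a minimal sketch that builds the same URLs and tests whether a facet response contains any values. The endpoint and parameters are copied from the queries above; the helper names `facet_url` and `field_has_data` are hypothetical, and the shape of an "empty" response (an empty list, or an entry with an empty `fieldResult`) is an assumption, since only the non-empty response is shown here.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the facet queries in this issue.
FACETS_ENDPOINT = "https://api.ala.org.au/occurrences/occurrences/facets"


def facet_url(fq, facet, quality_profile="ALA", flimit=10000):
    """Build a facet-count URL using the parameters shown in the issue."""
    params = {
        "fq": fq,
        "qualityProfile": quality_profile,
        "facets": facet,
        "fsort": "count",
        "flimit": flimit,
    }
    # quote (rather than quote_plus) percent-encodes every reserved character.
    return FACETS_ENDPOINT + "?" + urlencode(params, quote_via=quote)


def field_has_data(facet_response):
    """Return True if any facet entry in the parsed JSON lists a value.

    Assumption: a field with no data comes back as an empty list or as an
    entry whose "fieldResult" is empty.
    """
    return any(entry.get("fieldResult") for entry in facet_response)


fq = '(year:"2002")AND(lsid:"https://id.biodiversity.org.au/node/apni/6707550")'
print(facet_url(fq, "location"))
```

Fetching the printed URL and passing the parsed JSON to `field_has_data` would flag `location` as empty for this query before a download is requested.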

Again, fine. We then format the request as an occurrence download, requesting several fields, including location:

"https://biocache-ws.ala.org.au/ws/occurrences/offline/download?fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&qualityProfile=ALA&fields=recordID%2CscientificName%2CvernacularName%2Ckingdom%2CeventDate%2CsamplingProtocol%2CindividualCount%2CrecordedBy%2Clocation&qa=none&facet=false&emailNotify=false&sourceTypeId=2004&reasonTypeId=4&email=martinjwestgate%40gmail.com&dwcHeaders=true"

This runs, stating we expect to receive 87 records:

{"status":"inQueue","totalRecords":87,"queueSize":1,"statusUrl":"https://biocache-ws.ala.org.au/ws/occurrences/offline/status/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1-1729125651839","cancelUrl":"https://biocache-ws.ala.org.au/ws/occurrences/offline/cancel/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1-1729125651839","searchUrl":"https://biocache.ala.org.au/occurrences/search?&q=*%3A*&fq=%28year%3A%222002%22%29AND%28lsid%3A%22https%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F6707550%22%29&disableAllQualityFilters=true&fq=-basisOfRecord%3A%22FOSSIL_SPECIMEN%22+AND+-%28basisOfRecord%3A%22MATERIAL_SAMPLE%22+AND+contentTypes%3A%22Environmental+DNA%22%29&fq=-%28duplicate_status%3A%22ASSOCIATED%22+AND+duplicateType%3A%22DIFFERENT_DATASET%22%29&fq=-assertions%3ATAXON_MATCH_NONE+AND+-assertions%3AINVALID_SCIENTIFIC_NAME+AND+-assertions%3ATAXON_HOMONYM+AND+-assertions%3AUNKNOWN_KINGDOM+AND+-assertions%3ATAXON_SCOPE_MISMATCH&fq=-occurrenceStatus%3AABSENT&fq=-identificationVerificationStatus%3A%22needs_id%22&fq=-userAssertions%3A50001+AND+-userAssertions%3A50005&fq=-year%3A%5B*+TO+1700%5D&fq=-establishmentMeans%3A%22MANAGED%22+AND+-decimalLatitude%3A0+AND+-decimalLongitude%3A0+AND+-assertions%3A%22PRESUMED_SWAPPED_COORDINATE%22+AND+-assertions%3A%22COORDINATES_CENTRE_OF_STATEPROVINCE%22+AND+-assertions%3A%22COORDINATES_CENTRE_OF_COUNTRY%22+AND+-assertions%3A%22PRESUMED_NEGATED_LATITUDE%22+AND+-assertions%3A%22PRESUMED_NEGATED_LONGITUDE%22&fq=-outlierLayerCount%3A%5B3+TO+*%5D&fq=-spatiallyValid%3A%22false%22&fq=-coordinateUncertaintyInMeters%3A%5B10001+TO+*%5D"}

Finally, the resulting zip file (https://biocache.ala.org.au/biocache-download/bb0481b1-8b95-33af-9a6d-16c7aa24f0f1/1729125651839/data.zip) has no data in it. Instead, we would expect all the requested fields to be downloaded, with only NAs in the location column.

kylie-m commented 1 month ago

Relates to support ticket: https://support.ehelp.edu.au/a/tickets/209037

adam-collins commented 1 month ago

The contents of the location field are the same as the lat_long field. This is signified by sourceFields in the index/fields service. This location field is not intended for use with the download service. The service is likely misleading because we include the dataType name (instead of class) and indicate that it is stored=true. There is also an intentional lack of other information on the record such as description, downloadDescription, info, class(s), dwcTerm.

The problem that needs fixing is with the biocache-service index/fields service. It is currently exposing fields that are intended for use in search only (not facets, not downloads) but that still report stored=true because that is required for other reasons.

I think fields with the following dataTypes should be removed from the response, as using them requires knowledge of SOLR queries: geohash, packedQuad, quad, location.
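As a rough illustration of that filter, here is a sketch that drops fields by dataType. It assumes each index/fields entry is a dict with `name` and `dataType` keys; the sample entries and the function name `drop_complex_fields` are made up for the example and are not taken from the real response.

```python
# The four dataTypes named above as requiring SOLR knowledge.
COMPLEX_DATA_TYPES = {"geohash", "packedQuad", "quad", "location"}


def drop_complex_fields(fields):
    """Keep only entries whose dataType is not in the complex set."""
    return [f for f in fields if f.get("dataType") not in COMPLEX_DATA_TYPES]


# Illustrative entries only; the real response has more keys per field.
sample_fields = [
    {"name": "scientificName", "dataType": "string"},
    {"name": "location", "dataType": "location"},
    {"name": "year", "dataType": "int"},
]

for field in drop_complex_fields(sample_fields):
    print(field["name"])
```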


The intention is to keep other search only fields in the index/fields response.

There is no intention to include virtual search fields in index/fields.

mjwestgate commented 1 month ago

OK thanks @adam-collins, that makes sense. It also tallies with our workflows; we only allow users to query fields that are listed in index/fields, so if they aren't in there, the query will get stopped by galah at an earlier stage.
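To make that check concrete, here is a sketch of the idea in Python. This is not galah's actual code (galah is an R package); the function name `split_fields` and the `available` list are hypothetical, with `available` standing in for the field names returned by index/fields after `location` has been removed.

```python
def split_fields(requested, available):
    """Split requested fields into those listed in index/fields and the rest."""
    known = set(available)
    allowed = [f for f in requested if f in known]
    rejected = [f for f in requested if f not in known]
    return allowed, rejected


# Fields from the download request earlier in this issue.
requested = [
    "recordID", "scientificName", "vernacularName", "kingdom", "eventDate",
    "samplingProtocol", "individualCount", "recordedBy", "location",
]
# Hypothetical index/fields listing once "location" is no longer exposed.
available = [f for f in requested if f != "location"] + ["year"]

allowed, rejected = split_fields(requested, available)
if rejected:
    print("Not available for download:", ", ".join(rejected))
```

With a check like this, the request is stopped on the client side and the empty zip file is never produced.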

While we're doing that it might make sense to have a spring clean of other content too. The first three fields listed are _nest_parent_, _nest_path_ and _root_, for example, which doesn't seem right either.
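One possible way to spot such entries is a naming heuristic, sketched below. Treating every name wrapped in underscores as internal is an assumption based only on the three examples above, and `is_internal_field` is a hypothetical helper rather than anything in biocache-service.

```python
def is_internal_field(name):
    """Heuristic: names that start and end with an underscore look internal."""
    return len(name) > 2 and name.startswith("_") and name.endswith("_")


names = ["_nest_parent_", "_nest_path_", "_root_", "scientificName", "lat_long"]
public_names = [n for n in names if not is_internal_field(n)]
print(public_names)
```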

adam-collins commented 1 month ago

After the cleanup, index/fields will contain no internal-use fields and no fields whose data types are deemed too complicated to use. It will include:

To differentiate between the two