AtlasOfLivingAustralia / ALA4R

Access data and resources hosted by the Atlas of Living Australia (ALA)
https://atlasoflivingaustralia.github.io/ALA4R/

fq matches on empty string in specieslist() #24

Closed snubian closed 7 years ago

snubian commented 7 years ago

Just noticed this when using specieslist(), e.g.:

> wktPoly <- "POLYGON((152.38 -30.43,152.5 -30.43,152.5 -30.5,152.38 -30.5,152.38 -30.43))"
> x <- ALA4R::specieslist(wkt = wktPoly, fq = "kingdom:Plantae")
> table(x$kingdom, useNA = "always")

        Plantae    <NA> 
    156     964       0 

So having specified fq = "kingdom:Plantae" we have 156 records with empty string for kingdom.

In some ways I can see why it is informative to include these records with missing values, so I'm not sure if this behaviour is by design. But perhaps an option in the style of na.rm could be included?
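For illustration, a user-side filter in the spirit of na.rm might look like the sketch below. The helper name drop_empty_taxa is hypothetical and not part of ALA4R, and the data frame is made-up data standing in for specieslist() output.

```r
# Hypothetical helper (not an ALA4R function): drop rows whose taxonomic
# field is NA or an empty string, similar in spirit to na.rm = TRUE.
drop_empty_taxa <- function(x, field = "kingdom") {
  vals <- x[[field]]
  # nzchar(NA) is TRUE by default, so NA is checked separately
  keep <- !is.na(vals) & nzchar(vals)
  x[keep, , drop = FALSE]
}

# Made-up data standing in for specieslist() output
x <- data.frame(
  species = c("sp A", "sp B", "sp C"),
  kingdom = c("Plantae", "", NA),
  stringsAsFactors = FALSE
)

res <- drop_empty_taxa(x)
table(res$kingdom, useNA = "always")
```

Note that this discards records rather than explaining why their kingdom is empty, so it may hide species that genuinely match the filter.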

raymondben commented 7 years ago

Very odd. The species Wijkia extenuata is one from your example. In the output of specieslist() and indeed the output of species_info() or taxinfo_download() it has no kingdom:

> species_info(guid="e851274b-d043-4cef-ba01-2a7eb5abc80f")$classification$kingdom
NULL

> taxinfo_download("e851274b-d043-4cef-ba01-2a7eb5abc80f")$kingdom
[1] NA

But its page on the ALA web site (http://bie.ala.org.au/species/e851274b-d043-4cef-ba01-2a7eb5abc80f#classification) puts it in the kingdom Plantae. So I presume that it is being included in the returned result set because some part of the server database thinks it's in Plantae, but it comes back with empty kingdom because another part doesn't. I'll check in with the ALA devs.

snubian commented 7 years ago

Thank you once again for a prompt response.

I've been using the ALA's web services off and on for several years and still I have no precise understanding of what is happening under the hood. I've more or less accepted that this will remain one of life's mysteries. At least your package makes it much simpler :)

nickdos commented 7 years ago

I'll pass this on to Doug, who is working on the names processing. I have a feeling the kingdom is being inferred from the fact that the source is "AusMoss". The species page calls a separate webservice to build the taxonomy, so the smarts are probably coming from that service. It looks like a bug that the species is not being correctly placed in our taxonomy by the normal species service.

nickdos commented 7 years ago

@raymondben can you look up the actual webservice the plugin is calling for the ALA4R::specieslist command, please? I have a feeling the fq=kingdom:Plantae might be deprecated depending on which service it's hitting... It's worth trying this instead: fq=rk_kingdom:Plantae - species names fields changed a bit last year - full list of fields is available at http://biocache.ala.org.au/ws/index/fields. EDIT - looks like it's hitting biocache.ala.org.au, not bie.ala.org.au as I thought. In which case fq looks OK.

raymondben commented 7 years ago

It's using http://api.ala.org.au/#ws106 (http://biocache.ala.org.au/ws/occurrences/facets/download?...)

nickdos commented 7 years ago

The blank kingdom in the species list CSV output arises because kingdom data is missing in the BIE (as Ben noted), while some occurrence records provide kingdom in their original Darwin Core data. Thus fq=kingdom:Plantae returns records, but the subsequent lookup against the BIE for each unique species in the occurrence facet results provides an empty "kingdom" column. Should be fixed with better smarts for populating higher taxa in the BIE, which is an ongoing "improvement" we're working on.
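The two-step behaviour described here (filter occurrences, then look up each unique taxon) can be mimicked offline. This is only a toy sketch: the tables, column names and taxon ids are made up and do not reflect the real biocache or BIE schemas.

```r
# Made-up occurrence records; kingdom comes from the supplied Darwin Core data
occ <- data.frame(
  taxon_id = c("t1", "t1", "t2"),
  kingdom  = c("Plantae", "Plantae", "Plantae"),
  stringsAsFactors = FALSE
)

# Made-up BIE-like lookup table; taxon t2 has no kingdom recorded
bie <- data.frame(
  taxon_id = c("t1", "t2"),
  kingdom  = c("Plantae", ""),
  stringsAsFactors = FALSE
)

# Step 1: the fq filter is applied to the occurrence records
hits <- occ[occ$kingdom == "Plantae", ]

# Step 2: facet on taxon, then look up each unique taxon in the BIE-like table
counts <- as.data.frame(table(taxon_id = hits$taxon_id),
                        stringsAsFactors = FALSE)
names(counts)[2] <- "count"
species_list <- merge(counts, bie, by = "taxon_id")

species_list
```

Taxon t2 passes the filter in step 1 because its occurrence record says Plantae, but its kingdom column in the final list is empty because that column is filled from the lookup table.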

raymondben commented 7 years ago

Thanks @nickdos for tracking it down. Looks like we can safely assume that any returned record does satisfy the fq filter (if one has been given)? Until those BIE improvements are done, I don't think it's possible to build a general workaround at the R end, but users can repopulate missing fields themselves if needed. I'll make a note in the function help.
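As a rough sketch of what repopulating a missing field could look like on the user side: the helper fill_from_fq below is hypothetical (not an ALA4R function), handles only a single simple "field:value" filter, and assumes both that the fq field name matches a column name in the result and that every returned record really does satisfy the filter.

```r
# Hypothetical helper (not part of ALA4R): fill missing or empty values of
# the filtered field with the value given in a simple "field:value" fq.
fill_from_fq <- function(x, fq) {
  parts <- strsplit(fq, ":", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2)  # only simple field:value filters supported
  field <- parts[1]
  value <- parts[2]
  if (!field %in% names(x)) return(x)  # nothing to fill
  missing <- is.na(x[[field]]) | !nzchar(x[[field]])
  x[[field]][missing] <- value
  x
}

# Made-up data standing in for specieslist() output
x <- data.frame(
  species = c("sp A", "sp B", "sp C"),
  kingdom = c("Plantae", "", NA),
  stringsAsFactors = FALSE
)

filled <- fill_from_fq(x, "kingdom:Plantae")
table(filled$kingdom, useNA = "always")
```

This only covers the field named in the filter; other empty taxonomic columns (phylum, class and so on) would still need to be filled from another source.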

raymondben commented 7 years ago

@nickdos what does the q parameter actually get matched against with that service? The API docs say "Query of the form field:value e.g. q=genus:Macropus or a free text search e.g. q=Macropus" but I think the free-text part of that is no longer correct. Previously this worked: http://biocache.ala.org.au/ws/occurrences/facets/download?q=Macropus&facets=taxon_concept_lsid&lookup=true&count=true but now gives no matches. A "genus:Macropus" style query still works: http://biocache.ala.org.au/ws/occurrences/facets/download?q=genus%3AMacropus&facets=taxon_concept_lsid&lookup=true&count=true Has something changed or have I misunderstood the usage?

nickdos commented 7 years ago

@raymondben just yesterday we discovered that biocache searches are not working without a field specified - this is a bug that slipped into the last full re-index. We use SOLR and it allows you to set a default field, which is "text", so q=Macropus is effectively q=text:Macropus.
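Until the default field is restored, a client can name the field explicitly. A minimal sketch, assuming the default field is "text" as described above; build_q is a hypothetical helper, not an ALA4R or biocache function.

```r
# Hypothetical helper: prefix a bare search term with an explicit field
# (assumed default "text") and URL-encode the result for use as q=...
build_q <- function(term, field = "text") {
  q <- if (grepl(":", term, fixed = TRUE)) term else paste0(field, ":", term)
  utils::URLencode(q, reserved = TRUE)
}

q_free  <- build_q("Macropus")        # bare term gets the "text" field
q_field <- build_q("genus:Macropus")  # already fielded, left as-is

url <- paste0(
  "http://biocache.ala.org.au/ws/occurrences/facets/download?q=", q_field,
  "&facets=taxon_concept_lsid&lookup=true&count=true"
)
```

The check for ":" is deliberately naive; a term that itself contains a colon would be treated as already fielded.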

raymondben commented 7 years ago

The original problem here (empty taxonomic fields) is an issue with the underlying ALA service, and is being addressed in https://github.com/AtlasOfLivingAustralia/bie-index/issues/134. Closing this one.

snubian commented 7 years ago

Thanks guys for chasing this up, much appreciated.