AtlasOfLivingAustralia / ALA4R

Access data and resources hosted by the Atlas of Living Australia (ALA)
https://atlasoflivingaustralia.github.io/ALA4R/

Suspected.outlier field no longer in occurrence download? #27

snubian closed this issue 7 years ago

snubian commented 7 years ago

I've noticed that a field I've used in the past does not seem to be included in occurrence downloads anymore. I believe the field was called Suspected.outlier but might've been detectedOutlier in ALA4R::occurrences(). Downloads I did using http://biocache.ala.org.au/ws/occurrences/index/download back in July 2016 included this field, and I feel like I've seen it recently, but it's no longer included when I run the same download, or when using occurrences().

Do you guys keep track of these sorts of things? I figured you might need to know as you seem to prettify the naming of the fields.

Thanks!

raymondben commented 7 years ago

Some field names changed around April, suspect this was one of them. Looking at ala_fields("occurrence_indexed") I see a field called outlier_layer (whose description is "Outlier for layer"). Is this it? The corresponding column name in the occurrence download data is outlierLayer.

We do prettify the column names in ALA4R, which was originally to try and make them consistent across functions. With the new field name changes we'll need to look at that and make sure we're still doing sensible things. Backwards compatibility might become an issue here, but we'll see how we go I guess.

snubian commented 7 years ago

Thanks @raymondben - I saw the outlier_layer column, but the data in it is stuff like el845 or whatever, so I think it means that the data for those environmental layers are considered outliers(?). Previously the field was a TRUE/FALSE indicating a suspected spatial outlier.

I suspect it's simply gone, will dig around a bit more. Thanks again for your quick response, and once again for your efforts with this fantastic package.

raymondben commented 7 years ago

The contents of the el845 etc environmental fields are populated from gridded environmental data, so if the position of the observation is a spatial outlier then those values will be outliers with respect to the norm for that species. So I'm guessing that previously the "outlier" status was calculated on the basis of environmental layers and now it is just giving more info about which layers indicate its outlieriness ... but I'm only guessing. @nickdos @adam-collins - can you enlighten us?

nickdos commented 7 years ago

When we introduced our "offline" downloads, we changed the way the download file was generated but tried to keep most fields the same. The old download format is still available but is limited to 100,000 records, whereas the newer offline download has no limit (for now). E.g. http://biocache.ala.org.au/ws/occurrences/download?q=genus:Macropus (i.e. without the /offline part).

The difference is that the old download is produced directly from the SOLR index, whereas the new download is produced directly from the Cassandra database (the SOLR index is a subset of the data in Cassandra). The outlier_layer field is only in the SOLR index, I think, so we'd need to either calculate that value on the fly for the offline download or handle it some other way.

For now, the SOLR download is still available, so I think ALA4R could provide an option to use either the older SOLR download (100,000 max) or the newer offline download. The SOLR download will be quicker and may suit some users better, but it makes the API more complicated, since we'd have to explain the existence of 2 similar but slightly different APIs.

Another workaround would be to use the web interface to build 2 queries, one with records where detectedOutlier is true and another where it is false. Then trigger 2 downloads and merge them after manually setting the values for detectedOutlier... if that makes sense.
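
The merge step of this workaround could look roughly like the following. It is a minimal Python sketch (not ALA4R code) that assumes the two downloads have already been saved as CSV text. The sample rows, the `id` and `scientificName` columns, and the helper name `tag_and_merge` are made up for illustration.

```python
import csv
import io


def tag_and_merge(outlier_csv, non_outlier_csv, flag_column="detectedOutlier"):
    """Combine two downloads, recording which query each row came from.

    Both arguments are CSV text with a header row. The flag column is set
    by hand, because the downloads themselves do not carry it.
    """
    merged = []
    for text, flag in ((outlier_csv, True), (non_outlier_csv, False)):
        for row in csv.DictReader(io.StringIO(text)):
            row[flag_column] = flag
            merged.append(row)
    return merged


# Hypothetical sample data standing in for the two downloaded files.
outliers = "id,scientificName\n1,Macropus rufus\n"
others = "id,scientificName\n2,Macropus giganteus\n3,Macropus rufus\n"

records = tag_and_merge(outliers, others)
for record in records:
    print(record["id"], record["detectedOutlier"])
```

The flag is only as reliable as the two queries: a record that matches neither query will be absent from the merged result altogether.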

raymondben commented 7 years ago

ALA4R does both the indexed (SOLR) and offline methods. The outlierforLayers column seems to be returned in both. I think our main question here is whether that field is equivalent to the old Suspected.outlier or detectedOutlier field.

snubian commented 7 years ago

Thanks once again @nickdos! And @raymondben I may have the answer to that question. Just looking at a recent download, the outlierForLayer field has data like el882, el865 etc, which refer to bioclimatic variables such as temperature, precipitation, etc. This is also the same as Outlier for layer and Outlier layer count filters on the web interface.

A download from some months ago includes both the Outlier.for.layer field (with data as above) and the Suspected.outlier field, which to my understanding is a TRUE/FALSE indicator of spatial outliers. So they seem to be different fields, yes. It would be great to know if a) the spatial outlier field still exists, and b) if it can be gotten at.
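
A comparison like the one above can be done mechanically on the header rows of the two files. The Python sketch below is illustrative only: the helper names `normalise` and `missing_fields` are hypothetical, and the headers are trimmed to the columns mentioned in this thread. It ignores case and punctuation so that `Outlier.for.layer` and `outlierForLayer` count as the same field.

```python
import re


def normalise(name):
    """Reduce a column name to lowercase letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def missing_fields(old_header, new_header):
    """Return old column names with no counterpart in the new header."""
    new_keys = {normalise(name) for name in new_header}
    return [name for name in old_header if normalise(name) not in new_keys]


# Headers trimmed to the columns discussed here; real downloads have
# many more columns.
old_header = ["Outlier.for.layer", "Suspected.outlier"]
new_header = ["outlierForLayer"]

print(missing_fields(old_header, new_header))  # prints ['Suspected.outlier']
```

This only catches spelling variants of the same name. A field that was genuinely renamed (say, from one word to a different one) would still be reported as missing and needs checking by hand.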

EDIT: I've tried Nick's suggestion for the old download format. It doesn't have suspectedOutlier but does have Outlier.for.layer, though there it's a 0/1 field.

As usual, many thanks to everyone, and any assistance is greatly appreciated :)

P.S. I should add, I'm not 100% sure that the old suspectedOutlier field was actually what I think it was!