PoisotLab / GBIF.jl

Functions and types to access GBIF data from Julia
https://ecojulia.github.io/GBIF.jl/latest/
Other
19 stars 4 forks source link

Use Base julia semantics #44

Closed rafaqz closed 2 years ago

rafaqz commented 3 years ago

size and view currently don't use base semantics. Single-arg size(x) usually returns a Tuple, and view(x) normally returns a zero dimensional SubArray.

It might make sense if values added with occurances! are appended to an existing array that would start with zero length, so that none of these things are even needed. Then GBIFRecords <: AbstractVector would makes sense too, and all of these methods would have the normal semantics. length would just be the length of the parent Array.

@mkborregaard that also means usingcollect to get all the records might not make sense? as it should return an Array but we actually still want a GBIFRecords object, and if it collected all the records, it also ends up longer than what length returns for the GBIFRecords - which is also kind of confusing for how collect usually works.

I feel like this is more like reading a stream to an array, so it's possible we could use e.g. read!(oc) to mean read the whole stream?

tpoisot commented 3 years ago

Starting with a 0-length array (which was the case at some point) is obviously a terrible idea for performance reasons, so this will not happen.

While I do get your point about the semantics, I think we can have some flexibility for now - the package works, the documentation is updated, the code has been reviewed a few times. If we want to change all of that, with all that implies for both the user-facing code and the internals, there needs to be a discussion prior to opening issues, or at least suggestions. Not only about read! (which I don't think would make sense for most users), but about how to rename the number of effective records and the total number of records.

rafaqz commented 3 years ago

As I mentioned in read! I was hoping to use GBIF.jl in teaching examples, and for documentation examples in GeoData.jl, e.g. for extract. It's the perfect package to use to demonstrate getting some point observations, and extracting variables for them. But code examples cant really contain base julia methods with non-standard semantics.

I was hoping to make some improvements to the package, and willing to do all of the work, and wouldn't do anything that reduces the performance.

tpoisot commented 3 years ago

Not that I am unwilling to entertain the idea - but I need a concrete proposal, and I need this before we introduce read! (which I think is not a user-friendly name, but we'll discuss this later).

I am really grateful that you want to make improvements to the package, but as it's developed and heavily used in the lab, I want to make sure that we discuss things before writing code, so as not to waste your time writing it and our time reviewing it.

If you could reply with a proposed fix coupled to each of your issues, we can discuss this before moving on to a PR.

rafaqz commented 3 years ago

Methods to modify:

length: feels like it should be the number of records downloaded already size: is defined in base as (length,) for an array. We should do the same here. view(::GBIFRecords): probably shouldn't be defined here. In base it returns a zero dimensional SubArray

There are two main ways to do this.

  1. Use everything as is, and just change some names, so

Generally in this case array interface methods like length have to be manipulated to give the illusion of a shorter array of GBIFRecord than we actually have.

  1. Swap how the occurrences array works so it starts small and grows with new data.

In this case the array interface makes sense for GBIFRecords as-is - it's just a vector or GBIFRecord, with some metadata - the original request and the total_occurrences field.

rafaqz commented 3 years ago

Any comments on this? It would be good to clarify this before the upcoming course. We really should be using GBIF.jl in it.

rafaqz commented 3 years ago

For a real world demonstration, this will be in the GeoData.jl docs and used in teaching:

# A basic species distribution modelling workflow
using GeoData, GBIF, CSV, Plots 

# Load occurrences for the Mountain Pygmy Possum
t = taxon("Burramys parvus")
records = occurrences(t, :limit => 300)

# Get the rest of the occurrences, we need to do this manually with a loop.
# Note: GBIF.jl uses non-standard semantics for `size`. This is comparing 
# the occurrances already downloaded with the total occurrances.
while length(records) < size(records)
    occurrences!(records)
end

# Extract the longitude/latitude value to a Vector of Tuple
coords = map(r -> (r.longitude, r.latitude), records)

# Get BioClim layers and subset to south-east australia
A = GeoStack(WorldClim{BioClim}, (1, 3, 7, 12))
SE_aus = A[X=Between(138, 155), Y=Between(-40, -25), Band=1]

# Plot BioClim predictors and scatter occurrence points on all subplots
p = plot(SE_aus);
foreach(i -> scatter!(p, coords; subplot=i, legend=:none), 1:4)
display(p)

# Extract predictor variables and write to CSV
predictors = extract(SE_aus, coords)
CSV.write("burramys_parvus_predictors.csv", predictors)

From my perspective it would be best to instead of the while loop do something like:

read!(occurences)

But if you were to use a while loop or anything similar, it would be something like:

while length(records) < total_occurences(records)
    occurrences!(records)
end

So that the non-standard meaning of size doesn't have to be explained. view(records) would give an error, for the same reason.

tpoisot commented 3 years ago

What about count for total_occurrences?

tpoisot commented 3 years ago

Re. read!, I don't think it's a good idea, because we have a number of cases where we do something at the end of every page of request.

tpoisot commented 3 years ago

And as a third point, I don't think the Array should grow. We know its size at the beginning, and back when it did grew with each page, the requests were getting slower and slower.

tpoisot commented 3 years ago

So I'd vote for option 1, with maybe count instead of total_occurrences - go ahead with a PR?