Closed rafaqz closed 2 years ago
Starting with a 0-length array (which was the case at some point) is obviously a terrible idea for performance reasons, so this will not happen.
While I do get your point about the semantics, I think we can have some flexibility for now - the package works, the documentation is updated, the code has been reviewed a few times. If we want to change all of that, with all that implies for both the user-facing code and the internals, there needs to be a discussion prior to opening issues, or at least suggestions. Not only about read!
(which I don't think would make sense for most users), but about how to rename the number of effective records and the total number of records.
As I mentioned in read!
I was hoping to use GBIF.jl in teaching examples, and for documentation examples in GeoData.jl, e.g. for extract
. It's the perfect package to use to demonstrate getting some point observations, and extracting variables for them. But code examples cant really contain base julia methods with non-standard semantics.
I was hoping to make some improvements to the package, and willing to do all of the work, and wouldn't do anything that reduces the performance.
Not that I am unwilling to entertain the idea - but I need a concrete proposal, and I need this before we introduce read!
(which I think is not a user-friendly name, but we'll discuss this later).
I am really grateful that you want to make improvements to the package, but as it's developed and heavily used in the lab, I want to make sure that we discuss things before writing code, so as not to waste your time writing it and our time reviewing it.
If you could reply with a proposed fix coupled to each of your issues, we can discuss this before moving on to a PR.
Methods to modify:
length
: feels like it should be the number of records downloaded already
size
: is defined in base as (length,)
for an array. We should do the same here.
view(::GBIFRecords
): probably shouldn't be defined here. In base it returns a zero dimensional SubArray
There are two main ways to do this.
size
(the length of the array) becomes e.g. total_occurrences
.length
(the number of fetched records) stays as length
view
becomes e.g records
which calls view(o.occuurences, length(o))
. Array
/vec
/collect
etc could all call records
. getindex
is a little odd... does it index o.occurrences
or records(o)
? I'm not sure.Generally in this case array interface methods like length
have to be manipulated to give the illusion of a shorter array of GBIFRecord
than we actually have.
push!
. This isn't slow compared to how long a HTTP request takes.GBIFRecords
is then <: AbstractArray
and can be a direct forwarding of the whole interface to o.occurrences
. It can just be indexed directly as none of the indices are undef
. So view
isn't needed at all, but you can use view
on it in the normal way to take views of indices.total_occurrences
is then a field we store in the GBIFRecords
object, to track how many there are in total, and how meany are left to download.In this case the array interface makes sense for GBIFRecords
as-is - it's just a vector or GBIFRecord
, with some metadata - the original request and the total_occurrences
field.
Any comments on this? It would be good to clarify this before the upcoming course. We really should be using GBIF.jl in it.
For a real world demonstration, this will be in the GeoData.jl docs and used in teaching:
# A basic species distribution modelling workflow
using GeoData, GBIF, CSV, Plots
# Load occurrences for the Mountain Pygmy Possum
t = taxon("Burramys parvus")
records = occurrences(t, :limit => 300)
# Get the rest of the occurrences, we need to do this manually with a loop.
# Note: GBIF.jl uses non-standard semantics for `size`. This is comparing
# the occurrances already downloaded with the total occurrances.
while length(records) < size(records)
occurrences!(records)
end
# Extract the longitude/latitude value to a Vector of Tuple
coords = map(r -> (r.longitude, r.latitude), records)
# Get BioClim layers and subset to south-east australia
A = GeoStack(WorldClim{BioClim}, (1, 3, 7, 12))
SE_aus = A[X=Between(138, 155), Y=Between(-40, -25), Band=1]
# Plot BioClim predictors and scatter occurrence points on all subplots
p = plot(SE_aus);
foreach(i -> scatter!(p, coords; subplot=i, legend=:none), 1:4)
display(p)
# Extract predictor variables and write to CSV
predictors = extract(SE_aus, coords)
CSV.write("burramys_parvus_predictors.csv", predictors)
From my perspective it would be best to instead of the while
loop do something like:
read!(occurences)
But if you were to use a while loop or anything similar, it would be something like:
while length(records) < total_occurences(records)
occurrences!(records)
end
So that the non-standard meaning of size
doesn't have to be explained. view(records)
would give an error, for the same reason.
What about count
for total_occurrences
?
Re. read!
, I don't think it's a good idea, because we have a number of cases where we do something at the end of every page of request.
And as a third point, I don't think the Array should grow. We know its size at the beginning, and back when it did grew with each page, the requests were getting slower and slower.
So I'd vote for option 1, with maybe count
instead of total_occurrences
- go ahead with a PR?
size
andview
currently don't use base semantics. Single-argsize(x)
usually returns aTuple
, andview(x)
normally returns a zero dimensionalSubArray
.It might make sense if values added with
occurances!
are appended to an existing array that would start with zero length, so that none of these things are even needed. ThenGBIFRecords <: AbstractVector
would makes sense too, and all of these methods would have the normal semantics.length
would just be the length of the parentArray
.@mkborregaard that also means using
collect
to get all the records might not make sense? as it should return anArray
but we actually still want a GBIFRecords object, and if it collected all the records, it also ends up longer than whatlength
returns for theGBIFRecords
- which is also kind of confusing for howcollect
usually works.I feel like this is more like reading a stream to an array, so it's possible we could use e.g.
read!(oc)
to mean read the whole stream?