Closed haakon-e closed 1 year ago
I agree that this would be nice! I guess we should then allow any kind of operator/function.
(This probably already works right now, but it is clunky: `NCDatasets.@select(ds, ismissing(temperature) == true)` :-)
Yeah, in principle any function/operator should be allowed, but it may necessitate cleaning up or rethinking the code a bit (e.g. your comment regarding `eval`, etc).
I got around it yesterday by doing what `@select` essentially does (at least how I understand it), e.g.:
nonmissing_inds = findall(!ismissing, ds["temperature"][:])
sub_ds = view(ds; time = nonmissing_inds)
where `temperature` is my relevant data field and `time` is the data dimension. Thanks for the tip with `@select` though!
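The index-based workaround above can be tried without a NetCDF file. The following is only a minimal sketch on a plain Julia vector: the values are made up, and the vector merely stands in for `ds["temperature"][:]`.

```julia
# Toy stand-in for ds["temperature"][:]; the values are invented for illustration.
temperature = [12.5, missing, 14.1, missing, 13.0]

# `!ismissing` is the negated predicate; findall returns the indices where it holds.
nonmissing_inds = findall(!ismissing, temperature)

# A view over those indices references the original data without copying it.
nonmissing_temperature = view(temperature, nonmissing_inds)
```

With a real dataset, the same indices would be passed to `view(ds; time = nonmissing_inds)` as in the snippet above.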
As of this commit, https://github.com/Alexander-Barth/NCDatasets.jl/commit/fa3744dac8771d305be15854a30c59ea856d651a, all functions (returning a `Bool`) are accepted in `@select`.
This should work with the master version:
NCDatasets.@select(ds, ismissing(temperature))
NCDatasets.@select(ds, temperature |> ismissing)
Somewhat related: I think it would also be possible to allow a functional form (without a macro):
NCDatasets.select(ds, temperature -> ismissing(temperature)) # not yet implemented
This would use the following trick: https://discourse.julialang.org/t/get-the-argument-names-of-an-function/32902
It would avoid `eval` and the module scoping issue (https://github.com/Alexander-Barth/NCDatasets.jl/issues/200#issuecomment-1397281090). This syntax would be in addition to the macro (and the macro would call the `select` function if it is implemented). Would this be useful?
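To make the argument-name trick concrete, here is a minimal sketch of how the parameter name of an anonymous function could be recovered. It assumes the unexported `Base.method_argnames`, which is not part of Julia's public API and may change between versions; it is not necessarily what NCDatasets would end up using.

```julia
# The function whose parameter name should identify the dataset variable.
f = temperature -> ismissing(temperature)

# An anonymous function defined this way has exactly one method.
m = first(methods(f))

# Assumption: Base.method_argnames is unexported and may change between Julia versions.
# Its first entry refers to the function object itself, so it is skipped here.
argnames = Base.method_argnames(m)[2:end]
```

A `select` function could then look up each recovered name in the dataset, so no `eval` would be needed.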
I did not implement `NCDatasets.select(ds, temperature -> ismissing(temperature))`, but these calls work now:
Or without a macro (but not documented because I am not quite sure about the syntax):
Thanks! The first two will be very useful going forward.
The syntax for the last one is a little confusing at first glance, but I'm happy to assist with iterating on the syntax whenever you want to revisit that.
The last syntax is actually borrowed from DataFrames.jl:
https://juliadatascience.io/filter_subset
The general idea is to have `:parameter_name => function_name` (like `:data => ismissing` or `:data => !ismissing`).
DataFrames.jl is quite popular in the Julia data community.
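As an illustration of the `:parameter_name => function_name` idea, here is a small sketch on a `NamedTuple` of plain vectors. The helper `matching_indices` is hypothetical and not part of NCDatasets or DataFrames.jl; it only shows how a `Pair` can carry both the variable name and the predicate.

```julia
# Hypothetical helper (not an NCDatasets function): apply `name => predicate`
# to a NamedTuple of vectors and return the indices where the predicate holds.
function matching_indices(data::NamedTuple, condition::Pair{Symbol})
    name, predicate = condition          # a Pair destructures into its two parts
    return findall(predicate, data[name])
end

# Made-up data standing in for dataset variables.
data = (temperature = [12.5, missing, 14.1], depth = [0.0, 5.0, 10.0])

missing_inds = matching_indices(data, :temperature => ismissing)
present_inds = matching_indices(data, :temperature => !ismissing)
```

Because the name is an ordinary `Symbol`, it can be looked up directly, without inspecting the function or evaluating any expression.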
The alternative
NCDatasets.select(ds, temperature -> ismissing(temperature)) # not yet implemented
unfortunately requires an unexported function from Julia Base or calling Julia's C API. Do you have another idea?
I haven't used DataFrames.jl, but looking through the documentation you sent, I do see how it can be explained in a good way. So I would be in favor of this as long as we are able to document it well. I am happy to hear that users may benefit from being familiar with the syntax from another package.
After reading about it, one thing I do like about their approach is that you can do something like
function hightemp_or_lowsalinity(temp, salinity)
    return temp > 20 || salinity < 34
end
NCDatasets.select(ds, (:temperature, :salinity) => hightemp_or_lowsalinity)
# or
NCDatasets.select(ds, (:temperature, :salinity) => ((T, S) -> T > 20 || S < 34))
which afaik wasn't really possible in the current `@select` implementation since it assumed `&&` for different conditions.
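The multi-variable form can be sketched in the same spirit. The helper `matching_indices_multi` below is hypothetical (it is not an NCDatasets or DataFrames.jl function) and works on made-up vectors; it only illustrates how a tuple of names paired with a function allows conditions joined by `||`.

```julia
# Hypothetical helper: apply `(names...) => predicate` elementwise across
# several equally long vectors and return the indices where it holds.
function matching_indices_multi(data::NamedTuple, condition::Pair)
    names, predicate = condition
    columns = map(name -> data[name], names)   # tuple of vectors, one per name
    return findall(predicate.(columns...))     # broadcast over the elements
end

# Made-up values standing in for dataset variables.
data = (temperature = [18.0, 22.0, 19.0], salinity = [35.0, 35.5, 33.0])

inds = matching_indices_multi(
    data,
    (:temperature, :salinity) => ((T, S) -> T > 20 || S < 34),
)
```

Note that the anonymous function has to be wrapped in parentheses: without them, Julia would parse `T` and `S -> ...` as separate arguments of the call.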
On the note of `@select`: would it be possible to avoid `eval` and similar issues if you directly passed `CFVariables`? For example:
temperature = ds["temperature"]
salinity = ds["salinity"]
NCDatasets.@select(ds, some_function(arg1, temperature, arg2) && 34 < salinity < 36)
It's not as compact as the current implementation, but in theory it would avoid the issue of not knowing whether a symbol is a variable/dimension of the dataset. Or am I missing something?
I primarily want to do this:
particularly since `missing` is the default `FillValue` as I understand it. More generally, this could potentially open the door for arbitrary one- (or multiple-) argument functions.