Alexander-Barth / NCDatasets.jl

Load and create NetCDF files in Julia
MIT License

`@select` with arbitrary (or a few specific) functions #203

Closed haakon-e closed 1 year ago

haakon-e commented 1 year ago

I primarily want to do this:

julia> NCDatasets.@select(ds, ismissing(temperature))

# or

julia> NCDatasets.@select(ds, temperature |> ismissing)

particularly since missing is the default FillValue as I understand it.

More generally, this could potentially open the door for arbitrary one- (or multiple-) argument functions.

Alexander-Barth commented 1 year ago

I agree that this would be nice! I guess we should allow then any kind of operator/function.

(this probably already works right now, but it is clunky :-)

NCDatasets.@select(ds, ismissing(temperature) == true)

)

haakon-e commented 1 year ago

Yeah, in principle any function/operator should be allowed, but it may necessitate cleaning up or rethinking the code a bit (e.g. your comment regarding eval, etc).

I got around it yesterday by doing what @select essentially does (at least how I understand it), e.g.:

nonmissing_inds = findall(!ismissing, ds["temperature"][:])
sub_ds = view(ds; time = nonmissing_inds)

where temperature is my relevant data field and time is the data dimension. Thanks for the tip with @select though!
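For readers without a dataset at hand, the same mask-then-index idea can be sketched on a plain vector. This is only an illustration: `temperature` below stands in for `ds["temperature"][:]`, the values are made up, and `view` on an array plays the role of `view(ds; time = nonmissing_inds)`.

```julia
# Stand-in for ds["temperature"][:]; the values are invented for illustration.
temperature = [12.5, missing, 14.1, missing, 15.0]

# Indices along the (single) time dimension where data is present.
nonmissing_inds = findall(!ismissing, temperature)

# A view avoids copying, similar in spirit to view(ds; time = nonmissing_inds).
sub_temperature = view(temperature, nonmissing_inds)
```

Here `nonmissing_inds` is `[1, 3, 5]`, and `sub_temperature` exposes only the three non-missing values.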

Alexander-Barth commented 1 year ago

As of this commit, https://github.com/Alexander-Barth/NCDatasets.jl/commit/fa3744dac8771d305be15854a30c59ea856d651a, all functions (returning a Bool) are accepted in @select.

This should work with the master version:

NCDatasets.@select(ds, ismissing(temperature))
NCDatasets.@select(ds, temperature |> ismissing)

Somewhat related: I think that it would also be possible to allow a functional form (without a macro):

NCDatasets.select(ds, temperature -> ismissing(temperature)) # not yet implemented

Using this trick: https://discourse.julialang.org/t/get-the-argument-names-of-an-function/32902

This would avoid eval and the module scoping issue (https://github.com/Alexander-Barth/NCDatasets.jl/issues/200#issuecomment-1397281090). This syntax would be in addition to the macro (and the macro would call the select function, if it gets implemented). Would this be useful?
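A minimal sketch of how the argument names of an anonymous function might be recovered, so that `temperature -> ...` could be matched to a variable called `temperature`. The helper name `argnames` is hypothetical, and `Base.method_argnames` is an unexported Julia internal, so this is an assumption about one possible approach and may break between Julia versions.

```julia
# Hypothetical helper: return the positional argument names of a function.
# Base.method_argnames is internal (unexported) and not part of Julia's public API.
function argnames(f)
    m = first(methods(f))                    # assumes f has a single method
    return Base.method_argnames(m)[2:end]    # drop the leading slot for the function itself
end

names_single = argnames(temperature -> ismissing(temperature))
names_multi = argnames((T, S) -> T > 20 || S < 34)
```

With these names in hand, a `select` function could look up `ds[String(name)]` for each argument instead of relying on eval.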

Alexander-Barth commented 1 year ago

I did not implement NCDatasets.select(ds, temperature -> ismissing(temperature)), but these calls work now:

https://github.com/Alexander-Barth/NCDatasets.jl/blob/ca2e8ad9ba08c8998e8f987295b182951869f5c8/test/test_select.jl#L270

https://github.com/Alexander-Barth/NCDatasets.jl/blob/ca2e8ad9ba08c8998e8f987295b182951869f5c8/test/test_select.jl#L331

Or without a macro (but not documented because I am not quite sure about the syntax):

https://github.com/Alexander-Barth/NCDatasets.jl/blob/ca2e8ad9ba08c8998e8f987295b182951869f5c8/test/test_select.jl#L103

haakon-e commented 1 year ago

Thanks! The first two will be very useful going forward.

The syntax for the last one is a little confusing at first glance, but I'm happy to assist with iterating on the syntax whenever you want to revisit that.

Alexander-Barth commented 1 year ago

The last syntax is actually borrowed from DataFrames.jl

https://juliadatascience.io/filter_subset

The general idea is to have :parameter_name => function_name (like :data => ismissing or :data => !ismissing). DataFrames.jl is quite popular in the Julia data community.
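To illustrate the `:parameter_name => function_name` idea without a dataset, here is a small sketch using only Julia Base. The helper `select_indices` and the NamedTuple `vars` are hypothetical stand-ins, not part of NCDatasets.jl or DataFrames.jl.

```julia
# Hypothetical helper: `vars` maps variable names to vectors along one dimension.
# The pair `name => f` selects the indices where f(value) is true.
function select_indices(vars::NamedTuple, cond::Pair{Symbol,<:Function})
    name, f = cond
    return findall(f, vars[name])
end

vars = (data = [1.0, missing, 3.0],)   # invented values
missing_inds = select_indices(vars, :data => ismissing)
present_inds = select_indices(vars, :data => !ismissing)
```

Note that `!ismissing` is itself a function in Julia, which is why it can appear on the right-hand side of the pair.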

The alternative

NCDatasets.select(ds, temperature -> ismissing(temperature)) # not yet implemented

unfortunately requires an unexported function from Julia Base or calling Julia's C API. Do you have another idea?

haakon-e commented 1 year ago

I haven't used DataFrames.jl, but looking through the documentation you sent, I do see how it can be explained in a good way. So I would be in favor of this as long as we are able to document it well. I am happy to hear that users may benefit from being familiar with the syntax from another package.

After reading about it, one thing I do like about their approach is that you can do something like

function hightemp_or_lowsalinity(temp, salinity)
    return temp > 20 || salinity < 34
end

NCDatasets.select(ds, (:temperature, :salinity) => hightemp_or_lowsalinity)
# or
NCDatasets.select(ds, (:temperature, :salinity) => ((T, S) -> T > 20 || S < 34))

which afaik wasn't really possible in the current @select implementation since it assumed && for different conditions.
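The multi-variable case can be sketched with Julia Base alone. The helper `select_any` and the data are hypothetical; the point is only that a tuple of names paired with a function allows an arbitrary combination (here `||`) rather than an implied `&&`.

```julia
# Hypothetical helper: apply `f` elementwise across the named variables
# and return the indices where it is true.
function select_any(vars::NamedTuple, cond::Pair)
    names, f = cond
    columns = map(n -> vars[n], names)      # tuple of vectors, one per name
    return findall(map(f, columns...))      # elementwise f, then indices of `true`
end

# Invented values for illustration.
vars = (temperature = [18.0, 22.5, 19.0, 25.0],
        salinity    = [35.0, 35.2, 33.1, 36.0])

inds = select_any(vars, (:temperature, :salinity) => ((T, S) -> T > 20 || S < 34))
```

Here `inds` is `[2, 3, 4]`: the third element is selected only because of low salinity, which an `&&`-only combination could not express.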


On the note of @select: would it be possible to avoid eval and similar issues if you directly passed CFVariables? For example:

temperature = ds["temperature"]
salinity = ds["salinity"]
NCDatasets.@select(ds, some_function(arg1, temperature, arg2) && 34 < salinity < 36)

It's not as compact as the current implementation, but in theory it would avoid the issue of not knowing whether a symbol is a variable/dimension of the dataset. Or am I missing something?