Closed darsnack closed 3 years ago
I also think the docs on this topic could use some love (which I am willing to do). Currently, they are in hosted MLDataPattern.jl. Would it make sense to create some docs for LearnBase.jl and add a forward pointer to the section in the MLDataPattern.jl docs? Just so that someone looking to implement this interface can find the correct descriptions.
Lovely @darsnack. Can you confirm that the code changes are mainly moving code from other.jl to observation.jl?
I agree with you :100: percent that we need more love in the docs here #38 In my opinion we should be concentrating the documentation efforts in LearnBase.jl as opposed to downstream packages.
Also, I would like to ask the permission of other JuliaML members to add @darsnack to the organization? He is putting a lot of effort to improve the ML ecosystem in Julia, and it would be nice to give him access to infrastructure setups, etc.
Yeah the code changes are removing the ObsDim
module and type, copying getobs
to observation.jl (mostly this), and adding the default routing when obsdim::Nothing
to getobs(x, idx)
.
getobs(x, idx, ::ObsDim.Undefined)
=>getobs(x, idx, obsdim::Nothing)
: LearnBase.jl now contains the default routing that mapsgetobs(x, idx, ::Nothing)
togetobs(x, idx)
Since ObsDim
is removed, the dispatching-based API is useless and this default routing can be removed.
Also, I believe things will be clearer if we consistently make obsdim
a keyword, as most of the other functions (cat
, eachslice
) in Base do. And it seems like obsdim
is a better choice to dims
.
What about the case where someone wants to implement getobs
for their data container, and observation dimensions don't make sense for that data container? Currently, the developer only needs to define getobs(x::MyContainer, idx)
and the default routing will ensure that if someone calls getobs(x::MyContainer, idx; obsdim = default_obsdim(x))
still routes to the getobs(x, idx)
defined by the developer of MyContainer
.
If we drop support for this, then the developer needs to implement getobs(x, idx, obsdim)
and ignore the obsdim
argument. That's not the end of the world, just making clear the trade-off. Personally, I agree that there should only be getobs(x, idx; obsdim)
as an interface function, but I was trying to maintain some of the old functionality. Happy to make the switch though.
Also, a non-keyword based function is required to support the above behavior, since we can't dispatch on the type of keyword arguments.
What about the case where someone wants to implement getobs for their data container, and observation dimensions don't make sense for that data container?
As long as getindex(x::MyContainer, ind...)
is well defined, this won't be an issue. ObsDim
is definitely a concept in the array world and I don't think there will be a non-trivial non-array data container type in practice.
Also, the core logic in getobs
is defined in
https://github.com/JuliaML/MLDataPattern.jl/blob/ac4227ebb8ddefb87d480c6bd80f784c7971aac6/src/container.jl#L130-L144 can be simplified a lot without using generated function.
As long as getindex(x::MyContainer, ind...) is well defined, this won't be an issue. ObsDim is definitely a concept in the array world and I don't think there will be a non-trivial non-array data container type in practice.
I do see your point. So would the change be to make the interface getobs(x, idx; obsdim)
and default to getindex(x, idx)
when the method is undefined?
Also, the core logic in getobs is defined in https://github.com/JuliaML/MLDataPattern.jl/blob/ac4227ebb8ddefb87d480c6bd80f784c7971aac6/src/container.jl#L130-L144 can be simplified a lot without using generated function.
Agreed. I created https://github.com/JuliaML/MLDataPattern.jl/pull/50 which removes a lot of the excess code.
So would the change be to make the interface getobs(x, idx; obsdim) and default to getindex(x, idx) when the method is undefined?
Yes, and all the complicated dispatch routes get simplified to getobs(x, idx; obsdim=default_obsdim(x))
.
Just a note on that change. If we use keyword arguments as the interface, then we can't dispatch on the type of obsdim
, and we need to dispatch on LearnBase.getobs(tup::Tuple, indices, obsdims::Tuple)
. This can be fixed by doing special routing for the Tuple
case, but it might be cleaner to support dispatch on obsdim
. It adds flexibility, and it can be easily supported with a single line of routing logic:
LearnBase.getobs(data, idx; obsdim = default_obsdim(data)) = getobs(data, idx, obsdim)
Also, I would like to ask the permission of other JuliaML members to add darsnack to the organization? He is putting a lot of effort to improve the ML ecosystem in Julia, and it would be nice to give him access to infrastructure setups, etc.
@juliohm As far as I am concerned I trust your judgment in these matters. Also I just noticed your were a "member" of the org so I upgraded your membership status as you are as much an "owner" as I have ever been. Thank you for all that you do
Thank you @Evizero, this organization wouldn't even exist without your work. I appreciate it. I will go ahead and add @darsnack as a member of the organization.
Just an update, I have ported the changes from this PR to MLLabelUtils. MLDataPattern is in progress. I have finished refactoring the portions of MLDataPattern that don't involve the targets. Now that MLLabelUtils is ready, I can refactor the targets.
Hoping to get this done this week so I can stop having it hang around the back of my brain.
Although not participated, I'm so glad to see the enormous progress on the Flux side.
What is missing here instead? Changes can be propagated with no hurry in downstream packages once we merge this PR and tag a major release
I've merged #45 to migrate from Travis/AppVeyor to GitHub CI. Thanks @CarloLucibello for the update.
I will try to close and reopen this PR to see if the GitHub Actions are triggered now.
Think I will need to rebase for CI to trigger correctly?
@CarloLucibello no changes left, but since this is just an interface, we can't test whether it will work without refactoring the downstream packages.
This is in response to #41 and #42. Based on discussions, I removed
ObsDim
from LearnBase. Correspondingly, I propose we update the interfaces that relied on it to use aobsdim
keyword argument instead. Here's a summary of what that means:getobs(x, idx, ::ObsDim.Undefined)
=>getobs(x, idx, obsdim::Nothing)
: LearnBase.jl now contains the default routing that mapsgetobs(x, idx, ::Nothing)
togetobs(x, idx)
. A developer implementing the interface can still choose to only implementgetobs(x, idx)
if the observation dimension is nonsensical.getobs(x, idx; obsdim = default_obsdim(x))
: This is the keyword version ofgetobs
. By default,default_obsdim(x) = nothing
, and for arraysdefault_obsdim(x) = ndims(x)
(in line with popular packages like Flux.jl). Most importantly, the docs will make it clear that the non-keyword argument functions are the ones to implement. This keyword argument version is for end-user convenience.getobs(x, obsdim)
: This is no longer going to be supported. It conflicts for dispatch withgetobs(x, idx)
, and the routing logic required to make multiple dispatch work is not worth it in my opinion. I believe this function and the correspondinggetobs(x)
existed for cases where fetching all observations at once is more efficient than fetching all observations one at a time. I think we can still support this optimization by dispatching ongetobs(x, 1:nobs(x))
since1:nobs(x) <: UnitRange
.getobs(x)
: Same as above.Additionally,
nobs
is now restricted toStatsBase
. For example, in MLDataPattern.jl, I implementStatsBase.nobs
instead ofLearnBase.nobs
.I think that covers the bulk of the changes. I've done some crude REPL regression testing to make sure that these changes don't cause performance penalties for common use cases like data stored in arrays. cc @juliohm @Evizero @oxinabox @racinmat