JuliaML / LearnBase.jl

Abstractions for Julia Machine Learning Packages

Refactor ObsDim out of LearnBase and update related interfaces #44

Closed · darsnack closed this 3 years ago

darsnack commented 3 years ago

This is in response to #41 and #42. Based on discussions, I removed ObsDim from LearnBase. Correspondingly, I propose we update the interfaces that relied on it to use an obsdim keyword argument instead. Here's a summary of what that means:

Additionally, nobs is now restricted to StatsBase. For example, in MLDataPattern.jl, I implement StatsBase.nobs instead of LearnBase.nobs.
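On the implementer side, the result would look roughly like the sketch below; MyContainer and its field are made up purely for illustration, not taken from the PR:

```julia
using StatsBase, LearnBase

# Hypothetical container (for illustration only): observations stored along the last dimension.
struct MyContainer
    data::Matrix{Float64}
end

# nobs is now extended from StatsBase rather than defined by LearnBase.
StatsBase.nobs(x::MyContainer) = size(x.data, 2)

# getobs still comes from LearnBase; the observation dimension is implicit here.
LearnBase.getobs(x::MyContainer, idx) = x.data[:, idx]
```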

I think that covers the bulk of the changes. I've done some crude REPL regression testing to make sure that these changes don't cause performance penalties for common use cases like data stored in arrays. cc @juliohm @Evizero @oxinabox @racinmat

darsnack commented 3 years ago

I also think the docs on this topic could use some love (which I am willing to do). Currently, they are hosted in the MLDataPattern.jl docs. Would it make sense to create some docs for LearnBase.jl and add a forward pointer to the section in the MLDataPattern.jl docs? Just so that someone looking to implement this interface can find the correct descriptions.

juliohm commented 3 years ago

Lovely @darsnack. Can you confirm that the code changes are mainly moving code from other.jl to observation.jl?

I agree with you :100: that the docs here need more love (#38). In my opinion, we should concentrate the documentation efforts in LearnBase.jl as opposed to downstream packages.

juliohm commented 3 years ago

Also, I would like to ask the other JuliaML members for permission to add @darsnack to the organization. He is putting a lot of effort into improving the ML ecosystem in Julia, and it would be nice to give him access to infrastructure setups, etc.

darsnack commented 3 years ago

Yeah, the code changes are: removing the ObsDim module and type, copying getobs to observation.jl (mostly this), and adding the default routing from getobs(x, idx, obsdim::Nothing) to getobs(x, idx).
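In code, that routing is essentially a one-liner (a sketch, not the literal diff):

```julia
# A `nothing` observation dimension falls through to the plain two-argument method.
getobs(data, idx, ::Nothing) = getobs(data, idx)
```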

johnnychen94 commented 3 years ago

getobs(x, idx, ::ObsDim.Undefined) => getobs(x, idx, obsdim::Nothing): LearnBase.jl now contains the default routing that maps getobs(x, idx, ::Nothing) to getobs(x, idx)

Since ObsDim is removed, the dispatch-based API is useless and this default routing can be removed.

Also, I believe things will be clearer if we consistently make obsdim a keyword, as most comparable functions in Base (cat, eachslice) do. And obsdim seems like a better name than dims.
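For arrays, that could look like the sketch below; treating the last dimension as the default is an assumption here, mirroring default_obsdim:

```julia
# Sketch: keyword-style obsdim for arrays, analogous to the `dims` keyword of
# `cat` and `eachslice` in Base. Assumes observations live along the last dimension.
default_obsdim(A::AbstractArray) = ndims(A)
getobs(A::AbstractArray, idx; obsdim = default_obsdim(A)) = collect(selectdim(A, obsdim, idx))
```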

darsnack commented 3 years ago

What about the case where someone wants to implement getobs for their data container, and observation dimensions don't make sense for that data container? Currently, the developer only needs to define getobs(x::MyContainer, idx), and the default routing ensures that a call like getobs(x::MyContainer, idx; obsdim = default_obsdim(x)) still routes to the getobs(x, idx) defined by the developer of MyContainer.

If we drop support for this, then the developer needs to implement getobs(x, idx, obsdim) and ignore the obsdim argument. That's not the end of the world; I just want to make the trade-off clear. Personally, I agree that there should only be getobs(x, idx; obsdim) as an interface function, but I was trying to maintain some of the old functionality. Happy to make the switch though.

Also, a non-keyword-based method is required to support the above behavior, since we can't dispatch on the types of keyword arguments.
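Concretely, the keyword entry point would just forward to a positional method where dispatch can happen (sketch):

```julia
# Keyword arguments do not participate in dispatch, so the keyword form forwards
# to a positional method; specializations such as `obsdim::Nothing` attach there.
getobs(data, idx; obsdim = default_obsdim(data)) = getobs(data, idx, obsdim)
```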

johnnychen94 commented 3 years ago

What about the case where someone wants to implement getobs for their data container, and observation dimensions don't make sense for that data container?

As long as getindex(x::MyContainer, ind...) is well defined, this won't be an issue. ObsDim is definitely a concept in the array world and I don't think there will be a non-trivial non-array data container type in practice.
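Something along these lines would cover it (a sketch; the plain indexing fallback is my assumption):

```julia
# Sketch: when no specialized getobs method exists, fall back to plain indexing.
getobs(data, idx) = data[idx]

getobs(["a", "b", "c"], 2)  # == "b", via the getindex fallback
```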

Also, the core getobs logic defined in https://github.com/JuliaML/MLDataPattern.jl/blob/ac4227ebb8ddefb87d480c6bd80f784c7971aac6/src/container.jl#L130-L144 can be simplified a lot without using a generated function.

darsnack commented 3 years ago

As long as getindex(x::MyContainer, ind...) is well defined, this won't be an issue. ObsDim is definitely a concept in the array world and I don't think there will be a non-trivial non-array data container type in practice.

I do see your point. So would the change be to make the interface getobs(x, idx; obsdim) and default to getindex(x, idx) when the method is undefined?

Also, the core getobs logic defined in https://github.com/JuliaML/MLDataPattern.jl/blob/ac4227ebb8ddefb87d480c6bd80f784c7971aac6/src/container.jl#L130-L144 can be simplified a lot without using a generated function.

Agreed. I created https://github.com/JuliaML/MLDataPattern.jl/pull/50 which removes a lot of the excess code.

johnnychen94 commented 3 years ago

So would the change be to make the interface getobs(x, idx; obsdim) and default to getindex(x, idx) when the method is undefined?

Yes, and all the complicated dispatch routes get simplified to getobs(x, idx; obsdim=default_obsdim(x)).

darsnack commented 3 years ago

Just a note on that change. If we use keyword arguments as the interface, then we can't dispatch on the type of obsdim, yet we still need a method like LearnBase.getobs(tup::Tuple, indices, obsdims::Tuple). This could be handled with special routing for the Tuple case, but it might be cleaner to support dispatch on obsdim. It adds flexibility, and it can be supported with a single line of routing logic:

```julia
LearnBase.getobs(data, idx; obsdim = default_obsdim(data)) = getobs(data, idx, obsdim)
```
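With the positional method in place, the Tuple case stays an ordinary dispatch rule, e.g. (sketch, not the exact MLDataPattern code):

```julia
# Sketch: per-element observation dimensions for a tuple of data containers.
getobs(tup::Tuple, idx, obsdims::Tuple) = map((data, d) -> getobs(data, idx, d), tup, obsdims)
```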
Evizero commented 3 years ago

Also, I would like to ask the other JuliaML members for permission to add darsnack to the organization. He is putting a lot of effort into improving the ML ecosystem in Julia, and it would be nice to give him access to infrastructure setups, etc.

@juliohm As far as I am concerned, I trust your judgment in these matters. Also, I just noticed you were a "member" of the org, so I upgraded your membership status, as you are as much an "owner" as I have ever been. Thank you for all that you do.

juliohm commented 3 years ago

Thank you @Evizero, this organization wouldn't even exist without your work. I appreciate it. I will go ahead and add @darsnack as a member of the organization.

darsnack commented 3 years ago

Just an update, I have ported the changes from this PR to MLLabelUtils. MLDataPattern is in progress. I have finished refactoring the portions of MLDataPattern that don't involve the targets. Now that MLLabelUtils is ready, I can refactor the targets.

Hoping to get this done this week so I can stop having it hang around the back of my brain.

johnnychen94 commented 3 years ago

Although I haven't participated, I'm very glad to see the enormous progress on the Flux side.

CarloLucibello commented 3 years ago

What is still missing here, then? Changes can be propagated to downstream packages with no hurry once we merge this PR and tag a major release.

juliohm commented 3 years ago

I've merged #45 to migrate from Travis/AppVeyor to GitHub CI. Thanks @CarloLucibello for the update.

I will try to close and reopen this PR to see if the GitHub Actions are triggered now.

darsnack commented 3 years ago

Think I will need to rebase for CI to trigger correctly?

@CarloLucibello no changes left, but since this is just an interface, we can't test whether it will work without refactoring the downstream packages.