JuliaML / LossFunctions.jl

Julia package of loss functions for machine learning.
https://juliaml.github.io/LossFunctions.jl/stable

Add support for CategoricalArrays.jl #132

Closed · juliohm closed 4 years ago

juliohm commented 4 years ago

This PR is work-in-progress. It attempts to generalize the functionality of the package to support CategoricalArray and CategoricalValue from CategoricalArrays.jl. Fixes #117.

I will share a comment here when it is ready for review.

joshday commented 4 years ago

Is there a way to make this work without adding the 5 (direct and indirect) dependencies?

I think just using DataAPI would add support for CategoricalArrays/PooledArrays, etc.

OkonSamuel commented 4 years ago

@juliohm. Sorry to intrude. You may want to remove LossFunctions as a dependency in the docs/Project.toml file; that should be what's causing the failure. Also, you may want to consider adding docs/Manifest.toml to the .gitignore file.

juliohm commented 4 years ago

Is there a way to make this work without adding the 5 (direct and indirect) dependencies?

I also considered other options, including this issue that I opened in CategoricalArrays.jl to see if a CategoricalValue could simply be a subtype of Number: https://github.com/JuliaData/CategoricalArrays.jl/issues/268 I am divided on the issue. What is your point of view, @joshday? Should we push forward the idea of CategoricalValue <: Number, or rely on scientific types and traits instead? This second option could be addressed in a separate PR.

I think just using DataAPI would add support for CategoricalArrays/PooledArrays, etc.

Nice. I was unaware of the DataAPI.jl effort :+1: Unfortunately it doesn't have the CategoricalValue name defined there, but we could open an issue. I am just trying to brainstorm more carefully what the best path forward is. Part of me thinks that most scientific methods out there actually discuss scientific types as opposed to machine types or user types like CategoricalValue, Float64, etc. We could perhaps consider the ScientificTypes.jl package + traits.

Meanwhile, I will finish this PR with some preliminary implementation, and then we can get back to this more profound issue on "data type dependencies" in the Julia ecosystem. It is a big issue. Thanks for pointing it out.

juliohm commented 4 years ago

@juliohm. Sorry to intrude. You may want to remove LossFunctions as a dependency in the docs/Project.toml file; that should be what's causing the failure. Also, you may want to consider adding docs/Manifest.toml to the .gitignore file.

Thank you @OkonSamuel, I will fix the docs. For some reason I've committed some unnecessary changes to the docs Project.toml + Manifest.toml files.

joshday commented 4 years ago

Unfortunately it doesn't have the CategoricalValue name defined there

Take a look at the refvalue function. I think that covers your use case, but it looks like CategoricalArrays doesn't implement it (yet).
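For context, here is a minimal sketch of the DataAPI contract, using only the generic fallbacks (pooled or categorical array types can override refarray/refvalue to expose integer reference codes):

using DataAPI

v = ["a", "b", "a"]

# generic fallbacks: refarray returns the array itself and refvalue the
# value, so refvalue(v, refarray(v)[i]) == v[i] holds trivially
refs = DataAPI.refarray(v)
DataAPI.refvalue(v, refs[1])   # "a"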

juliohm commented 4 years ago

Take a look at the refvalue function. I think that covers your use case, but it looks like CategoricalArrays doesn't implement it (yet).

Hi @joshday, could you please clarify the usage of refvalue? I tried reading the docs, but it isn't very clear to me how we could use it in this case.

You may want to remove LossFunctions as a dependency in the docs/Project.toml file

@OkonSamuel could you please clarify what is happening? The docs build successfully in my local machine. Also I have the same setup in GeoStats.jl docs for example and never had an issue.

OkonSamuel commented 4 years ago

@OkonSamuel could you please clarify what is happening? The docs build successfully in my local machine. Also I have the same setup in GeoStats.jl docs for example and never had an issue.

@juliohm The package name is already an implicit dependency, so there is no need to add it. Also, the main cause of the doc build error is in the docs/Manifest.toml file: the Julia registry doesn't recognize version 0.5.2 with the given UUID, causing an error. You can confirm this by trying to get v0.5.2 of the LossFunctions package on your system. PS: that is why I recommended removing the docs/Manifest.toml file.

juliohm commented 4 years ago

Thank you @OkonSamuel, I get the issue now. The incorrect UUID I committed some time ago is still causing issues. I will delete the Manifest.toml and commit a new one. It is usually a good idea to commit the Manifest.toml in the docs and update it every now and then; that is what I've learned from other project maintainers.

johnnychen94 commented 4 years ago

https://github.com/JuliaML/LossFunctions.jl/pull/130#issuecomment-614005327

The documentation failure can be fixed by removing docs/Manifest.toml; it locks a different uuid of LossFunctions (ref: ad125ad)

Checking in a docs/Manifest.toml had advantages in the pre-1.0 epoch, since it locked dependency versions. Nowadays, since we have better tools (e.g., Pkg, CompatHelper, Registrator, Documenter) to maintain packages and people treat semantic versions seriously, it's okay to only keep a docs/Project.toml; this exposes issues as early as possible, and a broken dev docs build could be acceptable.

Instead of checking in docs/Manifest.toml, maintaining a compat section for docs/Project.toml is a better practice IMO.
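For illustration, such a compat section in docs/Project.toml could look like this (the version bounds here are hypothetical):

[compat]
Documenter = "0.24"
julia = "1"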

You may want to remove LossFunctions as a dependency in docs/Project.toml file.

@OkonSamuel It's okay to still keep LossFunctions, since it helps to build the docs on a local machine. For CI purposes, the following command in the travis config checks out LossFunctions at the dev version. It's very flexible as long as we don't check in the docs/Manifest.toml.

- julia --project=docs/ -e 'using Pkg; Pkg.add(PackageSpec(path=pwd()))'

nalimilan commented 4 years ago

I don't think refvalue would help, as it's mainly useful for performance. Here, what @juliohm seems to be looking for is a way to treat CategoricalValue differently from other types. TBH I don't really understand why: why not accept any value which is not a Number and treat it as categorical? That could include strings in particular.

juliohm commented 4 years ago

The categorical case is a specific case as opposed to a general case, @nalimilan. I was wondering why a categorical value could not be seen as a number, in the sense that it is usually used as one. The PR is fine as is with this Union{CategoricalValue,Number} alias, but it seems unnecessary.

On a more general view of the issue, I think we need to come up with a data-independent way of expressing methods. I learned about DataAPI.jl from @joshday, for example, but I didn't quite get how the package can help with this. I am personally starting to think that ScientificTypes.jl is a possible solution, where we describe methods with their scientific semantics as opposed to their machine types. However, to do this we need a set of traits to dispatch on. For example, in this case I think a trait like "numerical scalar" could be implemented as

isnum(x) = scitype(x) <: Union{Finite,Infinite}

where scitype, Finite, and Infinite come from ScientificTypes.jl.
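As a quick illustration of how such a trait would behave (a sketch, assuming ScientificTypes.jl's scitype definitions for plain Julia values):

isnum(1.0)   # true:  scitype(1.0) == Continuous <: Infinite
isnum(3)     # true:  scitype(3) == Count <: Infinite
isnum("a")   # false: scitype("a") == Textual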

nalimilan commented 4 years ago

CategoricalValue is not usually used as a number; rather, it is used like a string (but with a custom ordering of levels).

juliohm commented 4 years ago

I tried removing the ::Number type annotation from the value(loss::DistanceLoss, target, output) method, but then we have an ambiguity with the array methods value(loss::SupervisedLoss, targets::Array, outputs::Array). I will proceed and add a few tests for the current implementation with explicit CategoricalValue for now.

In the next recycling of LearnBase.jl + LossFunctions.jl, I will consider this data dependency issue more seriously. We need to open an issue on GitHub to track it more closely.

joshday commented 4 years ago

My apologies! I was confused about how DataAPI works. A few questions though:

  1. With this PR, is it the case that you can calculate a MisclassLoss on a CategoricalArray but not a Vector{String}? That's a red flag to me that this isn't the right solution.

  2. Why not generalize ZeroOneLoss directly and remove MisclassLoss?

I tried removing the ::Number type annotation from the value(loss::DistanceLoss, target, output) method, but then we have an ambiguity with the array methods value(loss::SupervisedLoss, targets::Array, outputs::Array)

  3. MisclassLoss isn't a DistanceLoss. What about changing

value(loss::MisclassLoss, target::CV, output::CV) to value(loss::MisclassLoss, target, output)

and then add the necessary methods for

value(loss::MisclassLoss, target::Array, output::Array)

?

codecov-io commented 4 years ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (master@a133857). The diff coverage is n/a.


@@            Coverage Diff            @@
##             master     #132   +/-   ##
=========================================
  Coverage          ?   93.88%           
=========================================
  Files             ?       11           
  Lines             ?      572           
  Branches          ?        0           
=========================================
  Hits              ?      537           
  Misses            ?       35           
  Partials          ?        0           


juliohm commented 4 years ago

  1. With this PR, is it the case that you can calculate a MisclassLoss on a CategoricalArray but not a Vector{String}? That's a red flag to me that this isn't the right solution.

Yes, this is correct. From what I understand, categorical values a la CategoricalArrays.jl need to be explicit in user code. So if I have a vector of strings, I need to call categorical(v) to turn these generic strings into categorical values. An array of CategoricalValue{String} therefore makes sense, which is the CategoricalArray type supported here, but a general vector of strings does not carry the same semantics.
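For instance (a small sketch; the exact eltype parameters depend on the CategoricalArrays.jl version):

using CategoricalArrays

v = ["dog", "cat", "dog"]
cv = categorical(v)   # CategoricalArray with levels ["cat", "dog"]
eltype(cv)            # CategoricalValue{String, UInt32}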

  2. Why not generalize ZeroOneLoss directly and remove MisclassLoss?

Initially I tried to do this in a separate PR, and realized that they are not the same, in the sense that ZeroOneLoss is a margin-based loss and MisclassLoss is not. The "agreement" or representing function in the zero-one loss is y*yhat, whereas in the misclassification loss it is y == yhat.
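To make the distinction concrete, here is a sketch of the two representing functions (zeroone and misclass are illustrative names, not the package API):

# margin-based: evaluated on the agreement y*ŷ, requires y ∈ {-1, +1}
zeroone(y::Real, ŷ::Real) = y * ŷ < 0 ? 1 : 0

# misclassification: evaluated on equality, works for arbitrary labels
misclass(y, ŷ) = y == ŷ ? 0 : 1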

  3. MisclassLoss isn't a DistanceLoss.

Yes, but it is a SupervisedLoss, and so we get ambiguity between

value(::SupervisedLoss, ::Array, ::Array)

and

value(::MisclassLoss, ::Any, ::Any)

And just to clarify, this happens for any subtype of SupervisedLoss if we omit the ::Number constraint. I wonder if there is an elegant way out of it. Ideally, we would just omit ::Number and have everything working. I think we cannot solve this properly without traits currently.
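For the record, a minimal reproduction of this kind of ambiguity with stand-in types (not the actual package code):

abstract type SupervisedLoss end
struct MisclassLoss <: SupervisedLoss end

value(loss::SupervisedLoss, targets::Array, outputs::Array) = map(==, targets, outputs)
value(loss::MisclassLoss, target, output) = target == output

# value(MisclassLoss(), [1, 2], [1, 3]) raises an ambiguity MethodError:
# the first method is more specific in the array arguments, the second
# in the loss argument, so neither dominates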

joshday commented 4 years ago

So if I have a vector of strings, I need to call categorical(v) to turn these generic strings into categorical values

Why do we need to force someone to do that who already has their v?

I see you're still adding commits, but I'll say I don't think adding CategoricalArrays as a dependency is acceptable for this. What about PooledArrays? What about Vector{String}?

juliohm commented 4 years ago

Why do we need to force someone to do that who already has their v?

I think that is the standard in other parts of the ecosystem. For example, DataFrames.jl expects users to explicitly convert the columns with categorical (https://juliadata.github.io/DataFrames.jl/stable/man/categorical). MLJ.jl also expects users to be explicit about categorical values. If you pass a vector of strings to a function that expects categorical values, these packages will throw an error.

I see you're still adding commits, but I'll say I don't think adding CategoricalArrays as a dependency is acceptable for this. What about PooledArrays? What about Vector{String}?

I am also not happy with the fact that we need to add CategoricalArrays.jl as a dependency here. I'd like to think of this PR as an intermediate solution while we keep refactoring the whole LearnBase.jl + LossFunctions.jl experience. I think many things can improve in the future, but I don't see a way out of this dependency currently without the traits in place, and other major changes in the codebase. My idea was to merge this PR as is, to at least have something working temporarily with categorical predictions produced by learning models. Afterwards, I plan to come back to LearnBase.jl again to incorporate those suggestions by @CarloLucibello, and then come back here again to fix other major issues. There are too many things to improve, but I think we need to go step-by-step.

Do you think it is ok to merge this PR knowing this dependency on CategoricalArrays.jl can go away in the future? I am fully aligned with your vision @joshday, but refactoring the whole codebase again is not something quick to do, or something that I can afford at the moment given other deadlines on my shoulders.

nalimilan commented 4 years ago

I think that is the standard in other parts of the ecosystem. For example, DataFrames.jl expects users to explicitly convert the columns with categorical (https://juliadata.github.io/DataFrames.jl/stable/man/categorical).

We never said that. You can just convert string columns to categorical if you want, mainly for two things: 1) using a custom order of levels, and 2) improving the performance of some operations like grouping. But StatsModels, for example, handles string columns just like categorical columns. You just need to call levels on the vector, which gives you the levels either in their custom order (for CategoricalArray) or in lexicographic order (for other values).
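For instance (a sketch; levels is defined in DataAPI.jl and extended by CategoricalArrays.jl):

using CategoricalArrays

v = ["low", "high", "low"]
levels(v)                     # ["high", "low"]: lexicographic order

cv = categorical(v)
levels!(cv, ["low", "high"])  # set a custom order of levels
levels(cv)                    # ["low", "high"]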

juliohm commented 4 years ago

I think we have many steps until we get a version of LossFunctions.jl that is ideal. Supporting other types of "vectors of strings" is a reasonable goal, but it goes beyond this PR in my opinion. Not only do we need to address a more generic, data-independent mechanism to define these methods (without depending on CategoricalArrays.jl, PooledArrays.jl, FooBarArrays.jl), we also need to rethink the current support for general N-dimensional arrays of 0s and 1s, which assumes a fixed encoding of categorical values, as far as I understand.

By a similar argument, the dependency on SparseArrays.jl could be dropped with a more general design where losses are defined between vectors of objects satisfying some traits. Currently, we have various copies of the implementations for various combinations of argument types, because of this encoding decision and because of the ObsDim concept. I plan to address all these issues in future PRs with your feedback, but that is a long-term goal.

If you are ok with this plan, I can go ahead and:

  1. Merge this PR and tag a minor release with an updated version of the code.
  2. Start a new PR in LearnBase.jl to simplify the AggMode and ObsDim submodules (basically deprecate them).
  3. Refactor LossFunctions.jl again to reduce the number of implemented methods, and use keyword arguments instead. In particular, use dims=1 as opposed to dispatch with ObsDim.First(), and use agg=mean as opposed to dispatch with AggMode.Mean() (see the sketch after this list). At the same time, this refactoring will be useful to clarify the need for supporting N-dimensional arrays. Maybe we can live with only 1D arrays of general objects like categorical values, and let users choose their desired encoding.
  4. After item 3) is reviewed and merged, we will finally be able to reconsider the dispatch rules (which will be minimal at this point) and the explicit dependencies on CategoricalArrays.jl.

Step 1) is what is currently blocking my research. Steps 2-4) are what I want to work on with more time, without these upcoming deadlines.
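A rough sketch of the keyword-based interface proposed in step 3 (L2Dist, agg, and the method shapes are illustrative, not a settled design):

using Statistics: mean

# illustrative loss with a scalar method; the ::Number annotation avoids
# the scalar/array ambiguity discussed earlier
struct L2Dist end
value(::L2Dist, y::Number, ŷ::Number) = abs2(ŷ - y)

# single vectorized entry point with keyword aggregation instead of
# AggMode dispatch
function value(loss, targets::AbstractVector, outputs::AbstractVector; agg=identity)
    agg(map((y, ŷ) -> value(loss, y, ŷ), targets, outputs))
end

value(L2Dist(), [1, 2], [1, 3])             # [0, 1]
value(L2Dist(), [1, 2], [1, 3], agg=mean)   # 0.5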

joshday commented 4 years ago

I am also not happy with the fact that we need to add CategoricalArrays.jl as a dependency here.

I just don't get the motivation for adding it. Why go through multiple breaking changes (adding and then removing a dep) on a repo that usually doesn't have a lot of activity, just so you have a more limited version of mean(x .== y) as part of the interface?

I think we have many steps until we get a version of LossFunctions.jl that is ideal.

I think the concerns I've seen raised about LossFunctions could be handled in a single PR (basically your step 3). "Many steps" is tiring for maintainers and I don't think it's necessary.

I'm glad that you're willing to put some work in to clean up JuliaML packages, but please find a workaround for your current research.

juliohm commented 4 years ago

Thank you @joshday for the feedback. Is there perhaps a solution in between, where this PR is adapted to drop the dependency on CategoricalArrays.jl? I think this is your major concern about this PR, right? I just don't see how we can drop the dependency on CategoricalArrays.jl without traits. If you have suggestions on how to do this, we could keep working here as opposed to starting it all over.

Regarding the step (3) you mentioned, I also don't see how it addresses the dependency issue on CategoricalArrays.jl. Could you please clarify? I see how it addresses other issues in the interface, but I can't see how these changes solve the dependency on data types.

Why go through multiple breaking changes (adding and then removing a dep) on a repo that usually doesn't have a lot of activity, just so you have a more limited version of mean(x .== y) as part of the interface?

The == here is a very powerful operation. It opens the door to many other types that are not just numbers. This is what is needed for multiclass deterministic losses, and later it is also what is needed for probabilistic functional losses.

joshday commented 4 years ago

If you have suggestions on how to do this, we could keep working here as opposed to starting it all over.

Rely on multiple dispatch:

# scalar methods: no ::Number constraint, so any comparable targets work
value(loss::MisclassLoss, target, output) = target == output
deriv(loss::MisclassLoss, target, output) = 0
deriv2(loss::MisclassLoss, target, output) = 0

# array methods default to elementwise (no) aggregation
for f in [:value, :deriv, :deriv2]
    @eval function $f(loss::MisclassLoss, targets::AbstractArray, outputs::AbstractArray)
        $f(loss, targets, outputs, AggMode.None())
    end
end

# aggregation-specific methods
value(loss::MisclassLoss, target::AbstractArray, output::AbstractArray, ::AggMode.None) = target .== output
value(loss::MisclassLoss, target::AbstractArray, output::AbstractArray, ::AggMode.Sum) = sum(target .== output)
value(loss::MisclassLoss, target::AbstractArray, output::AbstractArray, ::AggMode.Mean) = mean(target .== output)

(and add the methods I've missed)
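With these methods in place, usage could look like this (hypothetical calls; per the sketch above, value reports elementwise agreement):

value(MisclassLoss(), "cat", "cat")                            # true
value(MisclassLoss(), ["a", "b"], ["a", "c"], AggMode.Mean())  # 0.5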

juliohm commented 4 years ago

Got it. My previous attempt consisted of removing the ::Number constraint from the general value(::SupervisedLoss, target, output) method for all supervised losses, and that created the ambiguity. We can work around the ambiguity by redefining all methods just for MisclassLoss. I will push the proposed changes by the end of the day.

juliohm commented 4 years ago

@joshday I've removed the dependency on CategoricalArrays.jl. I'd appreciate it if you could take a look.