Closed juliohm closed 4 years ago
Is there a way to make this work without adding the 5 (direct and indirect) dependencies?
I think just using DataAPI would add support for CategoricalArrays/PooledArrays, etc.
@juliohm. Sorry to intrude. You may want to remove LossFunctions as a dependency in the docs/Project.toml file. That should be what's causing the failure. Also you may want to consider adding docs/Manifest.toml to the .gitignore file.
Is there a way to make this work without adding the 5 (direct and indirect) dependencies?
I also considered other options, including this issue that I opened in CategoricalArrays.jl to see if a CategoricalValue could simply be a subtype of Number: https://github.com/JuliaData/CategoricalArrays.jl/issues/268 I am divided on the issue. What is your point of view, @joshday? Should we push forward the idea of CategoricalValue <: Number, or rely on scientific types and traits instead? This second option could be addressed in a separate PR.
I think just using DataAPI would add support for CategoricalArrays/PooledArrays, etc.
Nice. I was unaware of the DataAPI.jl effort :+1: Unfortunately it doesn't have the CategoricalValue name defined there, but we could open an issue. I am just trying to brainstorm more carefully about the best path forward. Part of me thinks that most scientific methods out there actually discuss scientific types as opposed to machine types or user types like CategoricalValue, Float64, etc. We could perhaps consider the ScientificTypes.jl package + traits.
Meanwhile, I will finish this PR with some preliminary implementation, and then we can get back to this more profound issue on "data type dependencies" in the Julia ecosystem. It is a big issue. Thanks for pointing it out.
@juliohm. Sorry to intrude. You may want to remove LossFunctions as a dependency in the docs/Project.toml file. That should be what's causing the failure. Also you may want to consider adding docs/Manifest.toml to the .gitignore file.
Thank you @OkonSamuel , I will fix the docs. For some reason I've committed some unnecessary changes to the docs Project.toml + Manifest.toml files.
Unfortunately it doesn't have the CategoricalValue name defined there
Take a look at the refvalue function. I think that covers your use case, but it looks like CategoricalArrays doesn't implement it (yet).
Take a look at the refvalue function. I think that covers your use case, but it looks like CategoricalArrays doesn't implement it (yet).
Hi @joshday, could you please clarify the usage of refvalue? I tried reading the docs, but it isn't very clear to me how we could use it in this case.
You may want to remove LossFunctions as a dependency in the docs/Project.toml file
@OkonSamuel could you please clarify what is happening? The docs build successfully in my local machine. Also I have the same setup in GeoStats.jl docs for example and never had an issue.
@OkonSamuel could you please clarify what is happening? The docs build successfully in my local machine. Also I have the same setup in GeoStats.jl docs for example and never had an issue.
@juliohm The package name is already an implicit dependency and there is no need to add it. Also, the main cause of the doc build error is in the docs/Manifest.toml file. The Julia registry doesn't recognize version 0.5.2 with the given UUID, causing an error. You can confirm this by trying to get v0.5.2 of the LossFunctions package on your system. PS: That was why I recommended removing the docs/Manifest.toml file.
Thank you @OkonSamuel, I get the issue now. The incorrect UUID I committed some time ago is still causing issues. I will delete the Manifest.toml and commit a new one. It is usually a good idea to commit the Manifest.toml in the docs, and update it every now and then. That is what I've learned from other project maintainers.
https://github.com/JuliaML/LossFunctions.jl/pull/130#issuecomment-614005327
The documentation failure can be fixed by removing docs/Manifest.toml; it locks a different UUID of LossFunctions (ref: ad125ad)
Checking in a docs/Manifest.toml had advantages in the pre-1.0 epoch by locking dependency versions. Nowadays, since we have better tools (e.g., Pkg, CompatHelper, Registrator, Documenter) to maintain packages and people treat semantic versions seriously, it's okay to only keep a docs/Project.toml; it exposes issues as early as possible, and a broken dev docs build could be acceptable.
Instead of checking in docs/Manifest.toml, maintaining a compat section for docs/Project.toml is a better practice IMO.
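As an illustration, such a compat section could look like the following (the version bounds here are only examples, not the actual requirements of this repository):

```toml
# docs/Project.toml -- illustrative [compat] bounds only;
# adjust to the versions this package actually supports
[compat]
Documenter = "0.24"
LossFunctions = "0.6"
```

With compat bounds in place, Pkg resolves a fresh, compatible Manifest on every docs build instead of replaying a stale checked-in one.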
You may want to remove LossFunctions as a dependency in docs/Project.toml file.
@OkonSamuel It's okay to still keep LossFunctions since it helps to build the docs on a local machine. For CI purposes, the following command in the travis config helps checkout LossFunctions to the dev version. It's very flexible as long as we don't check in the docs/Manifest.toml.
- julia --project=docs/ -e 'using Pkg; Pkg.add(PackageSpec(path=pwd()))'
I don't think refvalue would help as it's mainly useful for performance. Here what @juliohm seems to be looking for is a way to treat CategoricalValue differently from other types. TBH I don't really understand why: why not accept any value which is not a Number and treat it as categorical? That could include strings in particular.
The categorical case is a specific case as opposed to a general case, @nalimilan. I was wondering why a categorical value could not be seen as a number, in the sense that it is usually used as one. The PR is fine as is with this Union{CategoricalValue,Number} alias, but it seems unnecessary.
On a more general view of the issue, I think we need to come up with a data-independent way of expressing methods. I learned about DataAPI.jl from @joshday, for example, but I didn't quite get how the package can help with this. I am personally starting to think that ScientificTypes.jl is a possible solution where we describe methods with their scientific semantics as opposed to their machine types. However, to do this we need a set of traits to dispatch on. For example, in this case I think a trait like "numerical scalar" could be implemented as
isnum(x) = x <: Union{Finite{N} where N, Infinite}
where Finite and Infinite come from ScientificTypes.jl.
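To make the trait idea concrete, here is a minimal, self-contained sketch of trait-based dispatch in plain Julia. The names (numberlike, IsNumberLike, lossvalue, etc.) are illustrative only and are not part of ScientificTypes.jl or LossFunctions.jl:

```julia
# Illustrative trait sketch: route a loss computation on a "number-like"
# trait instead of constraining arguments with ::Number directly.
abstract type NumberLikeTrait end
struct IsNumberLike  <: NumberLikeTrait end
struct NotNumberLike <: NumberLikeTrait end

# Trait function: Numbers are number-like, everything else is not.
numberlike(::Type{<:Number}) = IsNumberLike()
numberlike(::Type)           = NotNumberLike()

# Entry point dispatches on the trait of the argument type.
lossvalue(target::T, output::T) where {T} =
    lossvalue(numberlike(T), target, output)

# Distance-style loss for number-like values.
lossvalue(::IsNumberLike, target, output) = abs(target - output)

# Misclassification-style loss for everything else (strings, categories, ...).
lossvalue(::NotNumberLike, target, output) = Float64(target != output)
```

Here lossvalue(1.0, 3.0) takes the distance branch, while lossvalue("cat", "dog") takes the misclassification branch, without any ::Number annotation in the entry point.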
CategoricalValue
is not usually used as a number, rather like a string (but with custom ordering of levels).
I tried removing the ::Number
type annotation from the value(loss::DistanceLoss, target, output)
method, but then we have an ambiguity with the array methods value(loss::SupervisedLoss, targets::Array, outputs::Array)
. I will proceed and add a few tests for the current implementation with explicit CategoricalValue
for now.
In the next recycling of LearnBase.jl + LossFunctions.jl, I will consider this data dependency issue more seriously. We need to open an issue on GitHub to track it more closely.
My apologies! I was confused about how DataAPI works. A few questions though:
With this PR, is it the case that you can calculate a MisclassLoss on a CategoricalArray but not a Vector{String}? That's a red flag to me that this isn't the right solution.
Why not generalize ZeroOneLoss directly and remove MisclassLoss?
I tried removing the ::Number type annotation from the value(loss::DistanceLoss, target, output) method, but then we have an ambiguity with the array methods value(loss::SupervisedLoss, targets::Array, outputs::Array)
MisclassLoss isn't a DistanceLoss. What about changing value(loss::MisclassLoss, target::CV, output::CV) to value(loss::MisclassLoss, target, output) and then add the necessary methods for value(loss::MisclassLoss, target::Array, output::Array)?
:exclamation: No coverage uploaded for pull request base (master@a133857). Click here to learn what that means. The diff coverage is n/a.
@@ Coverage Diff @@
## master #132 +/- ##
=========================================
Coverage ? 93.88%
=========================================
Files ? 11
Lines ? 572
Branches ? 0
=========================================
Hits ? 537
Misses ? 35
Partials ? 0
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a133857...12fc82a. Read the comment docs.
- With this PR, is it the case that you can calculate a MisclassLoss on a CategoricalArray but not a Vector{String}? That's a red flag to me that this isn't the right solution.
Yes, this is correct. From what I understand, categorical values a la CategoricalArrays.jl need to be explicit in user code. So if I have a vector of strings, I need categorical(v) to turn these generic strings into categorical values. So an array of CategoricalValue{String} makes sense, which is the CategoricalArray type supported, but a general vector with strings does not carry the same semantics.
- Why not generalize ZeroOneLoss directly and remove MisclassLoss?
Initially I tried to do this in a separate PR, and realized that these are not the same in the sense that ZeroOneLoss is a margin-based loss and MisclassLoss is not. The "agreement" or representing function in zero-one loss is y*yhat whereas in the misclassification loss, it is y == yhat.
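A small sketch of the two "agreement" notions (the function names are illustrative, and the margin-based case assumes targets encoded as +1/-1):

```julia
# Margin-based agreement (ZeroOneLoss): sign of the product of a ±1-encoded
# target and the prediction; a negative product means misclassified.
margin_agreement(y, yhat) = y * yhat

# Equality-based agreement (MisclassLoss): works for arbitrary labels,
# including strings and categorical values, with no numeric encoding.
misclass_agreement(y, yhat) = y == yhat

margin_agreement(1, -1)           # negative product: misclassified
misclass_agreement("cat", "dog")  # false: misclassified
```

The second form is what opens the door to non-numeric label types, which is the whole point of this PR.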
- MisclassLoss isn't a DistanceLoss.
Yes, but it is a SupervisedLoss, and so we get ambiguity between value(::SupervisedLoss, ::Array, ::Array) and value(::MisclassLoss, ::Any, ::Any).
And just to clarify, this is for any subtype of SupervisedLoss if we omit the ::Number constraint. I wonder if there is an elegant way out of it. Ideally we would just omit ::Number and have everything working. I think we cannot solve this properly without traits currently.
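For the record, the ambiguity can be reproduced with a minimal, self-contained example (hypothetical stripped-down types, not the actual LossFunctions.jl definitions):

```julia
# Minimal reproduction of the dispatch ambiguity hit when the ::Number
# constraint is dropped from the scalar method.
abstract type SupervisedLoss end
struct MisclassLoss <: SupervisedLoss end

# Generic array method for all supervised losses.
value(loss::SupervisedLoss, targets::Array, outputs::Array) =
    value.(Ref(loss), targets, outputs)

# Unconstrained scalar method for MisclassLoss.
value(loss::MisclassLoss, target, output) = target == output

# value(MisclassLoss(), [1, 2], [1, 3]) now throws a MethodError: the first
# method is more specific in the array arguments, the second in the loss
# argument, so neither dominates the other.
```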
So if I have a vector of strings, I need categorical(v) to turn these generic strings into categorical values
Why do we need to force someone to do that who already has their v?
I see you're still adding commits, but I'll say I don't think adding CategoricalArrays as a dependency is acceptable for this. What about PooledArrays? What about Vector{String}?
Why do we need to force someone to do that who already has their v?
I think that is the standard in other parts of the ecosystem. For example, DataFrames.jl expects users to explicitly convert the columns with categorical (https://juliadata.github.io/DataFrames.jl/stable/man/categorical). MLJ.jl also expects users to be explicit about categorical values. If you pass a vector of strings to a function that expects categorical values, these packages will throw an error.
I see you're still adding commits, but I'll say I don't think adding CategoricalArrays as a dependency is acceptable for this. What about PooledArrays? What about Vector{String}?
I am also not happy with the fact that we need to add CategoricalArrays.jl as a dependency here. I'd like to think of this PR as an intermediate solution before we can keep refactoring the whole LearnBase.jl + LossFunctions.jl experience. I think many things can improve in the future, but I don't see a way out of this dependency currently without the traits in place, and other major changes in the codebase. My idea was to merge this PR as is, to at least have something working temporarily with categorical predictions produced by learning models. Afterwards, I plan to come back to LearnBase.jl again to incorporate those suggestions by @CarloLucibello , and then come back here again to fix other major issues. There are too many things to improve, but I think we need to go step-by-step.
Do you think it is ok to merge this PR knowing this dependency on CategoricalArrays.jl can go away in the future? I am fully aligned with your vision @joshday, but refactoring the whole codebase again is not something quick to do, or something that I can afford at the moment given other deadlines on my shoulders.
I think that is the standard in other parts of the ecosystem. For example, DataFrames.jl expects users to explicitly convert the columns with categorical (https://juliadata.github.io/DataFrames.jl/stable/man/categorical).
We never said that. You can just convert string columns to categorical if you want, mainly for two things: 1) use a custom order of levels, 2) improve performance of some operations like grouping. But e.g. StatsModels handles string columns just like categorical columns. You just need to call levels on the vector, which will give you levels either in their custom order (for CategoricalArray) or in lexicographic order (for other values).
I think we have many steps until we get a version of LossFunctions.jl that is ideal. Supporting other types of "vectors of strings" is a reasonable goal, but it goes beyond this PR in my opinion. Not only do we need to address a more generic data-independent mechanism to define these methods (without depending on CategoricalArrays.jl, PooledArrays.jl, FooBarArrays.jl), we also need to brainstorm the current support for general N-dimensional arrays of 0s and 1s, which assumes a fixed encoding of categorical values, as far as I understand.
On a similar argument, the dependency on SparseArrays.jl could be dropped with a more general design where losses are defined between vectors of objects satisfying some traits. Currently we have various copies of the implementations for various combinations of argument types because of this encoding decision and because of the ObsDim concept. I plan to address all these issues in future PRs with your feedback, but that is a long-term goal.
If you are ok with this plan, I can go ahead and use dims=1 as opposed to dispatch with ObsDim.First(), and use agg=mean as opposed to dispatch with AggMode.Mean(). At the same time, this refactoring will be useful to clarify the need for supporting N-dimensional arrays. Maybe we can live with only 1D arrays of general objects like categorical values, and let users choose their desired encoding. The step 1) is what is currently blocking my research. The steps 2-4) are what I want to work on with more time, and without these upcoming deadlines.
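A rough sketch of what the keyword style could look like (the name loss_value and its signature are hypothetical, not the actual LossFunctions.jl API):

```julia
using Statistics: mean

# Hypothetical keyword-based entry point: the aggregation is a plain
# function passed as `agg`, instead of an AggMode dispatch type.
loss_value(f, targets, outputs; agg=identity) = agg(map(f, targets, outputs))

# Example loss: squared distance between target and output.
sqdist(t, o) = (t - o)^2

loss_value(sqdist, [1, 2], [1, 4])            # elementwise: [0, 4]
loss_value(sqdist, [1, 2], [1, 4]; agg=mean)  # aggregated: 2.0
loss_value(sqdist, [1, 2], [1, 4]; agg=sum)   # aggregated: 4
```

One method with a keyword replaces a family of AggMode methods, which is exactly the reduction in duplicated implementations mentioned above.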
I am also not happy with the fact that we need to add CategoricalArrays.jl as a dependency here.
I just don't get the motivation for adding it. Why go through multiple breaking changes (adding and then removing a dep) on a repo that usually doesn't have a lot of activity, just so you have a more limited version of mean(x .== y) as part of the interface?
I think we have many steps until we get a version of LossFunctions.jl that is ideal.
I think the concerns I've seen raised about LossFunctions could be handled in a single PR (basically your step 3). "Many steps" is tiring for maintainers and I don't think it's necessary.
I'm glad that you're willing to put some work in to clean up JuliaML packages, but please find a workaround for your current research.
Thank you @joshday for the feedback. Is there a solution in between maybe where this PR is adapted to drop the dependency on CategoricalArrays.jl? I think this is your major concern about this PR, right? I just don't see how we can drop the dependency on CategoricalArrays.jl without traits. If you have suggestions on how to do this, we could keep working here as opposed to starting it all over.
Regarding the step (3) you mentioned, I also don't see how it addresses the dependency issue on CategoricalArrays.jl. Could you please clarify? I see how it addresses other issues in the interface, but I can't see how these changes solve the dependency on data types.
Why go through multiple breaking changes (adding and then removing a dep) on a repo that usually doesn't have a lot of activity, just so you have a more limited version of mean(x .== y) as part of the interface?
The == here is a very powerful operation. It opens the door to many other types that are not just numbers. This is what is needed for multiclass deterministic losses, and later it is also what is needed for probabilistic functional losses.
If you have suggestions on how to do this, we could keep working here as opposed to starting it all over.
Rely on multiple dispatch:
value(loss::MisclassLoss, target, output) = target == output
deriv(loss::MisclassLoss, target, output) = 0
deriv2(loss::MisclassLoss, target, output) = 0
for f in [:value, :deriv, :deriv2]
@eval function $f(loss::MisclassLoss, targets::AbstractArray, outputs::AbstractArray)
$f(loss, targets, outputs, AggMode.None())
end
end
value(loss::MisclassLoss, target::AbstractArray, output::AbstractArray, ::AggMode.None) = target .== output
value(loss::MisclassLoss, target::AbstractArray, output::AbstractArray, ::AggMode.Sum) = sum(target .== output)
value(loss::MisclassLoss, target::AbstractArray, output::AbstractArray, ::AggMode.Mean) = mean(target .== output)
(and add the methods I've missed)
Got it. My previous attempt consisted of removing the ::Number constraint from the general value(::SupervisedLoss, target, output) method for all supervised losses, and that created ambiguity. We can work around the ambiguity by redefining all methods just for MisclassLoss. I will push the proposed changes by the end of the day.
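The disambiguation trick can be sketched in isolation (hypothetical minimal types, not the actual package code): adding a method that is most specific in every argument removes the conflict.

```julia
abstract type SupervisedLoss end
struct MisclassLoss <: SupervisedLoss end

# Generic array method for all supervised losses.
value(loss::SupervisedLoss, targets::Array, outputs::Array) =
    sum(value.(Ref(loss), targets, outputs))

# Unconstrained scalar method, valid for strings, categories, etc.
value(loss::MisclassLoss, target, output) = Float64(target != output)

# Tie-breaker: most specific in both the loss and the array arguments,
# so the ambiguity between the two methods above disappears.
value(loss::MisclassLoss, targets::Array, outputs::Array) =
    sum(value.(Ref(loss), targets, outputs))
```

With the tie-breaker in place, value(MisclassLoss(), ["a", "b", "c"], ["a", "x", "c"]) dispatches cleanly and counts one misclassification.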
@joshday I've removed the dependency on CategoricalArrays.jl. Appreciate if you can take a look.
This PR is work-in-progress. It attempts to generalize the functionality of the package to support CategoricalArray and CategoricalValue from CategoricalArrays.jl. Fix #117. I will share a comment here when it is ready for review.