JuliaAI / ScientificTypes.jl

An API for dispatching on the "scientific" type of data instead of the machine type
MIT License

Separate `interval` and `ratio` types #194

Open ParadaCarleton opened 8 months ago

ParadaCarleton commented 8 months ago

I’ve noticed there’s no way to tell the difference between interval and ratio scales ATM. They’re both Continuous right now, but they’re not quite the same; ratio scales have a true zero value (e.g. Kelvin), while interval scales don’t (e.g. Fahrenheit or Celsius). This makes a big difference in some stats analyses; for example, you can say something is “twice as much” with ratio scales, but not with interval scales. These scales are useful since they let us throw errors when users perform invalid operations on interval scales (like taking logarithms or using MAPE); in addition, we can warn users when they make questionable decisions (like trying to do a linear regression with a ratio outcome, without taking the logarithm first).
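To make the "twice as much" point concrete, here is a small worked example. It is only an illustration: the helper `celsius_to_kelvin` is a name made up for this sketch, not something from ScientificTypes.jl.

```julia
# Hypothetical helper for this illustration: convert Celsius to Kelvin.
celsius_to_kelvin(c) = c + 273.15

a_c, b_c = 10.0, 20.0                      # two temperatures in Celsius (interval scale)
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

ratio_celsius = b_c / a_c                  # 2.0, but only because of where Celsius puts its zero
ratio_kelvin  = b_k / a_k                  # roughly 1.035 on the scale with a true zero

# Differences, by contrast, agree on both scales (up to floating-point rounding).
diff_celsius = b_c - a_c
diff_kelvin  = b_k - a_k
```

So 20 °C is not "twice as hot" as 10 °C, even though the Celsius numbers have a ratio of 2; only the difference of 10 degrees carries over between the two scales.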

A good heuristic for ratio types is all-positive values.
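A minimal sketch of that heuristic, assuming the data arrives as a plain vector that may contain `missing`. The function name `looks_like_ratio` is hypothetical and is not part of ScientificTypes.jl.

```julia
# Hypothetical helper, not part of any existing package.
# Returns true when there is at least one non-missing value
# and every non-missing value is strictly positive.
function looks_like_ratio(v)
    vals = collect(skipmissing(v))
    return !isempty(vals) && all(>(0), vals)
end

looks_like_ratio([1.5, 2.0, 10.0])       # true
looks_like_ratio([-3.0, 0.5, 2.0])       # false
looks_like_ratio([12.0, missing, 3.0])   # true
```

Being a heuristic, this can misfire: a column of summer temperatures in Fahrenheit is all-positive but still interval-scaled.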

What do you think about adding new types to make this distinction possible?

(cc @juliohm for the same issue in DataScienceTraits.jl)

juliohm commented 8 months ago

Maybe you want to attach units to columns and check if the unit is AffineUnit. DataScienceTraits.jl will handle units from Unitful.jl and DynamicQuantities.jl gracefully.

On Wed, Nov 1, 2023 at 20:50, Carlos Parada wrote:

> I’ve noticed there’s no way to tell the difference between interval and ratio scales ATM. [...]

ParadaCarleton commented 8 months ago

I think that's reasonable for many cases, but it runs into two problems:

  1. Users not using units (since they're not required), and
  2. Sometimes, the same nominal units can land on either an interval or a ratio scale depending on the context. For instance, prices and net worth can both be denominated in dollars, but prices are on a ratio scale, while net worth is interval-scaled.

ablaom commented 8 months ago

Interesting suggestion, @ParadaCarleton, thank you.

I see the use-case, but what is missing from the proposal, as far as a ScientificTypes.jl solution is concerned, is which objects should be regarded as Ratio and which as Interval. That is, we can add new types, but we also need to overload scitype(x) and for this we need more than a heuristic. I agree that a units-based solution is unsatisfactory. Ideally, the distinction should depend only on typeof(x). (We already have to work quite hard to efficiently handle the current distinction between OrderedFactor and Multiclass, because CategoricalArray does not have ordered as a type parameter, i.e. it can vary between objects of the same type.) I don't see an obvious choice, nor any that won't be massively breaking.
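For illustration only, here is one way the scale could be decided from the type alone: a wrapper that carries the scale as a type parameter. Every name below (`SketchContinuous`, `SketchInterval`, `SketchRatio`, `Scaled`, `sketch_scitype`) is hypothetical, and this is not how ScientificTypes.jl is implemented; it is a self-contained sketch of the idea.

```julia
# Hypothetical scientific types; names are for illustration only.
abstract type SketchContinuous end
abstract type SketchInterval <: SketchContinuous end
abstract type SketchRatio <: SketchContinuous end

# Wrapper that records the measurement scale S as a type parameter.
struct Scaled{S<:SketchContinuous,T<:Real}
    value::T
end

# Convenience constructor: the caller names the scale, T is inferred.
Scaled{S}(x::T) where {S<:SketchContinuous,T<:Real} = Scaled{S,T}(x)

# The scale is recovered from the type alone, without inspecting any values.
sketch_scitype(::Scaled{S}) where {S<:SketchContinuous} = S
sketch_scitype(::AbstractVector{<:Scaled{S}}) where {S<:SketchContinuous} = AbstractVector{S}

price     = Scaled{SketchRatio}(19.99)       # ratio scale: zero means "free"
net_worth = Scaled{SketchInterval}(-2500.0)  # treated as interval-scaled, as in the example above

sketch_scitype(price)           # SketchRatio
sketch_scitype(net_worth)       # SketchInterval
sketch_scitype([price, price])  # AbstractVector{SketchRatio}
```

The trade-off is the one already raised: users have to wrap their data explicitly, so plain `Float64` columns would still need some default.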

> These scales are useful since they let us throw errors when users perform invalid operations on interval scales (like taking logarithms or using MAPE); in addition, we can warn users when they make questionable decisions (like trying to do a linear regression with a ratio outcome, without taking the logarithm first).

Indeed a desire to embed these kinds of assurances in MLJ was part of the original motivation for scitypes. I have to admit, however, this turned out to be a lot more ambitious than I first thought. There is always this tension between telling the user "you shouldn't do that" and a desire to write generic code that can be used later in ways that you could not anticipate. And the extra complexity means adding and perfecting all those checks burns a lot of dev resources.

ParadaCarleton commented 8 months ago

> There is always this tension between telling the user "you shouldn't do that" and a desire to write generic code that can be used later in ways that you could not anticipate.

Probably the easiest way to work around this is by warning, rather than erroring, in these situations.
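A rough sketch of what warn-instead-of-error could look like, assuming the scale is available as a tag value. The types `IntervalScale` and `RatioScale` and the function `checked_log` are invented for this example and do not exist in ScientificTypes.jl or MLJ.

```julia
# Hypothetical scale tags and helper; none of these names come from an existing package.
abstract type SketchScale end
struct IntervalScale <: SketchScale end
struct RatioScale <: SketchScale end

# No warning on a ratio scale: just take elementwise logs.
checked_log(::RatioScale, v) = log.(v)

# On an interval scale, warn but still return a result,
# so generic code keeps working and the caller decides what to do.
function checked_log(::IntervalScale, v)
    @warn "Taking logarithms of interval-scaled data; the result depends on the arbitrary zero point."
    return log.(v)
end

checked_log(RatioScale(), [1.0, 10.0])     # no warning
checked_log(IntervalScale(), [1.0, 10.0])  # same numbers, plus a warning
```

Since the warning goes through Julia's standard logging macros, callers who know what they are doing can silence it with the usual logging tools rather than having to work around a thrown error.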

> That is, we can add new types, but we also need to overload scitype(x) and for this we need more than a heuristic.

Sounds like we'd need something like this (so we can just pay the cost once when we load the information, then just look up the types).