antoninkriz / TimeSeriesClassification.jl

Time Series Classification algorithms in Julia - Bachelor thesis project
MIT License

On the form of input data for the MLJ interface #5

Open ablaom opened 1 year ago

ablaom commented 1 year ago

I'm starting to look over the MLJ interface. A fundamental issue is the current form of input data, which I understand should be in the form (X_train, :column_based) or (X_train, :row_based), where X is what, exactly?

Passing the row/column flag in this way is quite non-standard for MLJ models. In any case, it means the currently declared input_scitype does not match the data requirements.

You can pass metadata like this flag as a third argument, as in machine(model, X, y, flag), but before going that route, can you say more about what data you can accept for X and why you need this distinction? Could we instead deal with the row-versus-column issue by having the user provide a lazy transpose, if her data does not conform to your preferred format?

antoninkriz commented 1 year ago

Good point! This was a hack to keep things simple while developing the model, and I guess it became a feature along the way. The idea is to make sure the features are in rows and the samples are in columns, i.e. a single time series (sample) is one column of the matrix, and the values (measurements) of that time series are the entries of that column.

Since Julia is column-major (for n-dimensional arrays), making sure the input data are column-major as well significantly improves the performance of the algorithms compared to a naive one-to-one re-implementation of the Python versions, Python being row-major. A lazy transpose (transpose(...)) would nullify the performance benefit of having samples in columns.
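To make the layout point concrete, here is a toy sketch (illustrative only, not code from this package): summing a matrix with the inner loop walking down columns touches contiguous memory, while walking along rows strides across it.

function sum_by_columns(X)
    s = zero(eltype(X))
    for j in axes(X, 2), i in axes(X, 1)  # inner loop walks down a column: contiguous
        s += X[i, j]
    end
    return s
end

function sum_by_rows(X)
    s = zero(eltype(X))
    for i in axes(X, 1), j in axes(X, 2)  # inner loop walks along a row: strided
        s += X[i, j]
    end
    return s
end

X = rand(2_000, 2_000)
sum_by_columns(X)  # fast: sequential, cache-friendly access
sum_by_rows(X)     # noticeably slower on large matrices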

Since MLJ assumes that "columns are features by default, i.e. n x p" [src], the idea behind this model is also not easily compatible without running permutedims(...) (MMI.matrix(..., transpose=true)) on all input data. permutedims(...) will always create a copy of all the data, which is an option, but time series datasets can get quite large and it's not uncommon to work with tens of gigabytes of data.
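For reference, the copy-versus-view distinction in a minimal sketch:

X = rand(3, 4)
Xt = transpose(X)    # lazy 4x3 view: no data copied
Xp = permutedims(X)  # materialized 4x3 Matrix: a full copy
Xt[1, 2] = 0.0       # writes through the view to X[2, 1]
X[2, 1] == 0.0       # true: Xt shares memory with X
Xp[1, 2] == X[2, 1]  # false (almost surely): Xp was copied before the write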

I guess there has to be a nicer and cleaner solution to this, but I haven't found it yet. Or is there some solution I'm missing?

Also, thank you very much for such a quick and elaborate response!

ablaom commented 1 year ago

> running permutedims(...) (MMI.matrix(..., transpose=true)) on all input data. permutedims(...) will always create a copy of all the data, which is an option, but time series datasets can get quite large and it's not uncommon to work with tens of gigabytes of data.

Yes, that is why I suggest internally using transpose (or adjoint), because transpose is lazy, i.e., it is a view of the data, not a copy. Yes, if the user provides a regular n x p Matrix, your algorithm will not perform as well. But the user can overcome this by providing the transpose of a regular p x n Matrix. The two transposes will cancel out with no copy (the compiler may even reduce this to a no-op) and the user recovers the performance. Of course, I am assuming your code is generic, in the sense that it operates on AbstractMatrix and is not hard-wired to require <:Matrix.

# This algorithm to reverse each observation vector is generic and supposes observations
# are columns (uncompliant with MLJ convention).
uncompliant_algorithm(X::AbstractMatrix) =
    hcat([reverse(X[:, j]) for j in 1:size(X, 2)]...)

# This algorithm also reverses each observation and is generic but supposes observations are
# rows. There is no explicit copying, because `transpose(X)` is a view of `X`:
compliant_algorithm(X::AbstractMatrix) = uncompliant_algorithm(transpose(X)) |> transpose

using BenchmarkTools

Xraw = rand(100, 10000)
Xgood = transpose(Xraw)
Xbad = permutedims(Xraw)

# performance of original algorithm:
@btime uncompliant_algorithm($Xraw);
# 1.556 ms (10002 allocations: 8.62 MiB)

# compliant algorithm is degraded when using `Matrix`:
@btime compliant_algorithm($Xbad);
# 2.307 ms (10002 allocations: 8.62 MiB)

# but performance is recuperated if user uses `transpose(::Matrix)` type:
@btime compliant_algorithm($Xgood);
# 1.575 ms (10002 allocations: 8.62 MiB)

# In all three benchmarks, allocations are the same.

ablaom commented 1 year ago

I'd benchmark the degradation in your case. If it's really a concern, you could have your interface include something along these lines:

verbosity > 0 && @info "Input matrix `X` is a `Matrix`. Performance may be improved by providing "*
     "`transpose(permutedims(X))` instead. "

antoninkriz commented 12 months ago

I implemented the standard MLJ API, so data can now be passed as function(X), but I also kept the old API, function((X, :column_based)), so anyone who wants can still go with the zero-copy option, or use function(transpose(X)).
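Schematically, at the machine level (a sketch of the call forms just described; model and X are placeholders):

machine(model, X, y)                   # standard MLJ form: observations as rows
machine(model, (X, :column_based), y)  # legacy form: zero-copy, observations as columns
machine(model, transpose(X), y)        # standard form, zero-copy via a lazy transpose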

This way no breaking changes are introduced, and the interface should now, hopefully, be compliant with other MLJ models and standards.

These changes and this behavior are now also properly highlighted in the model documentation (most notably in this commit: https://github.com/antoninkriz/TimeSeriesClassification.jl/pull/7/commits/9ce6927dc80a75c3d88ca5e0b30bb15c6b929a74).

What is your opinion?

ablaom commented 11 months ago

Mmmm. It's not going to work to allow data to be passed in the form (X, :column). The reason has to do with subsampling. When you evaluate a model using MLJ's evaluate! function, for example, and use, say, cross-validation, observations are resampled using MLJModelInterface.selectrows. This method only works as expected for matrices (with observations as rows), vectors, and tables. Any other kind of data (e.g., a class weight dictionary) doesn't get resampled, i.e., it's left as is.

We could overload selectrows to resample data of the form (X, :column) but there would need to be a really strong justification for this, which does not exist in this case, in my opinion. (In the future, I'd like to move away from MLJ's selectrows towards the getobs interface of MLUtils, which again will not work for data of the form (X, :column)).
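To illustrate (a small sketch, assuming MLJBase, which provides selectrows, is loaded):

using MLJBase

X = rand(6, 3)      # six observations, as rows
y = rand(6)
selectrows(X, 1:2)  # 2x3 matrix: the first two observations
selectrows(y, 1:2)  # the first two elements of the target vector
# As noted above, data that is not a matrix, vector, or table (e.g. the
# tuple `(X, :column)`) is not resampled and is passed along unchanged.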

Giving the user the option to specify :row/:column can be accommodated, but it would have to be as a third argument to your MLJModelInterface.fit method, so that, correspondingly, a machine is instantiated like this:

machine(model, X, y, :column)

but I suggest machine(model, X, y) should also work. (Model-specific keyword arguments to fit and machine are not supported.)

Since :column is neither a matrix, a vector, nor a table, selectrows just leaves :column alone in resampling, so all is well in that case.

An alternative is to make :column/:row a hyperparameter (a field of the model), but the general MLJ philosophy is that hyperparameters should be specifiable without seeing data.

I don't think you can avoid making breaking changes to address MLJ requirements.

BTW, standard practice in Julia development is to make your first release 0.1.0, so that the next breaking release is 0.2.0, and so forth. When the API stabilises, you make a 1.0 release. I think you can "yank" the 1.0 release from the registry, which I would recommend. You may want to get some advice on this, perhaps on the Julia Slack channel #pkg-registration.

ablaom commented 11 months ago

Just FYI: https://github.com/xKDR/TSFrames.jl

ablaom commented 8 months ago

@antoninkriz Any update re an MLJ-compliant solution here?

antoninkriz commented 8 months ago

@ablaom Sorry, I was quite overwhelmed with university in the past few months.

If I remember correctly, the code should now be compliant with MLJ (though I should check again), utilizing the transpose(...) solution. It really turned out to be the best one.

What remains is yanking 1.0 and re-releasing under 0.1. After finishing my exams this month, I'll focus on this again.

And again, thanks for all your help!