FluxML / MLJFlux.jl

Wrapping deep learning models from the package Flux.jl for use in the MLJ.jl toolbox
http://fluxml.ai/MLJFlux.jl/
MIT License

✨ Add 7 workflow examples for MLJFlux #256

Closed. EssamWisam closed this 3 months ago.

EssamWisam commented 4 months ago

Added the following workflow examples:

[image: list of the seven added workflow examples]

Sorry I didn't make the PR in the expected time. I encountered interruptions and was stuck on the live-training example for longer than I wanted; in particular, I was trying hard to get it to work in Jupyter notebooks.

Will make another PR for a single tutorial soon.

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.08%. Comparing base (ec6004f) to head (d5fe2c7). Report is 20 commits behind head on dev.

:exclamation: Current head d5fe2c7 differs from pull request most recent head 38ae2d5

Please upload reports for the commit 38ae2d5 to get more accurate results.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##              dev     #256   +/-   ##
=======================================
  Coverage   92.08%   92.08%
=======================================
  Files          12       12
  Lines         316      316
=======================================
  Hits          291      291
  Misses         25       25
```


EssamWisam commented 4 months ago

Added commit for another tutorial.

ablaom commented 4 months ago

Since you propose a tutorial based on an RNN, you should probably study https://github.com/FluxML/MLJFlux.jl/issues/225. See, in particular, the point about the observation dimension near the end.

ablaom commented 4 months ago

Looks like the Early Stopping markdown did not generate (I'm seeing "empty file").

ablaom commented 4 months ago

After some more thought, I wonder if the spam example isn't a little too long and challenging for documentation.

This is my understanding: You're representing messages as sequences of integers - one integer per word in some fixed vocabulary (and a special integer for out-of-vocabulary words). To train an RNN-based classifier you need to one-hot encode, but to do that lazily you want to use Flux's embedding layer, rather than an external one-hot encoding transformer, which would just produce a very large table. However, an obstacle in your approach is that MLJFlux's NeuralNetworkClassifier does not support categorical data, so you have to fool it by representing integers as floats, and then reverse this in a pre-processing layer of the NN preceding the Embed layer. Do I have this right?
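
For concreteness, the pattern I'm describing could be sketched roughly like this in plain Flux (the names, sizes, and the pooling stand-in for the RNN are placeholders of mine, not the code actually in the PR):

```julia
using Flux, Statistics

# Rough sketch only: all names and sizes here are assumptions, not the PR's code.
vocab_size = 2_000                    # assumed vocabulary size
embed_dim  = 16                       # assumed embedding dimension

intify(x) = Int.(round.(x))           # undo the "integers represented as floats" trick

chain = Chain(
    intify,                                      # Float codes -> Int token indices
    Flux.Embedding(vocab_size + 1 => embed_dim), # lazy one-hot encoding + linear map
    x -> mean(x, dims = 2),                      # simple pooling stand-in for the RNN part
    Dense(embed_dim => 2),                       # spam / ham scores
    softmax,
)

chain(Float32[3, 17, 9])              # one message, with token codes disguised as floats
```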

I can see how all this should work, and it's great to see this solution; however, it's definitely not MLJ-idiomatic, which adds a cognitive burden for users who want to first understand the "proper" way of doing things.

Of course, after the summer project, you will have extended MLJFlux to overcome this kind of issue, but with something probably more useful than one-hot encoding for high-cardinality features (we'll be learning an embedding directly into a lower-dimensional space) and something that can be transferred to other classifiers. So maybe a new version of this example would be appropriate then.

EssamWisam commented 4 months ago

> Looks like the Early Stopping markdown did not generate (I'm seeing "empty file").

@ablaom For some reason, I can see the full files in the GitHub review log.

EssamWisam commented 4 months ago

@ablaom

> After some more thought, I wonder if the spam example isn't a little too long and challenging for documentation.

It's specifically in the tutorials section of the documentation, which I think doesn't have to be limited to less challenging tutorials (especially with the rest of the docs and the workflow examples being present).

As for it being potentially long, I think it's shorter than the Boston and MNIST tutorials, and I tried to be systematic in explaining it so one does not get lost. As for it being potentially challenging, I would say that's unlikely, except for someone who has never come across RNNs before (in which case it helps them learn more about RNNs). Explaining more than I do would feel like a lecture and become quite long for those who have already used RNNs and just want to see more applications; also, tokenization, embeddings, and related concepts are standard material in NLP courses.

> This is my understanding: You're representing messages as sequences of integers ... To train an RNN-based classifier you need to one-hot encode, but to do that lazily ...

Yes, the embedding layer simulates the operation that would be performed on one-hot vectors with an equivalent operation on integers, which makes it more efficient.
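
To illustrate the equivalence (a toy sketch, not code from the tutorial): looking up a column of the embedding weight matrix gives the same result as multiplying that matrix by the corresponding one-hot vector, without ever materialising the one-hot vector.

```julia
using Flux

emb   = Flux.Embedding(5 => 3)   # toy vocabulary of 5 words, 3-dimensional embeddings
token = 4                        # integer code of one word

# Direct lookup versus the equivalent one-hot multiplication:
emb(token) ≈ emb.weight * Flux.onehot(token, 1:5)   # true
```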

> An obstacle in your approach is that MLJFlux's NeuralNetworkClassifier does not support categorical data, so you have to fool it by representing integers as floats, and then reverse this in a pre-processing layer of the NN preceding the Embed layer. Do I have this right?

Yes, correct. A sentence in this case is viewed as a sequence of categorical columns where each takes values from the same set (the vocabulary).

> I can see how all this should work, and it's great to see this solution; however, it's definitely not MLJ-idiomatic, which adds a cognitive burden for users who want to first understand the "proper" way of doing things.

Yes, I agree, but if we consider the degree of deviation from MLJ-idiomatic code, it's only the Intify layer, which is certainly easy to understand and well justified (i.e., it has negligible effect on learning or on generalizing to other applications, because it should be quite clear why we added this very minor layer).

A minor comment on "this should work": yes, it works as expected, as shown in the executed notebook/markdown.

> Of course, after the summer project, you will have extended MLJFlux to overcome this kind of issue...

After some reflection, I would like to emphasize that what follows the three dots may not be exactly precise. In particular, the difference between the embedding layer used here and the custom layer that will be added to MLJFlux to deal with categorical data is that the former shares weights among all words (and that's useful); meanwhile, the latter treats each column of the vector as a separate categorical variable with its own categories and, by default, uses a separate embedding layer for each column.
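
To make the distinction concrete (an illustrative sketch with made-up sizes, not the planned implementation):

```julia
using Flux

# Shared weights: every position of the token sequence is looked up in the SAME
# embedding table, as in the spam tutorial.
shared   = Flux.Embedding(2_000 => 16)
sentence = [3, 17, 9]            # one message as token indices
shared(sentence)                 # 16 × 3 matrix, one column per word

# Per-column embeddings: each categorical feature gets its OWN table, which is
# what a generic categorical encoder would use by default.
tables = (Flux.Embedding(10 => 4), Flux.Embedding(7 => 4))
row    = (2, 5)                  # one observation with two categorical features
vcat(map((t, x) -> t(x), tables, row)...)   # concatenated 8-element embedding
```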

Hence, I have thought about two resolutions for this issue:

Resolution A:

Why A makes sense:

Resolution B:

Why B makes sense:

EssamWisam commented 4 months ago

> Herculean effort, thanks!

> (I haven't checked the notebooks or markdown).

Thank you so much for the high-quality review. Much appreciated.

No need to check them, as they are automatically generated from the Julia file. I will also quickly glance over them in the documentation to check that everything is alright. (For example, the reason I had malformed indentation for some comments is that I was trying to reverse the already malformed rendering in Documenter; that is now fixed thanks to your "tab is four spaces" advice.)

ablaom commented 3 months ago

> I can see how all this should work, and it's great to see this solution; however, it's definitely not MLJ-idiomatic, which adds a cognitive burden for users who want to first understand the "proper" way of doing things.

I see you agree with my statement above, and for this reason I will push back against inclusion in the manual on this crucial point. Including non-idiomatic examples in core documentation is not a good idea, in my opinion.

I do agree we could immediately expand the allowed scitype for input to include Count, for use with genuine count data, expected to be represented as integers, which, of course, a neural network can deal with. However, your example remains non-idiomatic, because we are supplying Multiclass data in the guise of Count data (but we can then skip the coercion to and from floats).
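
To spell out the scitype distinction (illustrative only, not code from the example):

```julia
using ScientificTypes

scitype([3, 17, 9])                       # AbstractVector{Count}         -- integer token codes
scitype([3.0, 17.0, 9.0])                 # AbstractVector{Continuous}    -- the current float disguise
scitype(coerce([3, 17, 9], Multiclass))   # AbstractVector{Multiclass{3}} -- what the data really is
```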

Sorry, I'm not sure I understand the distinction between A and B, but I will say that weight sharing for categorical feature encoding sounds like a substantial addition of complexity from the user-interface point of view, because users would then have to specify exactly what that looks like. But we can discuss this on our next call.

EssamWisam commented 3 months ago

Alright, I see how this makes sense from a perfectionist perspective, or from the point of view of the beginner experience. I have hidden that tutorial accordingly. We can merge once you have seen the EarlyStopping tutorial.

Sure. We can talk about the distinction between A and B in the next call.

ablaom commented 3 months ago

Thanks. Merge when ready.

Note that I have just merged #251 onto dev, which has a change that likely breaks the examples: I believe you must now use optimisers from Optimisers.jl in place of the Flux.jl optimisers (which I'm guessing will ultimately be deprecated in any case). Since we will need to update the example Manifest.toml files, I propose we update the docs in a separate PR after the breaking release I plan to make soon.
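
For example, I expect the switch to look roughly like this (a sketch of the post-#251 interface as I understand it, not tested code):

```julia
using MLJFlux, Optimisers

# Before #251: optimiser = Flux.Adam(0.001)
# After #251:  pass an optimiser rule from Optimisers.jl instead
clf = NeuralNetworkClassifier(optimiser = Optimisers.Adam(0.001))
```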