Add Container and Block for Text

Chandu-4444 commented 2 years ago

I've made a first attempt at a simple text dataset recipe, based on the ImageFolders dataset recipe. It works for imdb and similarly structured datasets. Any feedback is highly appreciated.

Chandu-4444 commented 2 years ago

julia> using FastAI

julia> name, recipe = finddatasets(blocks=(Any, Any), name="imdb")[1]
Pair{String, FastAI.Datasets.DatasetRecipe}("imdb", TextFolders(FastAI.Datasets.parentname, false, FastAI.Text.var"#2#4"()))

julia> data, blocks = loadrecipe(recipe, datasetpath("imdb"))
((mapobs(loadfile, ["/home/luna/.julia/datadeps/fastai-imdb/imdb/test/neg/0_2.txt", "/home/luna/.ju…]), mapobs(parentname, ["/home/luna/.julia/datadeps/fastai-imdb/imdb/test/neg/0_2.txt", "/home/luna/.ju…])), (TextBlock(), Label{String}(["neg", "pos"])))

julia> text, class = obs = getobs(data, 1000)
("Every movie I have PPV'd because Leonard Maltin praised it to the skies has blown chunks! 
Every single one! 
When will I ever learn?<br /><br />Evie is a raving Old Bag who thinks nothing of saying she's dying of breast cancer to get her way! 
Laura is an insufferable Medusa filled with  The Holy Spirit (and her hubby's protégé)! 
Caught between these harpies is Medusa's dumb-as-a-rock boy who has been pressed into weed-pulling servitude by The Old Bag!<br /><br />
As I said, when will I ever learn?<br /><br />
I was temporarily lifted out of my malaise when The Old Bag stuck her head in a sink, but, unfortunately, she did not die. 
I was temporarily lifted out of my malaise again when Medusa got mowed down, but, unfortunately, she did not die. 
It should be a capital offense to torture audiences like this!<br /><br />
Without Harry Potter to kick him around, Rupert Grint is just a pair of big blue eyes that practically bulge out of its sockets.  
Julie Walters's scenery-chewing (especially the scene when she \"plays\" God) is even more shameless than her character.
<br /><br />
At least this Harold bangs some bimbo instead of Maude. 
For that, I am truly grateful. And if you're reading this Mr. Maltin, you owe me \$3.99!", "neg")

Chandu-4444 commented 2 years ago

I have started adding functions that mark words starting with an uppercase letter, or written entirely in uppercase, with special tokens such as xxmaj and xxup. The remaining preprocessing utilities can be reused from JuliaText.
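For illustration, here is a minimal sketch of what such a case-marking helper could look like. The names `mark_case` and `mark_case_tokens` are hypothetical and not taken from this PR, and the rules are a simplification: a word whose letters are all uppercase (and that has more than one letter) gets `xxup`, any other word starting with an uppercase letter gets `xxmaj`, and the word itself is lowercased.

```julia
# Special tokens named in the comment above.
const TK_UP = "xxup"    # marks a word written entirely in uppercase
const TK_MAJ = "xxmaj"  # marks a word that starts with an uppercase letter

# Hypothetical helper (not from the PR): prefix one word with a case token.
function mark_case(word::AbstractString)
    letters = filter(isletter, word)
    # Words without letters (numbers, punctuation) are left untouched.
    isempty(letters) && return String(word)
    if length(letters) > 1 && all(isuppercase, letters)
        return "$TK_UP $(lowercase(word))"
    elseif isuppercase(first(word))
        return "$TK_MAJ $(lowercase(word))"
    else
        return String(word)
    end
end

# Apply the rule to every whitespace-separated word of a text.
mark_case_tokens(text::AbstractString) = join(map(mark_case, split(text)), " ")
```

Calling `mark_case_tokens("Every single ONE of them")` under these assumptions yields `"xxmaj every single xxup one of them"`. Splitting on whitespace is a deliberate shortcut here; a real pipeline would likely run after a proper tokenizer.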

Chandu-4444 commented 2 years ago

> Is there a fastai tutorial that uses this dataset? Would be helpful to know what kind of tasks could be tackled with this.

Yes, fastai has a tutorial that uses this dataset: https://docs.fast.ai/tutorial.text.html. It focuses on sentiment analysis. The first part takes a language model (AWD-LSTM) that was pretrained on Wikipedia to predict the next word (language generation) and uses it directly to predict the sentiment of a given review. The second part follows the ULMFiT approach: the language model is first fine-tuned on the IMDb dataset and then used to predict the sentiment. They achieved state-of-the-art results using the second method.

I'll commit the suggested changes and build on them.

In parallel, I'll start reading the AWD-LSTM paper (https://arxiv.org/abs/1708.02182) to get a deeper understanding of how the model works. After that, the plan is to go through the ULMFiT paper (https://arxiv.org/abs/1801.06146).

lorenzoh commented 2 years ago

Sorry for letting this sit!

Tests were failing due to issues that should be fixed on master, so merging master into this should make the CI green.

The last thing that would be good to have is some tests.

Chandu-4444 commented 2 years ago

Sure! Will synchronise it with master and add some tests.

Chandu-4444 commented 2 years ago

To write tests for TextFolders(), I need access to the IMDb dataset. I remember Lorenz mentioning that using large datasets for testing isn't a good idea, as it might overload the CI system. The other recipes have smaller datasets that replicate the original, larger ones. I couldn't find such a dataset for IMDb (there is one available as a CSV file, but I need an IMDb-like directory structure to test the recipe). Is there any workaround?

ToucheSir commented 2 years ago

I wouldn't worry about testing the bits that require file IO for now; focus mostly on the helper functionality.
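As a rough sketch of what an IO-free test could look like: the snippet below checks a label helper on plain path strings, so no dataset has to be downloaded. `parent_dir_name` is a hypothetical stand-in written for this example (it is not the package's own function), and the second path is made up for illustration.

```julia
using Test

# Hypothetical stand-in for a label helper: the name of the directory
# that contains the file. It only manipulates the path string.
parent_dir_name(path::AbstractString) = basename(dirname(path))

@testset "label from path (no file IO)" begin
    # Path shape taken from the REPL output earlier in this thread.
    @test parent_dir_name("imdb/test/neg/0_2.txt") == "neg"
    # Made-up path, used only to cover the other class.
    @test parent_dir_name("imdb/train/pos/example.txt") == "pos"
end
```

Because the helper never touches the file system, tests like this stay fast and do not depend on the dataset being present on CI.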

Chandu-4444 commented 2 years ago

That sounds good!

FluxML / FastAI.jl

Add Container and Block for Text #207