FluxML / FastAI.jl

Repository of best practices for deep learning in Julia, inspired by fastai
https://fluxml.ai/FastAI.jl
MIT License
588 stars, 51 forks

Blocks and container added for Text Dataset #205

Closed: arcAman07 closed this 2 years ago

arcAman07 commented 2 years ago

Registered the NLP (text) datasets to be added in the upcoming months. Added functions for the blocks of the text dataset. All the registered NLP datasets, along with their forthcoming models, will be added. I am exploring JuliaText, MLUtils, and other packages along with FastAI concepts so that these datasets work well with Flux. As almost all the text datasets are in CSV format, it will be easy to load them and create the corresponding container. I am working on further concepts to implement these text datasets.

So far I have added the entire basic structure of the text data, comprising the blocks and the containers. I have spent the past week researching (reading the FastAI docs and codebase). I am currently working on adding a textrow block along with recipes.jl. I am also working on two datasets, "imdb" and "amazon_review_full"; they have different folder structures, so different blocks will be required. I am also going through the two papers that built state-of-the-art models for these two datasets and working on their implementation. Any reviews thus far would be appreciated.

This reopens PR #100; I had to delete that repo due to a merging issue.

arcAman07 commented 2 years ago

The blocks and container are fully added (similar to the tabular datasets). I am currently working on the models (reading the papers), adding the recipes, and exploring other libraries. I also have some questions about what needs to be added here.

darsnack commented 2 years ago

I'm not sure this PR is ready to review. It looks like it is a copy-paste of the table data blocks with some renaming. That's a good way to start, but note that table and text data are not necessarily the same. I would suggest seeing one of the subtasks to completion. For example, actually add the recipe that you are proposing and demonstrate that it loads correctly.

arcAman07 commented 2 years ago

Yep, I was working on this as a draft PR. I will add a container that works for textual datasets; I was just experimenting and checking the results with TableDataset. The blocks and the main Text.jl are added currently. I have been reading the paper "Character-level Convolutional Networks for Text Classification" (Advances in Neural Information Processing Systems 28, NIPS 2015), which uses the amazon_full_csv and ag_news_csv datasets to train the model, so that I am well versed in the training method. Simultaneously, I am working on recipes.jl to load the datasets.

arcAman07 commented 2 years ago

So I have made a couple of changes. I really wanted to fit the text data (e.g. amazon_review_full_csv, ag_news_csv, etc.) into a tabular data format, as even the official fastai tutorial for text datasets does it that way. Most of the text datasets to be implemented are news datasets without any headers or column names, but they consist mainly of three columns: "rating", "title" and "news". So I added those headers in recipes.jl, which I was working on, so that the data is easier for the user to understand, both while developing the tutorial and for visualization. I have tested it locally and it is working perfectly. Encodings (I will add more encodings specific to text while working on the model for training), containers and blocks are currently added for the text dataset. I am now working on writing tests for these blocks and containers, along with the training implementation for these models. With this format, all of the text datasets can be added. I would appreciate some reviews so that I can improve it further.

arcAman07 commented 2 years ago

[2 images attached]

arcAman07 commented 2 years ago

[2 images attached]

This is for the ag_news_csv dataset.

darsnack commented 2 years ago

Is there a specific tutorial you are targeting here? It would be helpful to reference that as we review.

arcAman07 commented 2 years ago

The tutorial I plan to write is inspired by the official fastai text tutorial (https://docs.fast.ai/tutorial.text.html). The data is visualized in a tabular format (the classes and the text are shown, which the TextClassificationRecipe struct can provide), and the rest of the tutorial deals with training the model and its visualizations. I plan to start with the news datasets, as the paper I referred to earlier in the PR covers the architecture required for training models on these datasets.

arcAman07 commented 2 years ago

To all the maintainers: is there a need to add a text transformation/cleaning module to this package, as there is in the fastai Python package? If we are working with JuliaText it might not be required, and if need be we can add the functions we require to the existing JuliaText repos.

lorenzoh commented 2 years ago

Not super familiar with the domain, but I assume we can reuse functionality from JuliaText, though that will have to be wrapped by Encodings to work with the rest of high-level FastAI.jl machinery.

These should definitely be separate PRs though! As Kyle mentioned above, I think it's best to focus this PR on adding a recipe for a text dataset and then work on additional features in separate PRs.

arcAman07 commented 2 years ago

Great, I am reading through the paper "Character-level Convolutional Networks for Text Classification" to implement the architecture for training on the various news datasets used here, along with going through JuliaText and its packages, which we can use for text transformations and encodings. I have currently added the blocks and container to load the recipes, which are working well (in a tabular way, just like in the official fastai tutorial). I would love some feedback so that I can finish this PR in its entirety and start working on the encodings and tutorial in another PR.

darsnack commented 2 years ago

This particular task seems like a classification task on table data. Does it need a separate dataset recipe type, or can it just reuse the table stuff?

Like Lorenz suggested, I think the transforms, etc. should be left out of this PR and only the recipes added. This PR has added a lot of recipes, which is great! But the current loadrecipe appears to hardcode column names, etc. I would suggest rewriting the datasets to use the existing tabular recipes, then separately thinking about a text classification task that has a TableRow + Label block. You can look at the tabular classification task as an example.
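
To make the point about hardcoded column names concrete, here is a minimal sketch in plain Julia (Base only) of a loader for a headerless CSV that takes the column names as an argument. Both function names are hypothetical and are not part of FastAI.jl or JuliaText, and the sample row is made up. In practice a package such as CSV.jl would likely handle the parsing; the hand-written splitter is only there to keep the example self-contained, and it does not handle quoted fields that span several lines.

```julia
# Illustrative sketch only: `split_csv_line` and `load_headerless_rows` are
# hypothetical helpers, not part of FastAI.jl or any JuliaText package.

# Split one CSV line into fields, honoring double-quoted fields and "" escapes.
function split_csv_line(line::AbstractString)
    fields = String[]
    buf = IOBuffer()
    inquotes = false
    chars = collect(line)
    i = 1
    while i <= length(chars)
        c = chars[i]
        if c == '"'
            if inquotes && i < length(chars) && chars[i + 1] == '"'
                write(buf, '"')      # escaped quote inside a quoted field
                i += 1               # skip the second quote of the pair
            else
                inquotes = !inquotes # opening or closing quote
            end
        elseif c == ',' && !inquotes
            push!(fields, String(take!(buf)))  # field separator
        else
            write(buf, c)
        end
        i += 1
    end
    push!(fields, String(take!(buf)))  # last field on the line
    return fields
end

# Read headerless CSV rows from `io`, naming the columns with `colnames`,
# which the caller supplies instead of the loader fixing them internally.
function load_headerless_rows(io::IO, colnames::NTuple{N,Symbol}) where {N}
    rows = NamedTuple[]
    for line in eachline(io)
        isempty(strip(line)) && continue  # skip blank lines
        fields = split_csv_line(line)
        length(fields) == N ||
            error("expected $N fields, got $(length(fields))")
        push!(rows, NamedTuple{colnames}(Tuple(fields)))
    end
    return rows
end

# Usage with a made-up sample row; a real recipe would open the dataset file.
sample = IOBuffer("\"3\",\"Some title, with a comma\",\"Body with \"\"quotes\"\" inside\"\n")
rows = load_headerless_rows(sample, (:rating, :title, :news))
println(rows[1].title)
```

With the names passed in, the same loader could serve a dataset whose columns differ from "rating", "title" and "news" without any change to the loading code.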