Open jvamaraju opened 1 month ago
@jvamaraju Thank you for your contribution. Here's some initial feedback after spending some time going through the PR.
Since this is a tutorial, we need explicit entry points for each topic where we teach users how to accomplish something or show them best practices. We can follow a structure roughly similar to what you walked us through in the weekly meetings. I think the tutorial would greatly benefit from having a "flow" that lets users experience each data curation step in order. Consider these (hypothetical) steps:
Step 1: Dependency Installation
Step 2: Data Acquisition
Step 3: Data Curation using NeMo Curator
Step 4: Data Curation without NeMo Curator
Specifically, I think the data acquisition step also deserves a dedicated notebook where users can walk through the sequence of operations they need to do before they can follow along with the rest of the tutorial. Also, if there are shell scripts that need execution, we can make cells for them inside the notebook using the % operator (see this example). This way, users won't have to worry about the order of executing different scripts.
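If a plain Python cell is preferred over notebook magics, running the scripts in a fixed order could look roughly like the sketch below. It swaps in the standard library's `subprocess` module rather than the % approach mentioned above, and the script paths in the usage comment are hypothetical placeholders, not files from this PR.

```python
import subprocess


def run_scripts_in_order(scripts, interpreter="bash"):
    """Run each script in sequence, stopping at the first failure.

    `scripts` is an ordered list of paths; `interpreter` is the program
    used to execute each one (assumed to be bash for shell scripts).
    """
    completed = []
    for script in scripts:
        print(f"Running {script} ...")
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run on top of a failed earlier step.
        subprocess.run([interpreter, str(script)], check=True)
        completed.append(str(script))
    return completed


# Hypothetical usage inside a notebook cell (paths are placeholders):
# run_scripts_in_order([
#     "scripts/1-data-acquisition/download.sh",
#     "scripts/1-data-acquisition/extract.sh",
# ])
```

Keeping the order in one list means the notebook, not the reader, is responsible for the sequence.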
Lastly, where relevant, I suggest clearing the cell outputs from the notebooks. I found myself scrolling a lot while going through the code and understanding the structure. For instance, this notebook is around 2 MB, and most cell outputs contain logs that users could reproduce by running it locally. We should be safe to remove all those outputs from the checked-in file. In some cases it makes sense to keep an output (e.g. when we're trying to show a sample output to the user), and it's fine to keep those.
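Clearing outputs can be scripted before check-in. The sketch below is a minimal stdlib-only version that treats the notebook as the JSON it is; it assumes the nbformat v4 layout (code cells with `outputs` and `execution_count`). The `keep-output` tag is a hypothetical convention for marking the sample outputs worth keeping, not something this repo defines.

```python
import json
from pathlib import Path

# Hypothetical tag name: cells carrying it keep their outputs.
KEEP_TAG = "keep-output"


def clear_outputs(notebook, keep_tag=KEEP_TAG):
    """Strip outputs and execution counts from code cells, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no outputs
        tags = cell.get("metadata", {}).get("tags", [])
        if keep_tag in tags:
            continue  # deliberately kept as a sample output
        cell["outputs"] = []
        cell["execution_count"] = None
    return notebook


def clear_notebook_file(path):
    """Rewrite the .ipynb file at `path` with its outputs cleared."""
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    clear_outputs(notebook)
    nb_path.write_text(json.dumps(notebook, indent=1) + "\n", encoding="utf-8")
```

If no outputs need to be preserved, `jupyter nbconvert --clear-output --inplace <notebook>` does the same job without custom code.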
The tutorial consists of 4 main types of files:
One challenge I faced while going through the repo was that different files appear in different places. To improve readability and organization, I suggest the following structure (this is a rough sketch, feel free to adapt it based on your intuition, of course!):
[...] (scripts). It can have subfolders to organize based on the step of data curation (e.g. scripts/1-data-acquisition or scripts/2-data-curation).
[...] helpers directory. Reusable code can go into a common folder, for instance.
[...] dataset folder that you have can serve as this? I'm thinking we could create a specific folder inside it, dataset/downloaded, for users to download the dataset into. Everything else can then reside in dataset directly (e.g. JSONL files).
This is a very useful tool and is standalone. I think it can be reused over and over throughout the repo. In its current form, it looks like it is a step of the tutorial. I'll defer to @ayushdg or others to comment on the best organization for this. It also contains Docker image creation, etc., which appears to be out of scope for the tutorial. Let's brainstorm on the most logical way of integrating this tool into the repo.
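For reference, the folder layout suggested above can be sketched as a small script. This is only an illustration: the directory names come from the suggestion itself, the tutorial root is whatever the caller passes in, and the placement of the common folder is left out because the comment does not pin it down.

```python
from pathlib import Path

# Directory names taken from the suggested structure; all paths are
# relative to a tutorial root chosen by the caller.
LAYOUT = [
    "scripts/1-data-acquisition",
    "scripts/2-data-curation",
    "helpers",
    "dataset/downloaded",
]


def create_layout(root):
    """Create the suggested folders under `root` and return their paths."""
    root = Path(root)
    created = []
    for relative in LAYOUT:
        directory = root / relative
        # parents=True also creates scripts/ and dataset/;
        # exist_ok=True makes the call safe to repeat.
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created
```

Raw downloads would then land in dataset/downloaded, with derived files such as JSONL written to dataset directly.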
Description
Added a tutorial on curating data for the DAPT codegen use case with NeMo Curator
Usage
Checklist