DAPT codegen data curation tutorial

@jvamaraju Thank you for your contribution. Here's some initial feedback after spending some time going through the PR.

High-level comments

Since this is a tutorial, each topic needs an explicit entry point where we teach users how to accomplish a task or show them a best practice. We can follow a structure roughly similar to the one you walked us through in the weekly meetings. I think the tutorial would benefit greatly from a "flow" that lets users experience each data curation step in order. Consider these (hypothetical) steps:

Step 1: Dependency Installation
- Before users do anything, they need to know every dependency they must install. This includes pip packages, the PDF cleaner, etc. We could use either a notebook or a README file for this. I think a notebook would be better because it is interactive.
Step 2: Data Acquisition
- Substep 1: Users need to download the data from the Google Drive share. Show them where the data is located and which directory it needs to go under.
- Substep 2: Users need to run some shell scripts to traverse the downloaded data and create the relevant metadata.
- ... [The rest follows] ...
Step 3: Data Curation using NeMo Curator
- Substep 1: Install dependencies
- Substep 2: .....
Step 4: Data Curation without NeMo Curator
- Substep 1: .....
- Substep 2: .....

Specifically, I think the data acquisition step deserves its own notebook where users can walk through the operations they need to perform before following the rest of the tutorial. Also, if shell scripts need to be executed, we can add cells for them inside the notebook using IPython magics (e.g. %%bash, or the ! prefix; see this example). This way, users won't have to worry about the order in which to run the scripts.
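To make the ordering point concrete, here is a rough sketch of a single notebook cell that runs the acquisition scripts in a fixed order and stops at the first failure. It uses plain Python (subprocess) rather than a magic so it also works outside a notebook. The function name and the script names in the usage comment are hypothetical, not the ones in the PR.

```python
import subprocess


def run_in_order(scripts, cwd="."):
    """Run each shell script with bash, in the given order.

    check=True raises CalledProcessError on the first non-zero exit code,
    so later scripts never run on top of a failed earlier step.
    """
    for script in scripts:
        print(f"Running {script} ...")
        subprocess.run(["bash", str(script)], cwd=cwd, check=True)


# Hypothetical usage inside a notebook cell (script names are placeholders):
# run_in_order(["download_data.sh", "build_metadata.sh"],
#              cwd="scripts/1-data-acquisition")
```

Keeping the list of scripts in one cell means the order is documented in exactly one place, and users only have to run the notebook top to bottom.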

Lastly, where relevant, I suggest clearing the cell outputs from the notebooks. I found myself scrolling a lot while going through the code and trying to understand its structure. For instance, this notebook is around 2MB, and most cell outputs contain logs that users could reproduce by running it locally. We should be safe to remove all of those outputs from the checked-in file. In some cases it makes sense to keep an output (e.g. when we're trying to show a sample output to the user), and it's fine to keep those.
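As a rough sketch of what clearing outputs involves: a notebook file is JSON, and each code cell carries an "outputs" list and an "execution_count". The snippet below empties both using only the standard library. The "keep_output" tag used to preserve sample outputs is an assumed convention for illustration, and the function names are mine. Existing tools such as nbconvert or nbstripout can do the same job.

```python
import json
from pathlib import Path


def clear_outputs(notebook, keep_tag="keep_output"):
    """Empty the outputs of every code cell in a notebook dict (in place).

    Cells tagged with `keep_tag` (an assumed convention) are left alone so
    that sample outputs we want readers to see survive.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        tags = cell.get("metadata", {}).get("tags", [])
        if keep_tag in tags:
            continue
        cell["outputs"] = []
        cell["execution_count"] = None
    return notebook


def clear_notebook_file(path):
    """Read an .ipynb file, clear its outputs, and write it back."""
    path = Path(path)
    notebook = json.loads(path.read_text(encoding="utf-8"))
    clear_outputs(notebook)
    text = json.dumps(notebook, indent=1, ensure_ascii=False) + "\n"
    path.write_text(text, encoding="utf-8")
```

Running something like this before committing keeps the checked-in notebooks small while leaving the tagged sample outputs in place.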

Organization

The tutorial consists of 4 main types of files:

Python scripts
Shell scripts
Notebook files
Data files consisting of
1. Images (e.g. PNG)
2. JSONL (e.g. this one)
3. Data that the user downloads.

One challenge I faced while going through the repo was that related files are scattered across different places. To improve readability and organization, I suggest the following structure (this is a rough sketch, so feel free to adapt it based on your own intuition!):

Aggregate all scripts inside one folder (e.g. scripts). It can have subfolders to organize based on the step of data curation (e.g. scripts/1-data-acquisition or scripts/2-data-curation).
Organize Python scripts into logical units. For instance, helper scripts can all go inside a helpers directory, and reusable code can go into a common folder.
Aggregate all images in the same place. This has already been done, but some PNG files are still floating around.
Create a centralized place for all other types of data. Perhaps the dataset folder that you have can serve as this? I'm thinking we could create a dedicated folder inside it, dataset/downloaded, for users to download the dataset into. Everything else (e.g. JSONL files) can then reside directly in dataset.
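To visualize the proposal, here is a small sketch that creates the suggested folder skeleton. The scripts and dataset paths come from the suggestions above. Placing helpers and common under scripts, and naming the image folder "images", are my assumptions and can be changed freely.

```python
from pathlib import Path

# Proposed layout, relative to the tutorial root.
LAYOUT = [
    "scripts/1-data-acquisition",
    "scripts/2-data-curation",
    "scripts/helpers",      # assumption: helper scripts live under scripts/
    "scripts/common",       # assumption: reusable code lives under scripts/
    "images",               # assumption: name of the shared image folder
    "dataset/downloaded",   # where users place the downloaded data
]


def create_layout(root):
    """Create every folder in LAYOUT under `root`; safe to run repeatedly."""
    root = Path(root)
    created = []
    for relative in LAYOUT:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```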

PDF Cleaner

This is a very useful, standalone tool, and I think it can be reused throughout the repo. In its current form, however, it looks like a step of the tutorial. I'll defer to @ayushdg or others to comment on the best organization for this. It also contains Docker image creation, etc., which appears to be outside the scope of the tutorial. Let's brainstorm the most logical way of integrating this tool into the repo.
