NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

DAPT codegen data curation tutorial #88

Open jvamaraju opened 1 month ago

jvamaraju commented 1 month ago

Description

Added a tutorial on curating data for the DAPT codegen use case with NeMo Curator.

Usage

codegen_DAPT\llm_data_collection\notebooks\data_curation-nemo-curator_DAPT.ipynb

Checklist

Maghoumi commented 4 weeks ago

@jvamaraju Thank you for your contribution. Here's some initial feedback after spending some time going through the PR.

High level comments

Since this is a tutorial, each topic needs an explicit entry point where we teach users how to accomplish a task or show them a best practice. We can follow a structure roughly similar to the one you walked us through in the weekly meetings. I think the tutorial would greatly benefit from a "flow" in which users experience each data curation step in sequence. Consider these (hypothetical) steps:

Specifically, I think the data acquisition step also deserves a dedicated notebook where users can walk through the operations they need to perform before following along with the rest of the tutorial. Also, if any shell scripts need to be executed, we can add cells for them inside the notebook using IPython's % magic commands (see this example). This way, users won't have to worry about the order in which to run the scripts.
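To illustrate the point about script ordering: inside a notebook this is usually a `!bash some_script.sh` line or a `%%bash` cell, but the same idea can be sketched in plain Python that any code cell can run. The helper name `run_scripts_in_order` and the script names in the usage note are hypothetical, not files from this PR.

```python
import subprocess


def run_scripts_in_order(scripts, runner=subprocess.run):
    """Run each shell script in the given order, stopping at the first failure.

    `runner` defaults to subprocess.run; it is a parameter only so that the
    helper can be exercised without actually launching a shell.
    """
    completed = []
    for script in scripts:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later scripts never run after a failure.
        runner(["bash", str(script)], check=True)
        completed.append(str(script))
    return completed
```

A notebook cell could then call, for example, `run_scripts_in_order(["download_data.sh", "extract_data.sh"])` (hypothetical names), which fixes the execution order in one place instead of leaving it to the reader.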

Lastly, where relevant, I suggest clearing the cell outputs from the notebooks. I found myself scrolling a lot while going through the code and trying to understand the structure. For instance, this notebook is around 2 MB, and most cell outputs contain logs that users could reproduce by running it locally. It should be safe to remove all of those outputs from the checked-in file. In some cases it makes sense to keep an output (e.g. when we're showing the user a sample result), and keeping those is fine.
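As a rough sketch of how the outputs could be cleared while keeping selected ones, the snippet below edits the notebook JSON directly using only the standard library. The `keep-output` tag is a hypothetical convention, not something the PR or Jupyter defines; cells carrying that tag are left untouched. If no outputs need to be kept, `jupyter nbconvert --clear-output --inplace <notebook>` should achieve the same result.

```python
import json
from pathlib import Path


def clear_outputs(notebook, keep_tag="keep-output"):
    """Remove outputs from code cells, except cells tagged with `keep_tag`.

    `notebook` is the parsed .ipynb JSON (nbformat 4 layout assumed).
    The tag name is a hypothetical convention chosen for this sketch.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no outputs
        tags = cell.get("metadata", {}).get("tags", [])
        if keep_tag in tags:
            continue  # sample output we want readers to see
        cell["outputs"] = []
        cell["execution_count"] = None
    return notebook


def clear_outputs_in_file(path, keep_tag="keep-output"):
    """Load a notebook file, clear its outputs, and write it back in place."""
    path = Path(path)
    notebook = json.loads(path.read_text(encoding="utf-8"))
    clear_outputs(notebook, keep_tag=keep_tag)
    path.write_text(
        json.dumps(notebook, indent=1, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
```

Tagging the few cells whose output is instructive keeps the sample results visible while the long log outputs are dropped from the checked-in file.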

Organization

The tutorial consists of four main types of files:

  1. Python scripts
  2. Shell scripts
  3. Notebook files
  4. Data files consisting of
    1. Images (e.g. PNG)
    2. JSONL (e.g. this one)
    3. Data that the user downloads.

One challenge I faced while going through the repo was that related files appear in different places. To improve readability and organization, I suggest the following structure (this is a rough sketch; feel free to adapt it as you see fit):

PDF Cleaner

This is a very useful, standalone tool that could be reused throughout the repo. In its current form, it looks like a step of the tutorial. I'll defer to @ayushdg or others on the best way to organize it. It also includes Docker image creation and similar steps, which appear to be outside the scope of the tutorial. Let's brainstorm the most logical way to integrate this tool into the repo.