Add script to prepare dataset from csv

Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.

https://lightning.ai

Apache License 2.0

6.95k stars 733 forks source link

Add script to prepare dataset from csv #462

Closed Anindyadeep closed 8 months ago

Anindyadeep commented 9 months ago

This commit aims to add a simple script to prepare a dataset from csv An assumption that the script make is the csv must contain these three columns: "instruction", "input", "output" (all case sensitive)

Fixes #329

Anindyadeep commented 8 months ago

Hi @aniketmaurya , can I please know the status of this PR, I see it is getting in hold for quite some days.

aniketmaurya commented 8 months ago

Hi @aniketmaurya , can I please know the status of this PR, I see it is getting in hold for quite some days.

hi @Anindyadeep, thanks for the PR. sorry for late response, I will review it this week!

Anindyadeep commented 8 months ago

Hi @aniketmaurya , can I please know the status of this PR, I see it is getting in hold for quite some days.

hi @Anindyadeep, thanks for the PR. sorry for late response, I will review it this week!

Thanks

Anindyadeep commented 8 months ago

@aniketmaurya Seems like tests are failing, this means should I need to change the requirements file or do the operations without pandas?

Anindyadeep commented 8 months ago

Hi @aniketmaurya, I am cool with it you can take it and finish it. Thanks for reviewing

Anindyadeep commented 8 months ago

LGTM. I added a test and made some changes to match the other prepare files.

Can this be added to tutorials/prepare_dataset.md? cc @rasbt

I can take this up additionally for the tutorial if you guys are okay with it. cc: @aniketmaurya @carmocca

rasbt commented 8 months ago

I can take this up additionally for the tutorial if you guys are okay with it. cc: @aniketmaurya @carmocca

Sounds great, @Anindyadeep !

I think we should include this new CSV approach above the "custom script" approach since this is potentially easier for some people. I modified the tutorial file accordingly and left a "TODO" section for you to fill in. Thanks for the great contribution in this PR here!

Also, if you have a draft, please let me know, I am happy to help and can then go over it as well.

Anindyadeep commented 8 months ago

Thanks @rasbt, I created the documentation, hope this should be good to go.

rasbt commented 8 months ago

Just tested the script and it works great. The only nit I have is, shouldn't it be

python scripts/prepare_csv.py --csv_path test_data.csv

instead of

python scripts/prepare_csv.py test_data.csv

for consistency, since we use --args in all other scripts?

Any thoughts @carmocca ?

Anindyadeep commented 8 months ago

Just tested the script and it works great. The only nit I have is, shouldn't it be
python scripts/prepare_csv.py --csv_path test_data.csv

Yeah, I agree with this.

Since it is a positional argument, I was not able to provide consistency, however, that can be changed if it is okay.

rasbt commented 8 months ago

Sorry, I actually meant changing it in the script itself

Anindyadeep commented 8 months ago

Sorry, I actually meant changing it in the script itself

Please correct me if we are on the same page or not. So I was talking about changing the script so that we can have --csv_path argument introduced 😅

rasbt commented 8 months ago

Oh, sorry for causing any confusion. So

So I was talking about changing the script so that we can have --csv_path argument introduced 😅

was what I had originally in mind. But then my brain thought you were thinking I meant the documentation (since I just reviewed that and provided feedback in the previous round of comments).

Long story short, what I have in mind is that it'd be nice to have the --csv_path in both the script (and then in the documentation).

I think the problem of not having the --csv_path is that it would show up when someone uses prepare_csv.py --help to figure out the usage.

Maybe the CLI class doesn't list it there because it's a positional argument. In this case, maybe we can set it to an empty string or None and raise an ValueError if it's not specified. Not sure.

Any thoughts there?

(CC @carmocca who is on top of all the stylistic choices in Lit-GPT)

Anindyadeep commented 8 months ago

Oh, sorry for causing any confusion. So

So I was talking about changing the script so that we can have --csv_path argument introduced 😅

was what I had originally in mind. But then my brain thought you were thinking I meant the documentation (since I just reviewed that and provided feedback in the previous round of comments).

Long story short, what I have in mind is that it'd be nice to have the --csv_path in both the script (and then in the documentation).

I think the problem of not having the --csv_path is that it would show up when someone uses prepare_csv.py --help to figure out the usage.

Maybe the CLI class doesn't list it there because it's a positional argument. In this case, maybe we can set it to an empty string or None and raise an ValueError if it's not specified. Not sure.

Any thoughts there?

(CC @carmocca who is on top of all the stylistic choices in Lit-GPT)

Yeah, I have a question, why we are using jsonargparse instead of fire , Which probably can handle positional arguments as the optional one.

Anindyadeep commented 8 months ago

Thanks, @carmocca, I have committed the suggestions and guess we are good to go.

Anindyadeep commented 8 months ago

Thanks, @aniketmaurya @rasbt @carmocca for the merge. It has been incredible. I am excited to contribute more.

rasbt commented 8 months ago

Thanks for the awesome contribution @Anindyadeep. This is super valuable! (PS: I am currently working on an article on datasets and be excited to mention your awesome PR) Thanks again!

Anindyadeep commented 8 months ago

Thanks for the awesome contribution @Anindyadeep. This is super valuable! (PS: I am currently working on an article on datasets and be excited to mention your awesome PR) Thanks again!

Thanks again and it would be awesome to get mentioned. PS: The LoRA Article is one of my best reads from your amazing set of articles. Excited to read the upcoming article on datasets.