arcee-ai / DALM

Domain Adapted Language Modeling Toolkit - E2E RAG
https://www.arcee.ai
Apache License 2.0

dalm qa-gen toy_data_train.csv doesn't work out of the box. #72

Closed: bdiu29 closed this issue 11 months ago

bdiu29 commented 11 months ago

I'm trying to generate triplets using dalm qa-gen on my local CSV file, which has a column called 'Passage' that contains my chunked texts. Apparently it's expecting a title column? Any chance you guys have an example of how the input data should be formatted for ingestion? Thanks!

bdiu29 commented 11 months ago

Okay, so it looks like I need to create unique titles for each chunk.

Jacobsolawetz commented 11 months ago

@bdiu29 that's right, it needs both a passage and a title column.
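
A quick way to catch this before running qa-gen might be to check the CSV header first. This is only a minimal sketch: the column names 'Passage' and 'Title' are taken from this thread and are an assumption about what qa-gen expects, and `missing_columns` is a hypothetical helper, not part of DALM.

```python
import csv
import io

# Column names assumed from this thread; the exact names qa-gen expects may differ.
REQUIRED_COLUMNS = ("Passage", "Title")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in required if name not in header]


# Example with an in-memory CSV that only has a 'Passage' column.
sample = io.StringIO("Passage\nfirst chunk of text\nsecond chunk of text\n")
print(missing_columns(sample))  # prints ['Title']
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.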

Jacobsolawetz commented 11 months ago

@bdiu29 here is a toy example https://github.com/arcee-ai/DALM/blob/e6c3d293e7c75a43a6cfb3d78681969c1219e8c8/dalm/datasets/toy_data_train.csv

bdiu29 commented 11 months ago

Thanks! I figured it out. I used my text chunks in the 'Passage' column and passed them into Llama2 to generate unique 'Title' values for each chunk.
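
For anyone who wants unique titles without running an LLM, something like the following could work as a stopgap. It is a rough sketch, not what I actually ran: `add_titles` is a hypothetical helper, the 'Passage' and 'Title' column names are the ones from this thread, and the title is just a placeholder built from the row number plus the first few words of each chunk.

```python
import csv
import io


def add_titles(src, dst, passage_col="Passage", title_col="Title", words=6):
    """Copy rows from src to dst, adding a unique title built from each passage."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or [])
    if title_col not in fieldnames:
        fieldnames.append(title_col)
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for index, row in enumerate(reader, start=1):
        # Placeholder title: row number plus the first few words of the passage.
        # The row number keeps titles unique even when two passages start alike.
        # Swap this line for an LLM call to get more descriptive titles.
        snippet = " ".join(row[passage_col].split()[:words])
        row[title_col] = f"{index}: {snippet}"
        writer.writerow(row)


# Example with in-memory files; use open(path, newline="") for real CSVs.
source = io.StringIO("Passage\nThe quick brown fox jumps over the lazy dog\n")
output = io.StringIO()
add_titles(source, output)
print(output.getvalue())
```

Any existing 'Title' values are overwritten by this sketch, so run it on a copy of the data.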

I also tried running dalm qa-gen on toy_data_train.csv, and I think it gave me the same missing 'Title' column error.