CurationCorp / curation-corpus

Code for obtaining the Curation Corpus abstractive text summarisation dataset
Creative Commons Attribution 4.0 International

What is the file format of data_path="../data/private_dataset.file"? #8

Closed: RiverTre closed this issue 3 years ago

RiverTre commented 3 years ago

I see that the dataset for fine-tuning is stored at ../data/private_dataset.file, and the code shows that it has at least the columns "text" and "summary". Could you describe the format of this file or share a small example of it?

RiverTre commented 3 years ago

Sorry to bother you, but as a newcomer to NLP, your help would mean a lot to me.

RiverTre commented 3 years ago

This is about https://github.com/CurationCorp/curation-corpus/blob/master/examples/bart/finetuning-bart.ipynb, which I was using when I tried to reproduce your code.

HenryDashwood commented 3 years ago

Hi there. We dropped support for the feather format because CSV was fast enough, so I don't think you'll find that file anymore. If you have the dataset as a CSV, you can change ds = pd.read_feather(args.data_path).iloc[:args.subset] in the next cell to ds = pd.read_csv(args.data_path).iloc[:args.subset].
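For anyone landing here later, a minimal sketch of what such a CSV could look like and how the changed line would read it. This is an illustration, not the actual dataset: the rows are made up, the file name private_dataset.csv and the load_dataset helper are hypothetical, and the only thing taken from this thread is that the notebook uses the "text" and "summary" columns. The real file may contain more columns.

```python
import os
import tempfile

import pandas as pd

# Made-up rows for illustration only; the real dataset is not shown in this thread.
sample = pd.DataFrame(
    {
        "text": [
            "First full article body ...",
            "Second full article body ...",
            "Third full article body ...",
        ],
        "summary": ["First summary.", "Second summary.", "Third summary."],
    }
)


def load_dataset(data_path, subset):
    """Hypothetical helper: read the CSV and keep the first `subset` rows."""
    ds = pd.read_csv(data_path).iloc[:subset]
    # The notebook relies on these two columns, so fail early if they are absent.
    missing = {"text", "summary"} - set(ds.columns)
    if missing:
        raise ValueError(f"CSV is missing required columns: {sorted(missing)}")
    return ds


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; point this at your own CSV instead.
    data_path = os.path.join(tmp, "private_dataset.csv")
    sample.to_csv(data_path, index=False)  # header row: text,summary
    ds = load_dataset(data_path, subset=2)

print(ds.columns.tolist())
print(len(ds))
```

Writing with index=False keeps the file to just the two named columns, so reading it back gives the same column layout the notebook expects.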

HenryDashwood commented 3 years ago

This notebook is a bit out of date now, though, as fastai2 has been merged into fastai. If you want to fine-tune BART with fastai, I would recommend looking at the summarisation code here: https://github.com/ohmeow/blurr

RiverTre commented 3 years ago

Thank you so, so much for the reply. I will check the blurr code today. (And yes, I am trying hard to fine-tune BART to summarise a medical paper dataset so that I can finish my final college paper.)