How can I find the data files?

Ravoxsg / SummaReranker

Source code for SummaReranker (ACL 2022)

MIT License


How can I find the data files? #2

Closed LisaWang0306 closed 2 years ago

LisaWang0306 commented 2 years ago

Running the program requires data files in /data/mathieu/DATASETS/RedditTIFU/data/en/. How can I get those files?

Ravoxsg commented 2 years ago

Hi @LisaWang0306, you need to generate the data (i.e. the summary candidates) first. This is done in src/candidate_generation/main_candidate_generation.py, which generates num_beams candidates for each source document in the dataset.

Then, you need to score these candidates to evaluate the re-ranking. This is done in src/candidate_generation/main_scores.py.

Let me know if you need any help running these scripts.
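To make the scoring step concrete, here is a minimal sketch of scoring a list of candidates against a reference summary. It uses a plain unigram-overlap F1 as a stand-in metric; this is an assumption for illustration only, not what main_scores.py actually computes. The function names `unigram_f1`, `score_candidates`, and `best_candidate_index` are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 on whitespace tokens (a simple stand-in metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def score_candidates(candidates, reference):
    """Return one score per candidate, in the same order as the input."""
    return [unigram_f1(c, reference) for c in candidates]


def best_candidate_index(scores):
    """Index of the highest-scoring candidate (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)
```

With per-candidate scores like these, the quality of a re-ranker can be judged by comparing the score of the candidate it picks against the best score available among the candidates.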

Hannibal046 commented 2 years ago

Hi, when executing src/candidate_generation/main_candidate_generation.py, I get the following error. Could you please share a sample file showing how you organize your data files? It seems that each line of the txt file is an article-summary pair?

FileNotFoundError: [Errno 2] No such file or directory: '/data/mathieu/DATASETS/RedditTIFU/data/en/val_text.txt'

LisaWang0306 commented 2 years ago

Hi @Ravoxsg, thanks for your reply! I have the same question as @Hannibal046. The RedditTIFU dataset I found is a large JSON file. How did you convert it to txt files? Thanks!

Ravoxsg commented 2 years ago

Hi @Hannibal046 and @LisaWang0306, I've added a new script, main_download_dataset.py, which downloads the dataset from HuggingFace datasets and saves it into .txt files. Have a look at it here: https://github.com/Ravoxsg/SummaReranker/blob/main/src/candidate_generation/main_download_dataset.py

I've also updated the README accordingly.

Please note: you need to change every path flagged by a todo symbol in all the Python files starting with main (main_download_dataset.py, main_candidate_generation.py, main_scores.py, etc.).
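For anyone who wants to see the general shape of the .txt conversion, here is a rough sketch under stated assumptions; it is not the code in main_download_dataset.py. It assumes one document per line, with sources and summaries in separate files. The `{split}_text.txt` name follows the `val_text.txt` path in the error above, while `{split}_summary.txt` and the helper names are hypothetical. The (document, summary) pairs could come from HuggingFace datasets or from a JSON dump.

```python
from pathlib import Path


def to_single_line(text: str) -> str:
    """Collapse all whitespace (including newlines) so a document fits on one line."""
    return " ".join(text.split())


def save_split(records, split, out_dir):
    """Write (document, summary) pairs to two line-aligned .txt files.

    Line i of the text file and line i of the summary file belong together.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text_path = out_dir / f"{split}_text.txt"
    summary_path = out_dir / f"{split}_summary.txt"  # hypothetical file name
    with open(text_path, "w", encoding="utf-8") as f_text, \
            open(summary_path, "w", encoding="utf-8") as f_summary:
        for document, summary in records:
            f_text.write(to_single_line(document) + "\n")
            f_summary.write(to_single_line(summary) + "\n")
    return text_path, summary_path
```

Flattening newlines matters here: if a post keeps its internal line breaks, the text and summary files stop being line-aligned.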

Ravoxsg commented 2 years ago

These paths specify where you want to save the files (dataset files, generated summaries, summary scores, etc.).

Ravoxsg commented 2 years ago

I typically save them outside of the code repository.

Ravoxsg commented 2 years ago

@Hannibal046 @LisaWang0306 I refactored the code to use relative paths, so it should be easier to follow all the steps. Let me know.
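As an illustration of how relative paths can coexist with saving data outside the repository, here is a small sketch. It is an assumption about one possible setup, not the mechanism the repository uses: the helper `resolve_data_dir`, the default `../../data` location, and the `SUMMARERANKER_DATA_DIR` environment variable are all hypothetical.

```python
import os
from pathlib import Path


def resolve_data_dir(script_file, relative="../../data",
                     env_var="SUMMARERANKER_DATA_DIR"):
    """Return the directory where data files are read from and written to.

    If the (hypothetical) environment variable is set, it wins, which allows
    keeping data outside the code repository. Otherwise the path is resolved
    relative to the directory containing the calling script, so it does not
    depend on the current working directory.
    """
    override = os.environ.get(env_var)
    if override:
        return Path(override).expanduser().resolve()
    return (Path(script_file).resolve().parent / relative).resolve()
```

A script would call `resolve_data_dir(__file__)` once at startup and build every file path from the result, which avoids hard-coded absolute paths such as /data/mathieu/....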