Closed LisaWang0306 closed 2 years ago
Hi @LisaWang0306, you need to generate the data (i.e., the summary candidates) first. This is done in src/candidate_generation/main_candidate_generation.py, which generates num_beams candidates for each source document in the dataset.
Then, you need to score these candidates to evaluate the re-ranking. This is done in src/candidate_generation/main_scores.py.
Let me know if you need any help running these scripts.
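For readers following along, the two steps above (generate several candidates per source document, then score each candidate against the reference summary) can be illustrated with a toy sketch. This is not the repository's code: `unigram_f1`, `score_candidates`, and `best_candidate` are hypothetical names, and the unigram-overlap F1 is only a rough stand-in for whatever metric main_scores.py actually computes.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (a crude ROUGE-1-like score).

    Assumption: whitespace tokenization, lowercasing, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def score_candidates(candidates, reference):
    """Score every candidate summary against the reference summary."""
    return [unigram_f1(c, reference) for c in candidates]


def best_candidate(candidates, reference):
    """Return (index of the highest-scoring candidate, list of all scores)."""
    scores = score_candidates(candidates, reference)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores
```

Calling `best_candidate(candidates, reference)` on the num_beams candidates of one document gives the index of the candidate an ideal re-ranker would pick under this toy metric, which is the kind of signal the scoring step provides for evaluating re-ranking.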
Hi, when I execute src/candidate_generation/main_candidate_generation.py, it gives the following error. Could you please share a sample file to show how you organize your data files? It seems each line of the txt file is an article-summary pair?
FileNotFoundError: [Errno 2] No such file or directory: '/data/mathieu/DATASETS/RedditTIFU/data/en/val_text.txt'
Hi @Ravoxsg, thanks for your reply! I have the same question as @Hannibal046. The RedditTIFU dataset I found is a large JSON file. How do you organize it into txt files? Thanks!
Hi guys @Hannibal046 and @LisaWang0306, I've added a new script, main_download_dataset.py, which downloads the dataset from HuggingFace datasets and saves it into .txt files. Have a look at it here: https://github.com/Ravoxsg/SummaReranker/blob/main/src/candidate_generation/main_download_dataset.py
I've also updated the Readme accordingly.
Please note: you need to change every path flagged by a todo symbol in all the Python files whose names start with main (main_download_dataset.py, main_candidate_generation.py, main_scores.py, etc.).
These paths specify where you want to save the files (dataset files, generated summaries, summary scores, etc etc).
I typically save them outside of the code repository.
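As a rough illustration of the layout described above, here is a minimal sketch that writes one split into .txt files under a directory of your choice. It is not the actual main_download_dataset.py: `save_split` is a hypothetical helper, the `val_text.txt` name is taken from the error message earlier in this thread, and both the `val_summary.txt` name and the one-document-per-line format are assumptions.

```python
from pathlib import Path


def save_split(texts, summaries, out_dir, split):
    """Write <split>_text.txt and <split>_summary.txt, one document per line.

    Assumption: line i of the text file pairs with line i of the summary
    file, so internal newlines and tabs are collapsed into single spaces.
    """
    if len(texts) != len(summaries):
        raise ValueError("texts and summaries must have the same length")
    out_dir = Path(out_dir)
    # The target directory can live outside the code repository.
    out_dir.mkdir(parents=True, exist_ok=True)
    text_path = out_dir / f"{split}_text.txt"
    summary_path = out_dir / f"{split}_summary.txt"
    with open(text_path, "w", encoding="utf-8") as f_text, \
            open(summary_path, "w", encoding="utf-8") as f_summary:
        for text, summary in zip(texts, summaries):
            f_text.write(" ".join(text.split()) + "\n")
            f_summary.write(" ".join(summary.split()) + "\n")
    return text_path, summary_path
```

The `texts` and `summaries` lists could come from any source, for example a split loaded with the HuggingFace `datasets` library; `out_dir` plays the role of one of the paths you would otherwise edit by hand in the scripts.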
@Hannibal046 @LisaWang0306 I refactored the code to use relative paths, which should make all the steps easier to follow. Lmk.
Running the program requires data files in /data/mathieu/DATASETS/RedditTIFU/data/en/. How can I get those files?