Open bnaman50 opened 3 years ago
Thanks for your attention. Could you tell me what's the difference between your dataset and the one I provided? I will check it.
Hey @weili-baidu, thanks for your response.
Following the README, I downloaded the multi-news data from Google Drive, processed it with the script you provided, and compared the result with the preprocessed graph file that you linked. The numbers in the two outputs do not match.
Here is the preprocessed file that I generated after running your pre-processing scripts. This is the simple, straightforward way I used to check whether the two files are the same.
import json

# My output: extract the zip file I provided
with open('./src/data_preprocess/graphsum/'
          'MultiNews_data_tfidf_paddle/MultiNews.30.test.0.json', 'r') as f:
    my_data = json.load(f)

# Your output: after extracting the original zip file
with open('./src/data_preprocess/orig_multi_news/'
          'MultiNews_data_tfidf_30_paddle/test/MultiNews.30.test.0.json', 'r') as f:
    orig_data = json.load(f)

# The sample order is not the same because of the unordered imap in your code.
# The matching indices below were found by eyeballing.
my_sample = my_data[0]
orig_sample = orig_data[1]

check_equal = my_sample == orig_sample
print(check_equal)
Your feedback will be appreciated.
Thanks, Naman
The test set has two parts, MultiNews.30.test.0.json and MultiNews.30.test.1.json. The order of the data instances may differ between the two versions, so I don't think your script can reliably show whether the data itself is different. Could you compare them more precisely?
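One way to make the comparison independent of sample order and of how samples are split across the two shards is to pool both shards on each side and compare them as multisets. The following is only a minimal sketch: the helper names (canonical, load_shards, compare_as_multisets) are hypothetical and not part of the repository, and it assumes each shard is a JSON list of samples, as the indexing in the script above suggests.

```python
import json
from collections import Counter


def canonical(sample):
    """Serialize a sample with sorted keys so equal samples give equal strings."""
    return json.dumps(sample, sort_keys=True, ensure_ascii=False)


def load_shards(paths):
    """Concatenate the samples from several JSON shard files (each assumed to be a list)."""
    samples = []
    for path in paths:
        with open(path, 'r', encoding='utf-8') as f:
            samples.extend(json.load(f))
    return samples


def compare_as_multisets(mine, orig):
    """Return (only_in_mine, only_in_orig) as Counters keyed by canonical string."""
    mine_counts = Counter(canonical(s) for s in mine)
    orig_counts = Counter(canonical(s) for s in orig)
    # Counter subtraction keeps only positive counts, i.e. the unmatched samples.
    return mine_counts - orig_counts, orig_counts - mine_counts


# Example usage (directories taken from the script above; adjust as needed):
# shards = ['MultiNews.30.test.0.json', 'MultiNews.30.test.1.json']
# mine = load_shards(['./src/data_preprocess/graphsum/MultiNews_data_tfidf_paddle/' + s
#                     for s in shards])
# orig = load_shards(['./src/data_preprocess/orig_multi_news/MultiNews_data_tfidf_30_paddle/test/' + s
#                     for s in shards])
# only_mine, only_orig = compare_as_multisets(mine, orig)
# print(len(mine), len(orig), sum(only_mine.values()), sum(only_orig.values()))
```

If both returned counters are empty, the two versions contain exactly the same samples regardless of order. If the samples contain floating-point values, tiny numerical differences would also be reported as mismatches, so rounding those values before comparing may be worth trying.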
Hey,
Thanks for providing the code. I was trying to fine-tune the model on my own dataset. This led me to look at your pre-processing code to make sure I am doing things correctly on my own dataset.
I compared the processed multi-news data that you provided with the data I generated, but the results do not seem to match.
To generate the processed dataset, I used max_nsents=30 and left the other parameters at their defaults.
I understand that the ordering can be different because of the unordered map, but I also looked at the samples individually and they are not the same. Is it possible that the authors of the multi-news dataset have updated their files, and that is why I am observing this discrepancy?
Thanks, Naman