PaddlePaddle / Research

novel deep learning research works with PaddlePaddle

About ACL2020-GraphSum #185

Open bnaman50 opened 3 years ago

bnaman50 commented 3 years ago

Hey,

Thanks for providing the code. I am trying to fine-tune the model on my own dataset, which led me to look at your pre-processing code to make sure I am handling my data correctly.

I compared the processed Multi-News data that you provided with the data I generated myself, but the two do not match.

To generate the processed dataset:

  1. I first downloaded the Multi-News data from this drive link. (I assume you used this particular version, since your README simply refers to their repo, but unfortunately their repo points to multiple versions of the dataset.)
  2. Next, I ran your pre-processing scripts with max_nsents=30 and all other parameters left at their defaults.

I understand that the ordering can differ because of the unordered map, but I also compared samples individually and they are not the same. Is it possible that the authors of the Multi-News dataset have updated their files, and that is why I am seeing this discrepancy?
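As a first coarse check (just a sketch, not the exact comparison I ran; the two directory names below are placeholders for wherever each processed set is extracted), counting the samples across all shards on each side already shows whether the two runs even produce the same amount of data:

import glob
import json
import os

# Placeholder directories -- point these at the two extracted processed sets.
PROVIDED_DIR = './provided_processed_multinews'
MY_DIR = './my_processed_multinews'

def count_samples(dir_path):
    """Sum the number of samples over every JSON shard in a directory."""
    total = 0
    for path in sorted(glob.glob(os.path.join(dir_path, '*.json'))):
        with open(path, 'r') as f:
            total += len(json.load(f))
    return total

print('provided:', count_samples(PROVIDED_DIR))
print('mine:', count_samples(MY_DIR))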

Thanks, Naman

Weili-NLP commented 3 years ago

Thanks for your attention. Could you tell me what the difference is between your dataset and the one I provided? I will check it.

bnaman50 commented 3 years ago

Hey @weili-baidu, thanks for your response.

Following the README, I downloaded the Multi-News data from Google Drive, processed it using the scripts you provided, and compared the result with the preprocessed graph file that you have linked. However, I observed a mismatch between the two outputs in terms of the actual numbers.

Here is the preprocessed file that I generated after running your pre-processing scripts. Below is the simple, straightforward check I used to see whether the two files are the same:

import json

# My output: after extracting the zip file I provided above.
with open('./src/data_preprocess/graphsum/'
          'MultiNews_data_tfidf_paddle/MultiNews.30.test.0.json', 'r') as f:
    my_data = json.load(f)

# Your output: after extracting the original zip file you provided.
with open('./src/data_preprocess/orig_multi_news/'
          'MultiNews_data_tfidf_30_paddle/test/MultiNews.30.test.0.json', 'r') as f:
    orig_data = json.load(f)

# The sample order is not the same because of the unordered imap in your code,
# so this matching pair of indices was found by eyeballing the data.
my_sample = my_data[0]
orig_sample = orig_data[1]

check_equal = my_sample == orig_sample
print(check_equal)
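On top of that, here is a rough field-level diff of the same hand-picked pair (a sketch only; it assumes each sample is a JSON object/dict, which I have not verified for every file):

# Rough field-level diff of the two hand-picked samples above.
if isinstance(my_sample, dict) and isinstance(orig_sample, dict):
    print('keys only in mine:', sorted(my_sample.keys() - orig_sample.keys()))
    print('keys only in original:', sorted(orig_sample.keys() - my_sample.keys()))
    for key in sorted(my_sample.keys() & orig_sample.keys()):
        if my_sample[key] != orig_sample[key]:
            print('field differs:', key)
else:
    print('samples are not dicts:', type(my_sample), type(orig_sample))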

Any feedback would be appreciated.

Thanks, Naman

Weili-NLP commented 3 years ago

The test set has two parts: MultiNews.30.test.0.json and MultiNews.30.test.1.json. Maybe the order of the data instances is different, so I think your script would not reflect the difference reliably. Could you compare them more precisely?
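For example, something like the following would compare the full test sets regardless of instance order (just a sketch: it reuses the paths from your snippet, so adjust the shard lists to whatever files each side actually contains, and it canonicalizes every sample by serializing it with sorted keys):

import json
from collections import Counter

def load_split(paths):
    """Concatenate all shards of a split and canonicalize every sample."""
    samples = []
    for path in paths:
        with open(path, 'r') as f:
            samples.extend(json.load(f))
    # Serializing with sorted keys makes the comparison insensitive to
    # both instance order and key order inside each sample.
    return Counter(json.dumps(s, sort_keys=True) for s in samples)

mine = load_split([
    './src/data_preprocess/graphsum/MultiNews_data_tfidf_paddle/MultiNews.30.test.0.json',
    './src/data_preprocess/graphsum/MultiNews_data_tfidf_paddle/MultiNews.30.test.1.json',
])
orig = load_split([
    './src/data_preprocess/orig_multi_news/MultiNews_data_tfidf_30_paddle/test/MultiNews.30.test.0.json',
    './src/data_preprocess/orig_multi_news/MultiNews_data_tfidf_30_paddle/test/MultiNews.30.test.1.json',
])

print('same number of instances:', sum(mine.values()) == sum(orig.values()))
print('identical as multisets:', mine == orig)
print('instances only in mine:', len(mine - orig))
print('instances only in original:', len(orig - mine))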