Alex-Fabbri / Multi-News

Large-scale multi-document summarization dataset and code

The data from google drive #3

Closed Oscar860601 closed 5 years ago

Oscar860601 commented 5 years ago

Hello, I am not sure whether the data on Google Drive has already been truncated, cleaned, fixed, and tokenized. In run_prep_newser.sh the data file is named train.txt.src.tokenizd.fixed.cleaned.truncated, but the file I downloaded from Google Drive is named only train.txt.src. Should I do all of those steps on my own? Thanks. Great work, by the way!

Alex-Fabbri commented 5 years ago

Hi, sorry for the late response! This issue somehow slipped by me. So those are already cleaned and tokenized. You can also see the story_tag_separator which separates sources. This should be good to go as input to a model. Let me know if you have additional questions.

Edit: I have added this function used to truncate the input to the model https://github.com/Alex-Fabbri/Multi-News/blob/master/data/scripts/prep_data.py

Alex-Fabbri commented 5 years ago

I just uploaded a version without most of the preprocessing (see readme for separator tokens). Closing this issue for now. Feel free to reopen with additional questions.

Vincent-Li-9701 commented 5 years ago

Hi Alex,

In the README, the dataset you linked above is labeled as the preprocessed dataset, and multi-news-original is the unprocessed one. I just want to confirm that the data in multi-news is the version ready to be used in the model. Am I correct? Please correct me if I'm mistaken.

Thank you

Alex-Fabbri commented 5 years ago

Sorry for the confusion. By preprocessed I was just referring to the tokenization, but this is still the non-truncated version of the data. To reproduce the truncated version, please see the truncate function. I will also upload a truncated version as soon as I can.

bnaman50 commented 3 years ago

Hey @Alex-Fabbri ,

I think there is a slight discrepancy between the data provided on Google Drive and the data generated by the scripts. I wanted to understand the code, so I ran the scripts for just one sample (id=136951).

This is what the source and target output look like when generated by your prep_data.py file (call it Case I) -

Source -

Iran is planning new military exercises near the strategic Strait of Hormuz , according to a naval commander , after threatening to close the strait and completing another set of maneuvers. In this picture ... 

(truncated for brevity)

Target -

– western officials are fairly sanguine that iran ' s recent provocations are only posturing . but on the streets of tehran , sanctions are taking a real toll , and the populace expects violence , the washington post reports . the country is in the midst of a currency crisis that has made prices shoot up on everything from iphones to crucial medicine — when either can be found at all . " i will tell you what this is leading to : war , " says one merchant . " my family , friends and i — we are all desperate . " iran certainly isn ' t doing anything to assuage those fears ; late yesterday its semi-official news agency announced that it would conduct another round of military exercises near the strait of hormuz next month . a naval commander says these war games will be " different " than the attention-grabbing 10-day drill iran ' s navy just completed , without specifying how , according to the ap . click for more on iran ' s recent internet crackdown .

Whereas this is what is in the val.txt.src file on Google Drive (call it Case II) -

Source -

iran is planning new military exercises near the strategic strait of hormuz , according to a naval commander , after threatening to close the strait and completing another set of maneuvers .     in this picture ...

(truncated for brevity)

Target -

( Newser ) – Western officials are fairly sanguine that Iran ' s recent provocations are only posturing. But on the streets of Tehran , sanctions are taking a real toll , and the populace expects violence , the Washington Post reports. The country is in the midst of a currency crisis that has made prices shoot up on everything from iPhones to crucial medicine — when either can be found at all. " I will tell you what this is leading to : war , " says one merchant. " My family , friends and I — we are all desperate. " Iran certainly isn ' t doing anything to assuage those fears ; late yesterday its semi-official news agency announced that it would conduct another round of military exercises near the Strait of Hormuz next month. A naval commander says these war games will be " different " than the attention-grabbing 10-day drill Iran ' s Navy just completed , without specifying how , according to the AP. Click for more on Iran ' s recent Internet crackdown.

I see the following differences -

  1. Text is lowercased in Case II but not in Case I.
  2. I think there is a paragraph separator (large space) in Case II but not in Case I, i.e., compare the space after the phrase "another set of maneuvers ." in Case I vs Case II.
  3. The target file has similar issues with the initial "-", and "Newser" is also removed.

I am using someone else's project which makes use of your processed dataset. My goal is to fine-tune their model on my own dataset. Thus, I just want to make sure I am doing the pre-processing correctly.

Could you please help me figure this out? This paragraph separator thing is really important for their model if I understand correctly.

Thanks, Naman

Alex-Fabbri commented 3 years ago

We probably used a slightly different script for the processed data in the paper (uploaded to Google Drive), which replaced NEWLINE_CHAR with the larger spaces you see and lowercased the text. I'd recommend that you use the raw data from this link, in which \n was replaced with "NEWLINE_CHAR" and "|||||" was appended to the end of each story. You can then use the NEWLINE_CHAR symbols to find the paragraph separators and process the paragraphs using the code from the project you mentioned.

Please let me know if there are any additional questions.
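
The splitting described above can be sketched as follows. This is only a minimal illustration, not the script used for the paper: the function names are hypothetical, and the width of the gap in `to_drive_style` is a guess, since the thread does not say how many spaces replaced NEWLINE_CHAR.

```python
STORY_SEP = "|||||"             # appended to the end of each story in the raw data
NEWLINE_TOKEN = "NEWLINE_CHAR"  # stands in for "\n" in the raw data


def split_stories(raw_line):
    """Split one raw example into its source stories (hypothetical helper)."""
    return [s.strip() for s in raw_line.split(STORY_SEP) if s.strip()]


def split_paragraphs(story):
    """Split one story into paragraphs at the NEWLINE_CHAR markers."""
    return [p.strip() for p in story.split(NEWLINE_TOKEN) if p.strip()]


def to_drive_style(story, gap="     "):
    """Approximate the Google Drive format: lowercased text with a wide gap
    between paragraphs. The gap width is an assumption, not a documented value.
    """
    return gap.join(split_paragraphs(story)).lower()


# Made-up example line, not taken from the dataset.
raw = "First para . NEWLINE_CHAR NEWLINE_CHAR Second para .|||||Other story .|||||"
stories = split_stories(raw)
paragraphs = [split_paragraphs(s) for s in stories]
```

Here `paragraphs` holds one list of paragraph strings per story, which can then be passed to whatever preprocessing the downstream project expects.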

bnaman50 commented 3 years ago

Hey @Alex-Fabbri ,

Thanks for your quick response. I just have one small question. How do I handle the discrepancy in the target text? I can easily add the - character at the beginning if the need arises, but how do I handle the extra words like ( Newser ) as shown in my previous post? I only looked at one particular sample, so I am not sure whether there are other such discrepancies.

The thing is, I need to pre-process my own data so that I can fine-tune an MDS model. So I just want to make sure that the pre-processing is the same as it was on the Multi-News data. Could you please help me out?

Thanks, Naman

Alex-Fabbri commented 3 years ago

Regarding the target, I believe all of the raw target texts start with "(NEWSER) –". As for the preprocessing, if you have a specific question, please let me know, although you may want to ask the authors of the repo you mentioned if you plan to use their preprocessing code.
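
If the prefix needs to be removed from raw targets, something like the following could work. It is a sketch under assumptions: `strip_newser_prefix` is a hypothetical helper, and the pattern is deliberately loose so that it also matches the tokenized form "( Newser ) –" shown earlier in the thread. It is not taken from prep_data.py.

```python
import re

# Matches "(NEWSER) –" at the start of a target, tolerating tokenizer spacing
# ("( Newser ) –"), any letter case, and either "–" or "-" as the dash.
NEWSER_PREFIX = re.compile(r"^\s*\(\s*newser\s*\)\s*[–-]\s*", re.IGNORECASE)


def strip_newser_prefix(target, keep_dash=True):
    """Remove the leading "(NEWSER)" tag; optionally keep the leading dash."""
    replacement = "– " if keep_dash else ""
    return NEWSER_PREFIX.sub(replacement, target, count=1)


# Targets without the prefix are returned unchanged.
cleaned = strip_newser_prefix("( Newser ) – Western officials are fairly sanguine")
```

Calling `.lower()` on the result would then give the lowercased style of the Case I target above.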

bnaman50 commented 3 years ago

Hey @Alex-Fabbri ,

Thanks for your response. It has been of huge help to me.

Best, Naman