abisee / cnn-dailymail

Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization
MIT License

How to generate the anonymized version? #19

Open Oscar860601 opened 6 years ago

Oscar860601 commented 6 years ago

@abisee Did you write code for generating the anonymized version of the cnn-dailymail summarization dataset?

AlJohri commented 6 years ago

@Oscar860601

The original data is from here: https://github.com/danqi/rc-cnn-dailymail

The code to download them is here: https://github.com/deepmind/rc-data

Oscar860601 commented 6 years ago

Oh, I meant anonymized summarization data. Only non-anonymized summarization data and anonymized QA data are available for cnn-dailymail. I was just wondering whether there is open source code to obtain the anonymized summarization data, since it's widely used. Thanks a lot anyway.

AlJohri commented 6 years ago

The same dataset for QA was repurposed for summarization. If you look at generate_questions.py it should get you most of the way there.

https://github.com/deepmind/rc-data/blob/d305ea5de230e519a4d358232819c9291e286d66/generate_questions.py#L144-L156

https://github.com/deepmind/rc-data/blob/d305ea5de230e519a4d358232819c9291e286d66/generate_questions.py#L199-L210

https://github.com/deepmind/rc-data/blob/d305ea5de230e519a4d358232819c9291e286d66/generate_questions.py#L250-L259

https://github.com/deepmind/rc-data/blob/d305ea5de230e519a4d358232819c9291e286d66/generate_questions.py#L330-L338

https://github.com/deepmind/rc-data/blob/d305ea5de230e519a4d358232819c9291e286d66/generate_questions.py#L469-L478
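
For orientation, here is a rough sketch of what the anonymization step boils down to: every mention of an entity in the article and its highlights is replaced by a shared `@entityN` marker. This is not the code from generate_questions.py. The function name `anonymize` and the `clusters` input are hypothetical, and it assumes you already have the entity mentions grouped by coreference from some NER/coref tool of your choice.

```python
import re
from typing import Dict, List, Tuple


def anonymize(texts: List[str],
              clusters: List[List[str]]) -> Tuple[List[str], Dict[str, str]]:
    """Replace entity mentions with @entityN markers shared across all texts.

    `clusters` is a hypothetical input: one list of surface forms per entity,
    e.g. [["Barack Obama", "Obama"], ["Berlin"]]. Mentions are assumed to
    start and end with a word character so that \\b boundaries apply.
    """
    # Every mention in cluster i maps to the same marker @entity<i>.
    mapping: Dict[str, str] = {}
    for idx, cluster in enumerate(clusters):
        for mention in cluster:
            mapping.setdefault(mention, f"@entity{idx}")
    if not mapping:
        return list(texts), mapping

    # Try longer mentions first so "Barack Obama" wins over "Obama".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    )
    anonymized = [pattern.sub(lambda m: mapping[m.group(0)], t) for t in texts]
    return anonymized, mapping


# Article and highlight are anonymized together so the markers line up.
article = "Barack Obama met Angela Merkel in Berlin."
highlight = "Obama visits Berlin"
clusters = [["Barack Obama", "Obama"], ["Angela Merkel", "Merkel"], ["Berlin"]]
anon, mapping = anonymize([article, highlight], clusters)
print(anon[0])  # @entity0 met @entity1 in @entity2.
print(anon[1])  # @entity0 visits @entity2
```

Passing the article and its highlights in one call matters, because the markers are only meaningful if the same entity gets the same ID in both. The real pipeline may differ in how it detects entities and assigns IDs, so treat the linked lines as the reference.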

Oscar860601 commented 6 years ago

@AlJohri Thanks! I will try to modify this code.