CurationCorp / curation-corpus

Code for obtaining the Curation Corpus abstractive text summarisation dataset
Creative Commons Attribution 4.0 International
122 stars 27 forks

Issues with the open-source dataset #4

Closed shreydesai closed 3 years ago

shreydesai commented 4 years ago

Hi Curation,

I would like to point out that, in its current form, this dataset is almost unusable. The main problem is that users have to scrape documents from the original websites. Website changes, URL forwarding, paywalls, etc. inevitably cause a lot of web-related errors that manifest in the documents and are very difficult to preprocess out. The provided preprocessing script (in its current state) does not do enough to filter out the noise. There are at least two larger problems with this:

1) Pre-trained models, like BERT, are very sensitive to the integrity of the document, and if there are spurious tokens or ill-formed sentences, the model will not be able to form coherent representations of the document. Seemingly minor things like [gallery ids=\"1318996,1318995,1318988,1318986,1319003\"] being injected in a document can cause the number of wordpieces to explode. This generally wouldn't be a big problem, but BERT (like most other pre-trained encoders) has a maximum sequence length of 512 wordpieces/subwords, so the encoder will not have a chance to see other sentences if a single sentence fills that window.

2) Results obtained on this dataset might not be reproducible given that the documents are not released with the original distribution. When I first used the scraper to collect this dataset (around the time when this dataset was published), I was only able to retrieve 39,917 documents, not 40k as is advertised on this repository. If someone were to run the scraper now, the recall may be substantially lower. I understand that there may be licensing issues associated with Curation releasing the documents along with the abstracts, but this is an important point for consideration -- people having different copies of this dataset (along with different content) will not be able to compare results in a scientifically meaningful way.
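
One way to make the reproducibility concern in point 2 concrete is for each user to report a fingerprint of the copy they scraped, so that two sets of results can at least be checked for having used the same documents. The following is a minimal sketch, not part of this repository: the function name `corpus_fingerprint` and the assumption that a scraped copy is held as a mapping from URL to document text are both hypothetical.

```python
import hashlib


def corpus_fingerprint(docs):
    """Return (document count, SHA-256 hex digest) for a scraped copy.

    `docs` is assumed to be a dict mapping article URL -> scraped text.
    URLs are processed in sorted order so the result does not depend on
    the order in which documents were scraped.
    """
    digest = hashlib.sha256()
    for url in sorted(docs):
        # Collapse whitespace so trivial formatting differences are ignored.
        text = " ".join(docs[url].split())
        digest.update(url.encode("utf-8"))
        digest.update(b"\0")  # separator between URL and text
        digest.update(text.encode("utf-8"))
        digest.update(b"\0")  # separator between records
    return len(docs), digest.hexdigest()
```

Two copies that differ in document count or in the content of any document produce different fingerprints, which flags that results obtained on them are not directly comparable. It does not say which copy is closer to the original articles.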

Finally, in the interest of providing constructive criticism, I'll point out some specific issues I saw in the documents that may be addressed by finer-grained quality control. These errors are not cherry-picked -- they are chosen from a random sample of about 100 documents/summaries.

1) HTML artifacts in the middle of a document

[DOCUMENT]
...
[gallery ids=\"1318996,1318995,1318988,1318986,1319003\"]
...

2) Non-document related content appears in the document

[DOCUMENT]
...
For more information, visit www.GoSafr.com. Contacts Elevate Communications, for SafrLucy Muscarella, 617-312-6411cell: 858-353-1359lmuscarella@elevatecom.comorSafrJoanna Humphrey Flynncell: 617-549-1718Marketing and PR Managerjoanna@gosafr.com

3) Tables get squeezed in with the rest of the document -- hard to preprocess out

[DOCUMENT]
...
Crude steel consumption2017*2018'2019f2020f2018'2019'2020' World steel consumption1.7011,7591.7621,7583.50.2-0.2 China7888107947762.8-1.9-2.3 European Union 281721751771791.81.21.0 United States1071111121114.01.0-1.0 India961021081155.36.16.3 Japan75737372-3.10.5-1.8 South Korea59595959-0.1-0.3-0.4 Russia43424242-0.90.30.0 Brazil222323230.71.71.5 Crude steel production20172018'2019f2020f2018'2019'2020' World steel production1.6891,7711.7681.7594.8-0.2-0.5 China8508868618424.2-2.8-2.2 European Union 281681721741752.40.60.9 Japan1051061081091.52.10.8 India1011081151236.56.86.9 United States828690905.44.30.1 Russia717272720.60.20.0 South Korea71717170-0.2-0.3-0.4 Brazil34343434-1.20.30.8 Notes: s estimate f forecast.

4) Document/summary pairs are incorrect

[DOCUMENT]
Published: 9:34am, 18 Jan, 2019Updated: 8:43pm, 18 Jan, 2019

[SUMMARY]
Italian insurance company Generali said it was ready to expand into Asia and Latin America after a restructuring that saw it sell unprofitable operations. A three-year strategy launched in November 2018 included a target of compound earnings per share annual growth of up to 8%. CEO Philippe Donnet said the firm was considering potential acquisitions of a bancassurance provider in Asia, a Central and Eastern European property and casualty insurer and a global health insurer. About three-quarters of Generalis business came from France, Germany and Italy but it already had a presence in 10 Asian markets.

5) Hitting a paywall

[DOCUMENT]
A recent event that received surprisingly little media attention serves as a reminder of a lurking cyber risk that is different in kind and scale than more widely and frequently reported privacy-related data breaches. Want to continue reading?Become a FreePropertyCasualty360 Digital Reader. INCLUDED IN A DIGITAL MEMBERSHIP: All PropertyCasualty360.com news coverage, best practices, and in-depth analysis. Educational webcasts, resources from industry leaders, and informative newsletters. Other award-winning websites including BenefitsPRO.com and ThinkAdvisor.com. Register Now Already have an account? Sign In Now

[SUMMARY]
A recent series of industrial fires in Iran, which have been blamed on hackers, has opened up the question of insuring against physical damage as a result of cyber breaches, according to Alex J Lathrop, a partner at law firmPillsbury Winthrop Shaw Pittman. Rather than relying on cyber insurance, which covers only data breaches, traditional commercial general liability, property and business interruption policies,where exclusions are not clearly indicated, should provide coverage for physical damage sustained during a cyber attack, he said.

6) Spurious client-side access errors

[DOCUMENT]
StackPath Please enable JavaScript This website is using a security service to protect itself from online attacks. The service requires full JavaScript support in order to view the website. Please enable JavaScript on your browser and try again. Reference ID: ad34175b419f6e80a9fe5cd1f5e57ab1

7) Summary in English, document is not

[DOCUMENT]
Regus slandi opnar dag, 24. janar kl. 17.00, formlega njan skrifstofukjarna 3. h Hafnartorgi. ar vera starfrktar 46 skrifstofur, fundarherbergi mismunandi strum og svi fyrir sameiginlega vinnuastu. Allar skrifstofur og starfsstvar eru afhentar viskiptavinum fullbnar me rafmagnsborum, skrifstofustlum og rum nausynlegum bnai, s.s. fjarskiptabnai, nettengingu og fleira segir frttatilkynningu fr flaginu. Skrifstofukjarni samanstendur af einkaskrifstofum, samnttum vinnusvum, setustofum, fundarherbergjum og fjarskrifstofum. tilefni af opnun skrifstofukjarnans hafa au Andrzej Mrozek-Folkierski, yfirmaur vrurunar Regus, og Roz Young, svisstjri Regus Evrpu, komi hinga til lands til a vera vi opnunina. Regus slandi rekur n egar fjra skrifstofukjarna hr landi undir merkjum Regus og Orange Project, rmla 4-6, Sktuvogi, Hfatorgi og Akureyri. Skrifstofukjarninn Hafnartorgi verur fimmti skrifstofukjarni flagsins. Tmas Hilmar Ragnarz, framkvmdastjri og eigandi Regus slandi opnunina vera skref inn framtina me ntmalegri skrifstofuastu. Fyrir utan a a vera vel stasett mibnum bur ntt hsni upp mikinn sveigjanleika, fallegt umhverfi og ga vinnuastu, segir Tmas Hilmar. a er ljst a eigendur fyrirtkja af llum strum og gerum horfa til ess a nta auknari mli sveigjanleika skrifstofurekstri me v a nta jnustu skrifstofukjarna.

[SUMMARY]
Regus has opened its fifth office in Iceland, located on the thirdfloor of Hafnartorg Kvosinn, Reykjavik. The new centre has 46 offices, meeting rooms and shared office space.

8) Lack of sentence separation

[DOCUMENT]
Wells Fargo picks four directors for sales scandal probe -source By Reuters Published: 18:22 EDT, 8 December 2016 | Updated: 18:22 EDT, 8 December 2016 By Dan FreedDec 8 (Reuters) - Wells Fargo & Co Chairman Stephen Sanger and Vice Chair Elizabeth Duke have been named to a four-member committee that will lead an internal investigation into the bank's recent sales scandal, a person familiar with the matter said on Thursday
...

9) Summaries are not well-formed (word separation issues)

[SUMMARY 1]
A diagram by the Bank of International Settlements (BIS) has revealed that China's shadow banking sector is even more indecipherable and complex than itsUS counterpart. In particular BIS pointed to uncertainty about who would befinancially responsible when adebt forequity swap defaulted.New and more complex structured shadow credit intermediation has emerged and quickly reached a large scale, notedBIS, with thiscomplexity packaged and sold through wealth management products. Ifthere were to be afinancial crash, this complexity would make it worse.

[SUMMARY 2]
The FAA has proposed its biggest fine of$1.9magainst SkyPan International, an aerial photography company,for illegally flyingdronesthroughbusy airspace above New York and Chicago. A total of 65 unauthorised flights were recorded over a two year period between 2012 to 2014.SkyPan failed to get a valid Certificate of Waiver for the flights, and 43 of those flights were over a tightly restricted Class B airspace in New York, without permission from air traffic control.

[SUMMARY 3]
The completion of new energy storage schemes with a combined 340.5 MW capacityin China in the first sixmonths of this year will almost equalthe total 389.4 MW capacity of energy storage facilitiesoperational in China at the end of 2017, according to the China Energy Storage Alliance (CNESA).The biggeststorage project built in H1 was actually eight linked lithium-ion battery modules, on one sitein Jiangsu Zhenjiang,addingup to 101 MW/202 MWh grid-connected capacity. The new facility started operations inJuly, CNESA said.

[SUMMARY 4]
Eurex Clearing is working on a clearing model that will enable buy-side members to clear directly, circumventing bank clearing members. Their plans have caused some concern with Europeanregulators over risk from lower-rated counterparties.

[SUMMARY 5]
Knotel is reportedly in talks with Wafra, a New York City-based investment firm owned by the Kuwaiti government pension fund, about an investment that would value the company at around$1.5bn. Wafra is expected to lead the funding round, potentially joined by Singapores sovereign-wealth fund. Knotel has expanded its portfolio of flexible office locations to around 200from 20 in early 2018. The discussion highlights strong interest in the sector from institutional investors.

10) Scraper does not handle errors gracefully

[DOCUMENT]
Exception

[SUMMARY]
Non-US companies have been warned there is an increased risk of enforcement action against them from the US's Securities and Exchange Commission (SEC). The warning came in an article in the Banking Law Journal following a ruling made earlier this year by the Court of Appeals for the Tenth Circuit in SEC v Scoville, which seemingly bolstered the regulator's extraterritorial enforcement authority under the Dodd-Frank Act. \"One risk created is that purely foreign transactions, which arguably influence the price of securities traded on United States exchanges, may be the subject of SEC enforcement actions\", the article noted.
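
Some of the failure modes listed above (1, 4, 5, 6 and 10) can be caught with simple heuristics. The sketch below is illustrative only and is not the repository's preprocessing script: the names `strip_shortcodes` and `rejection_reason`, the phrase list (taken from the samples above) and the 200-character threshold are all assumptions that would need tuning against real data.

```python
import re

# Shortcodes with at least one attribute, e.g. [gallery ids="1,2,3"],
# including the backslash-escaped quotes seen in sample 1.
# Requiring an attribute avoids matching plain bracketed words.
SHORTCODE = re.compile(r'\[[a-z_]+(?:\s+[a-z_]+=\\?"[^"\]]*")+\s*\]', re.IGNORECASE)

# Phrases copied from samples 5 and 6 above; not an exhaustive list.
BLOCK_PHRASES = (
    "Please enable JavaScript",
    "Want to continue reading?",
)

MIN_CHARS = 200  # assumed minimum length for a real article body


def strip_shortcodes(text):
    """Remove shortcode artifacts and collapse the spaces left behind."""
    return re.sub(r" {2,}", " ", SHORTCODE.sub(" ", text)).strip()


def rejection_reason(document):
    """Return a short label if the document looks unusable, else None."""
    text = document.strip()
    if text == "Exception":  # sample 10: scraper wrote its error as the document
        return "scraper_error"
    for phrase in BLOCK_PHRASES:  # samples 5 and 6: paywall or bot-protection page
        if phrase in text:
            return "blocked_page"
    if len(text) < MIN_CHARS:  # sample 4: only a timestamp line was captured
        return "too_short"
    return None
```

Pairs for which `rejection_reason` returns a label could be dropped or re-scraped. Flattened tables, contact blocks, non-English documents and missing word or sentence separators are not addressed by this sketch.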

HenryDashwood commented 4 years ago

Thanks for the feedback, Shrey.

"I understand that there may be licensing issues associated with Curation releasing the documents along with the abstracts" - You're right. That is the main issue when trying to do a release like this. We've tried to copy the way DeepMind released the CNN/DM dataset. Maybe someone will host the articles in a similar way. We would be delighted if they did!

I agree the scraper could be improved. It's something I'll keep working on. If you have any ideas for how it could be improved, or would like to submit a pull request, I would be happy to credit you.

dakshvar22 commented 4 years ago

@HenryDashwood Does the large version of this corpus also suffer from similar issues? Given that you are providing a commercial license for the larger dataset, have you applied any quality control to it to resolve some of the issues listed above?

HenryDashwood commented 3 years ago

New cleaner dataset with better scraper