bigscience-workshop / catalogue_data

Scripts to prepare catalogue data
Apache License 2.0

Removing dataset lm_en_a_million_news_headlines_abc_australia #8

Open HugoLaurencon opened 2 years ago

HugoLaurencon commented 2 years ago

I think it would be better to remove this dataset from the list.

Here are some random example documents. They are all like this:

Doc 0: adrian bayley minimum prison term extended 10 years over rapes

Doc 1: egg farm break in

Doc 2: stoner claims grand prix in portugal

Doc 3: palau typhoon bopha watch

Doc 4: dna breakthrough on unsolved rape

Doc 5: labor says mortgage stress at record high

Doc 6: concerns raised over carbon capture

Doc 7: nigeria to set up regional anti boko haram force

Doc 8: habib says torturers used information from

Doc 9: mixed bag for wine production

Doc 10: sue butler said it

Doc 11: dog hitches ride from queensland to sa

Doc 12: tests show beach algae harmless

Doc 13: more support sought for chamber of commerce

Doc 14: australia india engaged together to stop people

Doc 15: tim costello on financial crisis

Doc 16: push to save womens army camp ruins from roe highway extension

Doc 17: serial rapist convicted over knifepoint attacks

Doc 18: grandmother lorn cheng jailed for smuggling heroin from cambodia

Doc 19: fact check bradfield scheme barnaby joyce drought

Doc 20: gold coast man attacked with tomahawk

cakiki commented 2 years ago

Agreed.

TevenLeScao commented 2 years ago

Agreed. Any high-resource language dataset where over 80% of the documents are short should either go or be massively filtered imo.
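
For illustration, the 80% rule of thumb above could be checked with something like the sketch below. This is not code from the catalogue_data repo. The function names are hypothetical, and the 15-word cutoff for "short" is an assumption, since the thread does not define it; only the 80% threshold comes from the comment above.

```python
from typing import List

# Assumption: a document with fewer than this many whitespace-separated
# words counts as "short". The thread does not give a definition.
SHORT_DOC_MAX_WORDS = 15

# From the comment above: flag a dataset when over 80% of its documents are short.
MAX_SHORT_FRACTION = 0.8


def is_short(doc: str, max_words: int = SHORT_DOC_MAX_WORDS) -> bool:
    """Return True if the document has fewer than max_words words."""
    return len(doc.split()) < max_words


def short_fraction(docs: List[str], max_words: int = SHORT_DOC_MAX_WORDS) -> float:
    """Fraction of documents that are short (0.0 for an empty dataset)."""
    if not docs:
        return 0.0
    return sum(is_short(d, max_words) for d in docs) / len(docs)


def should_drop_or_filter(
    docs: List[str],
    max_words: int = SHORT_DOC_MAX_WORDS,
    max_fraction: float = MAX_SHORT_FRACTION,
) -> bool:
    """True if the share of short documents exceeds max_fraction."""
    return short_fraction(docs, max_words) > max_fraction


def filter_short(docs: List[str], max_words: int = SHORT_DOC_MAX_WORDS) -> List[str]:
    """Keep only the documents that are not short."""
    return [d for d in docs if not is_short(d, max_words)]


# A few of the headlines quoted earlier in this issue.
headlines = [
    "egg farm break in",
    "stoner claims grand prix in portugal",
    "palau typhoon bopha watch",
]
print(short_fraction(headlines))         # 1.0
print(should_drop_or_filter(headlines))  # True
```

With a headline-only dataset like this one, filtering would leave almost nothing behind, which is consistent with the suggestion to remove it outright.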