CurationCorp / curation-corpus

Code for obtaining the Curation Corpus abstractive text summarisation dataset
Creative Commons Attribution 4.0 International

Dolt version of dataset #5

timsehn closed 3 years ago

timsehn commented 4 years ago

Hi Curation,

This is Tim, the CEO of the company that built Dolt and DoltHub. Dolt applies Git semantics to a SQL database, and DoltHub is a place to share those databases. We think this dataset makes a lot of sense on DoltHub.

I took the liberty of importing it (including the scraped articles):

https://www.dolthub.com/repositories/Liquidata/curation-corpus

We thought Dolt might be an interesting tool for you to check out.

--Tim

timsehn commented 4 years ago

I was only able to get content for ~3,400 articles using the provided scraper:

timsehn$ dolt sql -q "select count(*) from articles where article_content != 'Exception'"
+----------+
| COUNT(*) |
+----------+
| 3394     |
+----------+

timsehn commented 4 years ago

Added a saved query for the above:

https://www.dolthub.com/repositories/Liquidata/curation-corpus/query/master?q=select%20count(*)%20as%20total_count%2C%20(select%20count(*)%20from%20articles%20where%20article_content%20!%3D%20%27Exception%27)%20as%20articles_with_content%20from%20articles
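For readers who want to see what the saved query does without opening the link, here is a minimal sketch. It decodes the `q` parameter from the URL above and runs the resulting SQL against an in-memory SQLite table. SQLite is only a stand-in for Dolt here, and the `url` column and the three sample rows are made up for illustration; the only names taken from this thread are the `articles` table, the `article_content` column, and the `'Exception'` placeholder value.

```python
import sqlite3
from urllib.parse import parse_qs, urlsplit

# The saved-query link from the comment above, split across lines for readability.
SAVED_QUERY_URL = (
    "https://www.dolthub.com/repositories/Liquidata/curation-corpus/query/master"
    "?q=select%20count(*)%20as%20total_count%2C%20(select%20count(*)%20from%20articles"
    "%20where%20article_content%20!%3D%20%27Exception%27)%20as%20articles_with_content"
    "%20from%20articles"
)


def decode_saved_query(url: str) -> str:
    """Return the percent-decoded SQL stored in the URL's `q` parameter."""
    return parse_qs(urlsplit(url).query)["q"][0]


# Hypothetical schema: only `articles` and `article_content` appear in the thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (url TEXT PRIMARY KEY, article_content TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("https://example.com/a", "Full text of article A"),  # scraped successfully
        ("https://example.com/b", "Exception"),               # scrape failed
        ("https://example.com/c", "Full text of article C"),  # scraped successfully
    ],
)

sql = decode_saved_query(SAVED_QUERY_URL)
total_count, articles_with_content = conn.execute(sql).fetchone()

print(sql)
print(f"total_count={total_count}, articles_with_content={articles_with_content}")
```

The query returns one row with two columns: the total number of rows in `articles`, and the number whose `article_content` is anything other than the literal string `'Exception'`. Comparing the two shows how much of the corpus the scraper actually retrieved.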