cuilimeng / CoAID


Articles content often contains 'please sign in' etc. #7

Open chrisendacott opened 3 years ago

chrisendacott commented 3 years ago

The article content in each release contains large amounts of erroneous text, such as 'please sign in' prompts. Also, all article content is cut to 490 characters. Was this dataset the one used to get the benchmarks in the paper? Could we have the original data?
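For anyone who wants to gauge how widespread the problem described above is, here is a rough sketch. It assumes the article text has already been loaded into a list of strings; the phrase list and the function names are hypothetical, and only 'please sign in' is taken from this issue.

```python
# Hypothetical phrase list; only 'please sign in' is named in this issue.
BOILERPLATE_PHRASES = ("please sign in",)


def looks_like_boilerplate(content: str) -> bool:
    """Return True if the text contains a known boilerplate phrase."""
    lowered = content.lower()
    return any(phrase in lowered for phrase in BOILERPLATE_PHRASES)


def summarize(contents: list) -> dict:
    """Count flagged entries and report the longest content length."""
    flagged = sum(1 for c in contents if looks_like_boilerplate(c))
    longest = max((len(c) for c in contents), default=0)
    return {"total": len(contents), "flagged": flagged, "max_length": longest}
```

A `max_length` close to 490 across a whole release would be consistent with the truncation reported here.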

cuilimeng commented 3 years ago

We used the 05-01-2020 data for the experiments in the arXiv paper. For articles whose content was unavailable, we used the abstract or title instead. We only saved the first 500 characters of each article.
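The fallback and truncation described in this reply could look roughly like the sketch below. This is not the project's actual code: the field names `content`, `abstract`, and `title`, the function name, and the choice to try the abstract before the title are all assumptions made for illustration.

```python
MAX_CHARS = 500  # character limit stated in the reply


def select_text(record: dict, max_chars: int = MAX_CHARS) -> str:
    """Pick content, else abstract, else title, then truncate.

    Field names and fallback order are assumptions, not confirmed
    by the dataset authors.
    """
    for field in ("content", "abstract", "title"):
        value = (record.get(field) or "").strip()
        if value:
            return value[:max_chars]
    return ""
```

For example, a record with an empty `content` field and a non-empty `abstract` would yield the abstract, cut to at most 500 characters.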