Dear @KaiDMML,
I'm trying to use this dataset for my research.
I looked through some of the tweets and found that some are not related to the news at all.
For example, in the real category of politifact, articles from CQ.com have many Japanese tweets containing https://t.co/XXXXXX links.
politifact8005 is one of CQ.com's articles. It has many tweets, but most of them are just entries for promotional marketing campaigns (example tweet id: 1021190359525847040). Other tweets also refer to completely unrelated topics.
Also, I believe that news content.json contains a login error. Instead of the article data, it only contains the text of the login page:
Need help? Contact the CQ Hotline at (800) 678-8511 or hotline@cqrollcall.com
I can confirm a similar phenomenon in all the other categories.
Is this intended? I am currently filtering out those cases using unicodedata.east_asian_width().
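A minimal sketch of what such a filter might look like is below. It is not the exact code in use: the function names, the 0.3 threshold, and the sample strings are hypothetical and only illustrate the idea of flagging tweets whose text is mostly wide East Asian characters.

```python
import unicodedata


def east_asian_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are wide.

    "Wide" here means east_asian_width() reports "W" (wide) or
    "F" (fullwidth). Note that many emoji are also classified as "W",
    so this is only a heuristic, not a language detector.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    wide = sum(1 for c in chars if unicodedata.east_asian_width(c) in ("W", "F"))
    return wide / len(chars)


def looks_east_asian(text: str, threshold: float = 0.3) -> bool:
    """Flag text whose wide-character ratio reaches the (assumed) threshold."""
    return east_asian_ratio(text) >= threshold


# Made-up sample texts, not taken from the dataset.
samples = [
    "キャンペーンに応募しました https://t.co/XXXXXX",
    "Senate passes the bill after a long debate",
]
kept = [s for s in samples if not looks_east_asian(s)]
```

The threshold is a ratio rather than a simple "contains any wide character" check, so an English tweet with a single emoji or one quoted CJK word is not discarded.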