fani-lab / SEERa

A framework to predict the future user communities in a text streaming social network based on the users’ topics of interest.

Stats of the toy news dataset + TAGME issue #37

Closed soroush-ziaeinejad closed 2 years ago

soroush-ziaeinejad commented 2 years ago

@hosseinfani,

The stats of the toy news dataset are as follows:

| Statistic | Value |
| --- | --- |
| rows | 33,788 |
| text-available rows | 26,447 |
| average text length (words) | 401.7 |
| title-available rows | 31,631 |
| average title length (words) | 6.6 |
| description-available rows | 22,624 |
| average description length (words) | 23.7 |
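
For reference, counts and averages of this kind could be computed roughly as below. This is only a minimal sketch: the `field_stats` helper, the column names (`title`, `text`, `description`), and the toy rows are assumptions for illustration, not the project's actual code or data, and "available" is assumed to mean "non-empty after stripping whitespace".

```python
def field_stats(rows, field):
    """Return (number of rows where `field` is non-empty,
    average word count of `field` over those rows)."""
    lengths = [
        len(row[field].split())
        for row in rows
        if row.get(field) and row[field].strip()
    ]
    available = len(lengths)
    average = sum(lengths) / available if available else 0.0
    return available, average


# Toy rows with assumed column names; NOT the real news dataset.
rows = [
    {"title": "Storm hits coast",
     "text": "A strong storm hit the coast today",
     "description": ""},
    {"title": "", "text": None, "description": "Short summary here"},
]

for field in ("text", "title", "description"):
    available, average = field_stats(rows, field)
    print(field, available, round(average, 1))
```

Averaging only over rows where the field is available is an assumption here; averaging over all rows would give lower numbers.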

Now we need the TagMe-annotated data. I can either run the TagMe API on the text of the news articles, or we can use the words in the article titles instead. Titles usually contain the most important words about the content. Please let me know your comments on this. Thanks.

hosseinfani commented 2 years ago

@soroush-ziaeinejad Thank you, agreed. In the next iteration, we have to refactor the pipeline for the case when tagme=true is selected. Just for our future reference: if tagme is selected, we have to do it in a lazy-loading way. That is, load the tweets/news that already have TagMe annotations if they exist; otherwise 1) run TagMe on each tweet/news item, and 2) save the results in ./data/toy/preprocessed/tweets|news.tagme.csv
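
The lazy-load idea above might be sketched as follows. This is a rough illustration under stated assumptions, not the pipeline's actual code: `load_or_annotate` is a hypothetical helper, `annotate` is a stand-in callable for whatever wraps the TagMe call (no real TagMe API is used here), the CSV layout (`id`, `annotations` joined by `|`) is invented, and caching is done per file rather than per item.

```python
import csv
import os


def load_or_annotate(items, cache_path, annotate):
    """Lazy load: reuse saved annotations if the file exists,
    otherwise annotate every item and save the results.

    items:      dict mapping item id (str) -> text
    cache_path: where annotations are stored (assumed CSV layout)
    annotate:   callable text -> list of str; stand-in for a TagMe wrapper
    """
    if os.path.exists(cache_path):
        # Annotations already exist: load them instead of re-annotating.
        with open(cache_path, newline="", encoding="utf-8") as f:
            return {
                row["id"]: row["annotations"].split("|") if row["annotations"] else []
                for row in csv.DictReader(f)
            }

    # 1) Annotate each tweet/news item.
    annotated = {item_id: annotate(text) for item_id, text in items.items()}

    # 2) Save the annotations for later runs.
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "annotations"])
        writer.writeheader()
        for item_id, annotations in annotated.items():
            # Assumes annotations never contain the "|" separator.
            writer.writerow({"id": item_id, "annotations": "|".join(annotations)})
    return annotated
```

A call would look something like `load_or_annotate(news, "news.tagme.csv", my_tagme_wrapper)`, where both the file name and `my_tagme_wrapper` are placeholders. The second call with the same path reads the file and never invokes the annotator.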

please do the following: