Stats of the toy news dataset + TAGME issue #37

fani-lab / SEERa

A framework to predict the future user communities in a text streaming social network based on the users’ topics of interest.

| Statistic | Value |
| --- | --- |
| rows | 33,788 |
| text-available rows | 26,447 |
| average text length (words) | 401.7 |
| title-available rows | 31,631 |
| average title length (words) | 6.6 |
| description-available rows | 22,624 |
| average description length (words) | 23.7 |

@soroush-ziaeinejad thank you. Agreed. In the next iteration, we have to refactor the pipeline for the case where tagme=true is selected. For future reference: if tagme is selected, we should lazy-load, that is, load the tweets/news that already have TAGME annotations if they exist; otherwise 1) run TAGME on each tweet/news item, and 2) save the results in ./data/toy/preprocessed/tweets|news.tagme.csv
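For reference, the lazy-load behaviour described above could look roughly like the sketch below. It is only an illustration, not the SEERa code: `load_or_annotate`, the `id`/`text`/`annotation` column names, and the `annotate` callable (a stand-in for whatever wraps the real TAGME call) are all hypothetical.

```python
import csv
from pathlib import Path


def load_or_annotate(rows, cache_path, annotate):
    """Return annotated rows, reusing cache_path when it already exists.

    rows: list of dicts with (assumed) keys "id" and "text".
    annotate: hypothetical callable mapping text -> annotation string;
              a placeholder for the real TAGME request.
    """
    cache_path = Path(cache_path)

    # Cache hit: read the saved annotations and skip the annotator entirely.
    if cache_path.exists():
        with cache_path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))

    # Cache miss: 1) annotate every row ...
    annotated = [
        {"id": str(r["id"]), "text": r["text"], "annotation": annotate(r["text"])}
        for r in rows
    ]

    # ... 2) save the result so the next run can load it directly.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "text", "annotation"])
        writer.writeheader()
        writer.writerows(annotated)
    return annotated
```

The same load-if-exists-else-build-and-save shape would also fit the news crawling step, with the crawler in place of `annotate`.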

Please do the following:

- Rename NewNews.csv to News.csv.
- Add the code that extracts stats from the news file as a function in the news class.
- Put the news stats table in ./data/toy/readme.md.
- Add the code that extracts stats from the tweets file as a function in the DataReader class. For now, just 2-3 simple stats such as #tweets, avg #tweets/day, and #users.
- Put the tweets stats table in ./data/toy/readme.md.
- Make news crawling part of the pipeline in the apl layer as a lazy load, that is, load ./data/toy/news.csv if it exists; otherwise 1) crawl the URLs found in the tweets, and 2) save the crawled pages in ./data/toy/news.csv.
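As a starting point for the two stats functions, here is a rough sketch. It assumes the rows have already been read into a list of dicts (for example with `csv.DictReader`); the function names, the `created_at` and `user_id` column names, and the choice to average tweets over days that have at least one tweet are assumptions for illustration, not the existing SEERa schema.

```python
from datetime import datetime


def news_stats(rows, fields=("text", "title", "description")):
    """Count non-empty cells per field and the average word count of those cells."""
    stats = {"rows": len(rows)}
    for field in fields:
        # Treat missing, None, and whitespace-only cells as unavailable.
        values = [(r.get(field) or "").strip() for r in rows]
        values = [v for v in values if v]
        stats[f"{field}-available rows"] = len(values)
        stats[f"average {field} length (words)"] = (
            round(sum(len(v.split()) for v in values) / len(values), 1)
            if values
            else 0.0
        )
    return stats


def tweet_stats(rows, date_field="created_at", user_field="user_id"):
    """Return #tweets, avg #tweets/day, and #users.

    Assumes ISO-like timestamps such as "2010-12-01 10:00:00"; the average
    is taken over days that contain at least one tweet.
    """
    days = {datetime.fromisoformat(r[date_field]).date() for r in rows}
    users = {r[user_field] for r in rows}
    return {
        "tweets": len(rows),
        "avg tweets/day": round(len(rows) / len(days), 1) if days else 0.0,
        "users": len(users),
    }
```

The returned dicts map directly onto two-column tables, so they can be written out as the stat tables for ./data/toy/readme.md.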