fani-lab / SEERa

A framework to predict the future user communities in a text streaming social network based on the users’ topics of interest.

News Crawling #36

Closed soroush-ziaeinejad closed 2 years ago

soroush-ziaeinejad commented 2 years ago

A piece of code to retrieve the news articles from the links.

soroush-ziaeinejad commented 2 years ago

The code is currently running on 1,577,200 links (~10M dataset) and will take about 2 days to complete.

soroush-ziaeinejad commented 2 years ago

Steps:

  1. TweetEntities loaded
  2. NaN entities dropped
  3. Duplicate URLs dropped
  4. For each chunk of data (Expanded URLs), the four URLs (short, expanded, display, source) plus the article text, title, description, and publication timestamp are retrieved and stored (sketched below).
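A minimal sketch of these steps, assuming a tweet_entities.csv with illustrative column names (e.g., `ExpandedUrl`, `ShortUrl`) and using pandas plus the newspaper3k library; the actual crawler and schema in the repo may differ:

```python
import pandas as pd
from newspaper import Article  # newspaper3k

# 1-3. Load TweetEntities, drop NaN entities, and drop duplicate expanded URLs.
#      Column names here are assumptions for illustration.
entities = pd.read_csv('./data/tweet_entities.csv')
entities = entities.dropna(subset=['ExpandedUrl']).drop_duplicates(subset=['ExpandedUrl'])

# 4. Crawl each chunk of expanded URLs and keep the URLs plus text, title,
#    description, and publication timestamp.
records = []
for chunk_start in range(0, len(entities), 1000):
    chunk = entities.iloc[chunk_start:chunk_start + 1000]
    for _, row in chunk.iterrows():
        try:
            article = Article(row['ExpandedUrl'])
            article.download()
            article.parse()
            records.append({
                'ShortUrl': row.get('ShortUrl'),
                'ExpandedUrl': row['ExpandedUrl'],
                'DisplayUrl': row.get('DisplayUrl'),
                'SourceUrl': row.get('SourceUrl'),
                'Title': article.title,
                'Text': article.text,
                'Description': article.meta_description,
                'PublicationTime': article.publish_date,
            })
        except Exception:
            continue  # skip dead or unreachable links
    # Rewrite the accumulated records after each chunk so a crash
    # does not lose days of crawling progress.
    pd.DataFrame(records).to_csv('./data/News.csv', index=False)
```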

Update: based on the progress so far, crawling all links will take about 5-7 days.

The crawler is running on Google Colab and is accessible via this link: https://colab.research.google.com/drive/1LTI77gzRlmKLYf06aoHXV7aDwl5XOqh_?usp=sharing

hosseinfani commented 2 years ago

Hi @soroush-ziaeinejad, the code crawls the current version of each URL's webpage. What if the page has been updated since the tweet? We discussed that you could use archive.org's API to crawl the page as it appeared during the timespan of the dataset.
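For reference, a minimal sketch of what that lookup could look like, using the public Wayback Machine availability API; the timestamp value below is just an illustrative example, not the dataset's actual timespan:

```python
import requests

def closest_snapshot(url, timestamp='20101101'):
    """Return the Wayback Machine snapshot URL closest to `timestamp`
    (YYYYMMDD), or None if the page was never archived."""
    resp = requests.get('http://archive.org/wayback/available',
                        params={'url': url, 'timestamp': timestamp},
                        timeout=30)
    snapshot = resp.json().get('archived_snapshots', {}).get('closest')
    return snapshot['url'] if snapshot else None
```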

Also, why is the code separate from the codebase? Please put it in the SEERa repo in the apl layer. Assume there is a tweets_entities.csv in the ./data or ./data/toy folder.

soroush-ziaeinejad commented 2 years ago

Hi @hosseinfani,

I checked archive.org for a couple of links. First, it's too slow: loading a webpage takes about 45 seconds. They store 40-60 snapshots per URL, and I think their search over snapshots is the bottleneck. Second, it doesn't have all of our links, so we would get a lot of 'Hrm. The Wayback Machine has not archived that URL.' messages. Third, I randomly checked about 20 links, and for those I did find on archive.org, there was no difference between the current and the original content. So I think using the archive's API isn't worth it for us at this time.

I pushed the code to the apl layer last night. I just ran it outside the pipeline to produce the desired News dataset.

Yes, there is a tweet_entities.csv in the data folder (15MB), but it's not the whole table. The CSV of the whole tweet_entities table is 600MB.

hosseinfani commented 2 years ago

@soroush-ziaeinejad OK then. Regarding the tweet_entities.csv in the data folder (15MB): is it derived from the toy dataset of tweets? If not, make it consistent with the toy dataset.

soroush-ziaeinejad commented 2 years ago

@hosseinfani I will check. Right now, we only have Tweets.csv in the toy data folder. I will add the other tables as well.

hosseinfani commented 2 years ago

@soroush-ziaeinejad I'm closing this issue. Later, when you want to parallelize the crawler, create a new one.
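As a starting point for that follow-up, a hedged sketch of parallelizing the per-URL crawling with a thread pool; `fetch_article` is a hypothetical stand-in for whatever single-URL crawl function the repo ends up using, and the worker count is arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from newspaper import Article  # newspaper3k

def fetch_article(url):
    # Hypothetical single-URL crawl; returns None for dead or unreachable links.
    try:
        article = Article(url)
        article.download()
        article.parse()
        return {'url': url, 'title': article.title, 'text': article.text}
    except Exception:
        return None

def crawl_parallel(urls, workers=16):
    # Crawling is I/O-bound, so a thread pool gives a large speedup
    # over the current serial loop.
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_article, u) for u in urls]
        for future in as_completed(futures):
            result = future.result()
            if result is not None:
                results.append(result)
    return results
```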