Closed soroush-ziaeinejad closed 2 years ago
The code is currently running on 1,577,200 links (~10M dataset) and should take about 2 days to complete.
Steps:
Update: Based on the progress so far, crawling all links will take about 5-7 days.
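For reference, the two estimates above can be turned into an average crawl rate. This is only a back-of-the-envelope sketch: it assumes a constant rate over the whole run, and the helper name `links_per_second` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TOTAL_LINKS = 1_577_200  # link count quoted in this thread


def links_per_second(total_links: int, days: float) -> float:
    """Average rate needed to finish `total_links` in `days` days."""
    return total_links / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Compare the initial 2-day estimate with the updated 5-7 day estimate.
    for days in (2, 5, 7):
        rate = links_per_second(TOTAL_LINKS, days)
        print(f"{days} days -> {rate:.2f} links/s")
```

The 2-day estimate corresponds to roughly 9 links per second, while the 5-7 day estimate corresponds to roughly 2.6-3.7 links per second.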
The crawler is running on Google Colab and is accessible via this link: https://colab.research.google.com/drive/1LTI77gzRlmKLYf06aoHXV7aDwl5XOqh_?usp=sharing
Hi @soroush-ziaeinejad, the code crawls the current versions of the pages at these URLs. What if a page has been updated since then? We discussed that you can use the archive's API to crawl each page as it was during the timespan of the dataset.
Also, why is the code separate from the codebase? Please put it in the seera repo in the apl layer. Assume there is a tweets_entities.csv in the ./data or ./data/toy folder.
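A minimal sketch of what such a lookup could look like, assuming the Wayback Machine availability endpoint (`https://archive.org/wayback/available`), which returns the snapshot closest to a given timestamp. The function names are hypothetical and the timestamp in the usage line is only a placeholder, since the dataset timespan is not stated in this thread.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url: str, timestamp: str) -> str:
    """Build the query URL; `timestamp` uses the YYYYMMDD[hhmmss] format."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{query}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None  # the page was never archived


def lookup_snapshot(page_url: str, timestamp: str, timeout: float = 60.0) -> Optional[str]:
    """Query the endpoint over the network (can be slow)."""
    with urlopen(build_availability_url(page_url, timestamp), timeout=timeout) as response:
        return closest_snapshot(json.load(response))


if __name__ == "__main__":
    # Placeholder URL and date, for illustration only.
    print(lookup_snapshot("example.com", "20101201"))
```

When a URL has no snapshot, `archived_snapshots` comes back empty and the helper returns `None`, so the caller could fall back to crawling the live page.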
Hi @hosseinfani,
I checked archive.org for a couple of links. First, it is too slow: it takes about 45 seconds to load a webpage. Many pages are saved as 40-60 snapshots per URL, and I think the search algorithm is the bottleneck. Second, it doesn't have all of our links, so we would get a lot of 'Hrm. The Wayback Machine has not archived that URL.' messages. Third, I randomly checked about 20 links, and for those I found on archive.org there was no difference between the current and the initial content. So I don't think using the archive's API is worth it for us at this time.
I pushed the code to the apl layer last night. I only ran it outside the pipeline to produce the desired News dataset.
Yes, there is a tweet_entities.csv in the data folder (15MB), but it's not the whole table. The CSV of the whole tweet_entities table is 600MB.
@soroush-ziaeinejad OK then. Is the tweet_entities.csv in the data folder (15MB) for the toy dataset of tweets? If not, make it consistent with the toy dataset.
@hosseinfani I will check. Right now, we only have Tweets.csv in the toy data folder. I will add the other tables as well.
@soroush-ziaeinejad I'm closing this issue. Later, when you want to make it parallel, create a new one.
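One possible shape for the parallel version, sketched with a thread pool since crawling is I/O-bound. This is an illustration under assumptions, not the code in the repo: `crawl_parallel`, its `fetch` argument, and the default worker count are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def crawl_parallel(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    max_workers: int = 16,
) -> Dict[str, Optional[str]]:
    """Run `fetch` on every URL concurrently; failed URLs map to None."""
    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which URL each future belongs to.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # One bad link should not stop the whole crawl.
                results[url] = None
    return results


if __name__ == "__main__":
    def fake_fetch(url: str) -> str:
        # Stand-in for a real downloader, so the sketch runs offline.
        if "broken" in url:
            raise ValueError("simulated failure")
        return f"content of {url}"

    print(crawl_parallel(["http://a.example", "http://broken.example"], fake_fetch))
```

Passing the downloader in as `fetch` keeps the concurrency logic testable without network access.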
A piece of code to retrieve the news articles from the links.
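A rough, standard-library-only sketch of what such a piece of code could look like. It is not the code pushed to the apl layer: the CSV column name `url` is an assumption (the real layout of the entities table is not shown in this thread), and all function names are hypothetical.

```python
import csv
from html.parser import HTMLParser
from typing import Iterable, Iterator, List
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Reduce an HTML document to its whitespace-joined visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def read_links(lines: Iterable[str], column: str = "url") -> Iterator[str]:
    """Yield unique, non-empty links from CSV lines (column name assumed)."""
    seen = set()
    for row in csv.DictReader(lines):
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            yield url


def fetch_article(url: str, timeout: float = 30.0) -> str:
    """Download one page and return its visible text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_text(response.read().decode(charset, errors="replace"))
```

A file can be passed directly, for example `read_links(open("./data/toy/tweets_entities.csv", newline=""))`, assuming that file has a header row containing the assumed column.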