UTMediaCAT / mediacat-domain-crawler

Internet domain crawler

Issue: The crawler is crawling too slowly; look for solutions to increase performance #19

Open · jacqueline-chan opened this issue 3 years ago

jacqueline-chan commented 3 years ago

Needs investigation. Some leads:

kstapelfeldt commented 3 years ago

@jacqueline-chan is going into the code to confirm async calls.

The reason there might have been a memory leak is that the apify default was allotting too much RAM (60GB when we only had 40GB). It had to be manually set to 30GB on prod. This only happens on production; it doesn't happen when running the application locally. Jacqueline to investigate. @RaiyanRahman will also take a look and advise if he thinks of anything. We still have a memory mystery to solve, but should know more after next week.
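
A minimal sketch of how that cap could be applied, assuming the crawler is started with the Apify SDK. `APIFY_MEMORY_MBYTES` is the SDK's documented environment variable for limiting the memory the autoscaler assumes is available; `crawl.js` is a placeholder entry point, not necessarily this repo's file:

```js
// Hedged sketch: cap the memory the Apify autoscaler assumes is available.
// 30000 MB mirrors the 30GB figure noted above; crawl.js is a placeholder entry point.
//
//   APIFY_MEMORY_MBYTES=30000 node crawl.js
//
// or set it in-process before the Apify SDK initializes its autoscaling:
process.env.APIFY_MEMORY_MBYTES = '30000';
```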

jacqueline-chan commented 3 years ago

We want to know how long one request or ten requests will take. We need some benchmarking; @amygaoo will code this up for us.

kstapelfeldt commented 3 years ago

@jacqueline-chan will run Cheerio. @RaiyanRahman will explore how we might manage multiple concurrent puppeteer instances. @amygaoo will also explore how we might manage multiple concurrent puppeteer instances.

kstapelfeldt commented 3 years ago

@jacqueline-chan got cheerio working, but the URLs it's crawling are not the URLs she expects it to crawl, so she is looking into that issue. It is much faster than puppeteer (it renders HTML only). She will troubleshoot with @RaiyanRahman to help in crawler selection.
@jacqueline-chan will restart puppeteer with max and min concurrency set; min concurrency should be set to 50. @RaiyanRahman will read up on how to add new instances of puppeteer and manage them. There are a couple of different ways to do it: we are currently using the apify SDK, which manages puppeteer for us, but we could also use puppeteer directly and manage it on our own. We need to test this. Also, if rendering is very slow, we can selectively render certain elements. Will look into selective rendering that blocks media loading (images and video); see the sketch below.
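
A rough sketch of both ideas together (explicit concurrency bounds plus request interception that skips heavy media), assuming the Apify SDK v1 API. The seed URL, the maxConcurrency value, and the hook style are illustrative; an older SDK version would use a custom gotoFunction instead of preNavigationHooks:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com/' }); // placeholder seed

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        minConcurrency: 50,   // value discussed above
        maxConcurrency: 100,  // assumed upper bound, not from this ticket
        preNavigationHooks: [
            async ({ page }) => {
                // Selective rendering: abort requests for images, video, and fonts.
                await page.setRequestInterception(true);
                page.on('request', (req) => {
                    const type = req.resourceType();
                    if (type === 'image' || type === 'media' || type === 'font') req.abort();
                    else req.continue();
                });
            },
        ],
        handlePageFunction: async ({ request, page }) => {
            // Extract and enqueue links, save results, etc.
        },
    });

    await crawler.run();
});
```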

kstapelfeldt commented 3 years ago

@jacqueline-chan restarted with max concurrency but saw no speed benefit. Nat suggested we should get to the crux of the issue with puppeteer, so she is writing tests to determine what is going on.

@jacqueline-chan and @RaiyanRahman did get Cheerio working but then it stopped. We need to run tests for both.

Two streams are being pursued: 1 is tests/configs for the Puppeteer crawler (@RaiyanRahman) and 2 is tests for Cheerio (@jacqueline-chan).

@jacqueline-chan will share the written tests with @RaiyanRahman and @AlAndr04, as they should apply to both crawlers. @AlAndr04 will also look at why the puppeteer crawler is not managing resources and concurrency as it should.

kstapelfeldt commented 3 years ago

@RaiyanRahman & @jacqueline-chan Two issues remain: the crawl is too slow, and it is not getting links back (only single links)

  1. Hard-code https:// in puppeteer & cheerio and restart the crawl to see if this resolves the problem of bringing back only single links.
  2. Fold in selective rendering code for puppeteer (after it's complete); we still need to be able to better block video.
  3. Look into manual queueing - do we need to modify it in order to resolve the issue? Also: run the tests written by @jacqueline-chan on puppeteer/cheerio to test speed and bring back stats.
jacqueline-chan commented 3 years ago
  1. Look into manual queueing - do we need to modify it in order to resolve the issue? Also: run the tests written by @jacqueline-chan on puppeteer/cheerio to test speed and bring back stats.

This third task (item 3 in the previous comment) is completed and is now in testing. While discussing with @RaiyanRahman, we determined that solving task #3 will consequently solve the issue that task #1 was meant to fix, and therefore task #1 would be redundant.

Manually enqueuing links has caused an issue where the crawler periodically stops crawling and needs to be manually restarted after crawling a couple thousand links. Will need to debug this some more and fix the bug.
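
For reference, a minimal sketch of manual enqueueing with the Apify request queue, roughly the pattern being debugged here. The same-domain filter and the `a[href]` selector are assumptions, not the repo's exact code:

```js
async function enqueueFoundLinks(page, requestQueue, currentUrl) {
    // Collect absolute hrefs from the rendered page.
    const hrefs = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href));
    const currentHost = new URL(currentUrl).hostname;
    for (const href of hrefs) {
        if (!href.startsWith('http')) continue;               // skip mailto:, javascript:, etc.
        if (new URL(href).hostname !== currentHost) continue;  // stay on the same domain
        // addRequest de-duplicates by uniqueKey (the URL by default).
        await requestQueue.addRequest({ url: href });
    }
}
```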

Planning to do a large crawl with cheerio this weekend or first thing Monday.

kstapelfeldt commented 3 years ago

@RaiyanRahman could not make today's meeting. @jacqueline-chan implemented number three. When manual queuing is implemented, the crawler does not function as it would using self-derived links, so there remains an implementation issue. How could we troubleshoot this? Nat suggests two ways forward: 1. contact the developers/github repo maintainers/community, or 2. determine how to mark the manually queued requests so they are identical to those from the internal queuing process. This is an issue with cheerio.

Get a site you know will fail. Write tests that focus on the queue.

https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

jacqueline-chan commented 3 years ago

@RaiyanRahman please join us in debugging the manual enqueuing issue

jacqueline-chan commented 3 years ago

Suggestions from Nat:

use the simplest puppeteer crawler to determine fails/edge cases

  1. How do we handle pop-ups / privacy prompts (accept or close)? (Use the user agent.) Find ways to recognize pop-ups for special cases per site -- avoid JavaScript pop-ups? (have users accept). Consider a library/extension. Touch base with Raiyan.
  2. paywalls -- we will probably have to use an API
  3. save asynchronously
  4. give batches
kstapelfeldt commented 3 years ago

@RaiyanRahman refactored selective loading and tested the solution. Will this work for pop-ups? Check out https://www.tubantia.nl/ as an example.

Delay/speed may be related to the process of saving files. We can't run asynchronously because we save files synchronously. Does our database solve this problem?

Notes: "I don't care about Cookies" doesn't work on Chrome, but the bigger issue is that we're using apify puppeteer (which makes it difficult to add extensions) - https://chrome.google.com/webstore/detail/i-dont-care-about-cookies/fihnjjcciajhdojfnbdddfaoknhalnja?hl=en A lot of people are asking puppeteer and apify to implement it, and there is a 'stealth' function that has been developed and could be looked into. @RaiyanRahman might be able to look into this a little further.

A for loop might be slowing us down too.

@jacqueline-chan:

Priority one: trying to solve the pop-up problem - try using @RaiyanRahman's code first (a new branch is already pushed).
Priority two: the speed problem - move all processing to a function for later so the crawler doesn't wait on it.
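
A minimal sketch of the "don't make the crawler wait on it" idea: queue the write and return immediately instead of awaiting the save inside the page handler. The pendingWrites bookkeeping and file naming are illustrative, not the repo's code:

```js
const fs = require('fs');
const path = require('path');

const pendingWrites = [];

function saveResultLater(outputDir, id, data) {
    const file = path.join(outputDir, `${id}.json`);
    // Fire-and-forget: the promise is tracked but not awaited by the crawler.
    const p = fs.promises
        .writeFile(file, JSON.stringify(data))
        .catch((err) => console.error(`failed to write ${file}:`, err));
    pendingWrites.push(p);
}

// Inside handlePageFunction:  saveResultLater(outputDir, request.id, extracted);
// Before the process exits:   await Promise.all(pendingWrites);
```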

@RaiyanRahman

Priority one: will stealth mode work in puppeteer (the "I don't care about cookies" extension)?
Priority two: based on collaboration with Jacqueline, do we need to handle pop-ups/cookies elsewhere in the code?

jacqueline-chan commented 3 years ago

As discussed with @RaiyanRahman, stealth mode and blocking requests for extra resources do not work.

Will need to manually click on each accept button for now; @RaiyanRahman will help look for a way to automate that.

Both of us are going to try to write scripts to manually accept the consent forms. I will make a list of the problematic links so far for us to test.
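
A hedged sketch of what such a script might look like in puppeteer: try a list of known consent-button selectors on the page and in its iframes. The selectors are examples only; each problematic site will likely need its own entry:

```js
const CONSENT_SELECTORS = [
    'button[title="Accept"]',
    'button[aria-label="Agree"]',
    '#onetrust-accept-btn-handler', // common consent-manager button id
];

async function tryAcceptConsent(page) {
    // Check the top-level document first, then any iframes (many banners live in one).
    for (const context of [page, ...page.frames()]) {
        for (const selector of CONSENT_SELECTORS) {
            const button = await context.$(selector).catch(() => null);
            if (button) {
                await button.click();
                return true;
            }
        }
    }
    return false;
}
```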

kstapelfeldt commented 3 years ago

@RaiyanRahman - looking into other extensions that might resolve the issue of pop-ups. Extensions will work with specific websites, but not generally. @Natkeeran advises that we develop a list of 10-15 sites with this problem and assess the problem for each of them: how do we identify the pop-up? How do we close it? (Put the answers into a .csv.) Based on this, we can make decisions about how to address it in code - are there any features that seem common to all of these sites?

@jacqueline-chan and @RaiyanRahman will split up target links and put into a sheets doc or something similar (and link in this ticket).

jacqueline-chan commented 3 years ago

Jacqueline

Raiyan

URLs with pop-ups:

URLs that exit immediately - problem unknown

URLs that sometimes work:

- [ ] https://truthout.org/

URLs that take a long time to load (but still ultimately crawl)

URL fixes:

- http://aljazeerah.info/ -> https://aljazeerah.info/
- http://buffalonews.com/ -> https://buffalonews.com/
- http://chinadaily.com.cn/ -> http://global.chinadaily.com.cn/
- http://dailynewsegypt.com/ -> https://dailynewsegypt.com/
- http://electronicintifada.net/ -> https://electronicintifada.net/

URLs that require sign in / human verification:

- [ ] https://bismarcktribune.com/

kstapelfeldt commented 3 years ago

@RaiyanRahman - inconsistent behaviour when trying to close pop-ups (solutions seem to work sometimes but not others), for example on https://derstandard.at/

@jacqueline-chan has to get the database up to get this data, but if she can't get it working she will have to restart and run for two days. She did look into one of the links which has a pop-up: it should have been really easy to click away and accept, but for some reason the crawler still cannot find the button to click. It can't search for the name of the button.

Todo

- @jacqueline-chan trying to retrieve the database from the preliminary crawl
- @jacqueline-chan and @RaiyanRahman trying to resolve URL issues
- @jacqueline-chan and @RaiyanRahman: we need a Google Sheet spreadsheet with all the problematic URLs to make collaboration and documentation easier.

jacqueline-chan commented 3 years ago

@jacqueline-chan @RaiyanRahman Some URLs that only get one hit are hit much more frequently when tested on their own (without any other domain in the queue). Therefore @RaiyanRahman would like to explore batching as an option for now, and also look into the possibility of using Puppeteer/Playwright directly, without apify, to give us more control over the queue mechanism.
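
An illustrative sketch of the batching idea: run the crawler over small groups of domains instead of the whole scope at once. chunk() and runCrawlerForBatch() are hypothetical helpers, not functions from this repo:

```js
function chunk(items, size) {
    const batches = [];
    for (let i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

async function crawlInBatches(domains, batchSize, runCrawlerForBatch) {
    for (const batch of chunk(domains, batchSize)) {
        // Each batch gets its own queue, so one slow domain cannot starve the rest.
        await runCrawlerForBatch(batch);
    }
}
```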

kstapelfeldt commented 3 years ago

@jacqueline-chan - yesterday she was able to retrieve the database! She will make a CSV/Google Sheet and copy it to the ticket. @jacqueline-chan and @RaiyanRahman are still trying to resolve URL issues. @Natkeeran reviewed a website that @jacqueline-chan was debugging and found that they are deliberately hiding the pop-up (to prevent crawling); this will be difficult. @RaiyanRahman found his site to be very inconsistent: 50% of the time a button could not be found.

@RaiyanRahman suggests we try running batches of websites to see if this improves behaviour. @jacqueline-chan will try running small batches manually to start, to see if this makes a difference prior to any code development. If this works, write the code. If this doesn't work, don't use apify and write our own queuing mechanism.

jacqueline-chan commented 3 years ago

CSV for the database. How I determine that a link most likely has a pop-up issue: it starts with https and only has a few hits.

Go to this site and click on "download csv":

http://199.241.167.146/

kstapelfeldt commented 3 years ago

Imported on the sheet here: https://docs.google.com/spreadsheets/d/1DJfiLT7XGL0XXttp8q0BKRZkn6swWAQ74gdlcaYG3CI/edit#gid=241833616

kstapelfeldt commented 3 years ago

@RaiyanRahman

  1. trying to automate batching in apify
  2. look for the content of the a tag instead of the title (see the sketch below).
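
A small cheerio sketch of that idea, assuming "title" here means the element's title attribute (or page title) rather than the anchor text; the html and baseUrl inputs are stand-ins for a fetched page:

```js
const cheerio = require('cheerio');

function extractLinks(html, baseUrl) {
    const $ = cheerio.load(html);
    const links = [];
    $('a[href]').each((_, el) => {
        const href = new URL($(el).attr('href'), baseUrl).href; // resolve relative links
        const text = $(el).text().trim();                        // anchor text, not the title
        links.push({ href, text });
    });
    return links;
}
```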

@jacqueline-chan to:

kstapelfeldt commented 3 years ago

Overall, we believe that we should explore running multiple instances on separate VMs concurrently, including Raiyan's improvements to the batching process.

kstapelfeldt commented 3 years ago

@RaiyanRahman is looking for an optimal balance between the number of batches and the page crawls per batch. The priority is an exhaustive crawl of individual domains if at all possible.

@jacqueline-chan has been pruning branches and working on Compute Canada instances and documentation (tasks above)

kstapelfeldt commented 3 years ago

@RaiyanRahman - looping through domains individually is the best approach. @todo - make the crawl more robust for subsequent crawls. Raiyan has found a sample implementation in the documentation that he is working on.

kstapelfeldt commented 3 years ago

@RaiyanRahman has completed a refactor of the queuing system, which runs locally (for the most part), but installing it on the Graham cloud raised specific issues:

  1. After two days of running, infinite loop
  2. Some domains only have a single page crawled and subsequent links are not added to the queue
  3. JSON not being returned for some crawled links
  4. Two specific domains did not have a results folder created
kstapelfeldt commented 3 years ago

Strategy for finding out more:

  1. Re-run crawl with only working domains to see if infinite loop problem persists.
  2. Install two more versions of the code base
  3. Run one new machine with a single problematic URL
  4. Run the second new machine using a new suite of URLs from the scope, and record what works and what is problematic
kstapelfeldt commented 3 years ago
  1. Re-run crawl with only working domains to see if infinite loop problem persists.
  2. Install two more versions of the code base
  3. Run one new machine with a single problematic URL
  4. Run the second new machine using a new suite of URLs from the scope, and record what works and what is problematic.

We ran the whole scope.

Equal importance was given to working domains. Most domains didn't work because of a memory issue - the call stack gets too big and takes things down. Tested alongside a local instance; the behaviour was actually the same. The NY Times was tested on a separate machine.

In 48 hours:

Next steps:

kstapelfeldt commented 3 years ago

Time notes: in 2 days the crawl covered almost 15,000 links, had over 120,000 links left in the queue, and created JSON files numbering in the mid-10,000s. Tried a couple of different naming conventions.

Raiyan has implemented a timestamp solution to stop JSON files from being overwritten. This has reduced the number of JSON files that are missed, and we anticipate that it means we won't have JSON overwrite problems going forward. For the remaining JSON issues, we'll need to implement a check after the data goes to the post-processor to see which found URLs did not result in JSON files.
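
A sketch of the timestamp idea for output filenames; the slugging and naming details are illustrative, not necessarily what Raiyan implemented:

```js
const path = require('path');

function jsonFileNameFor(url, outputDir) {
    // Turn the URL into a filesystem-safe slug and append a timestamp so two crawls
    // of the same page never collide on the same filename.
    const slug = url.replace(/[^a-z0-9]+/gi, '_').slice(0, 80);
    const stamp = new Date().toISOString().replace(/[:.]/g, '-');
    return path.join(outputDir, `${slug}_${stamp}.json`);
}
```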

The crawler still runs out of stack memory and then stops, and needs to be restarted. Raiyan is working on a mechanism for automating the restarting process. Raiyan will have a meeting next week with the dev team on the log memory issue.
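
A hedged sketch of one way such an automated restart could look: supervise the crawler process and relaunch it when it exits unexpectedly. The crawl.js entry point and the 30-second delay are placeholders, not details from this ticket:

```js
const { spawn } = require('child_process');

function runWithRestart(script, args = []) {
    const child = spawn('node', [script, ...args], { stdio: 'inherit' });
    child.on('exit', (code) => {
        if (code !== 0) {
            console.error(`crawler exited with code ${code}, restarting in 30s`);
            setTimeout(() => runWithRestart(script, args), 30000);
        }
    });
}

runWithRestart('crawl.js');
```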

Raiyan will spend time understanding Compute Canada resources so we know how much space is available for storing data.

kstapelfeldt commented 3 years ago

Compute Canada: there are lots of hidden folders with space associated with them. We should check these folders before setting up an instance.

12,000 links after 24 hours: 8.5 links per minute on average, and faster (15 links per minute) when there is no issue. Did some refactoring per Nat's suggestion. The crawl runs to the 24-hour mark and then has trouble opening new pages in the browser; after 24 hours this happens frequently. Added timestamps, and that made the debug file a lot easier to read. If we prevent the new-page timeout issue, this will speed things up a lot and take care of a lot of our issues. The stealth crawler might cause problems. JSON files are now all being created!! - a script counts all pages crawled and it matches.
12000 after 24 hours. 8.5 links per minute and faster (15 links per minute) when there is no issue. Did some refactoring per Nat's suggestion. Crawls to 24 hour mark and then has trouble opening new pages in browser. After 24 hours it happens frequently. Added timestamps and that made debug file a lot easier. If we prevent new page timeout issue, this will speed up a lot and take care of a lot of our issues. Stealth crawler might cause problems. JSON files are now all being created!! - script counts all pages crawled and it matches.