Issue: The crawler is crawling too slow, look for solutions to increase performance

jacqueline-chan commented 3 years ago

Needs investigation some leads:

kubernetes ?
double check if something is holding up the async calls in the crawler
more servers?
different way of submitting batches/multithreading

kstapelfeldt commented 3 years ago

@jacqueline-chan is going into the code to confirm async calls.

The reason why there might have been a memory leak is because apify default was setting too much RAM (60GB when we only had 40GB). Had to manually be set to 30 on prod. This only happens on production. It doesn't happen when running the application locally. Jacqueline to investigate. @RaiyanRahman will also take a look to advise if anything he thinks of. We still have a memory mystery to solve, but should know more after next week.

jacqueline-chan commented 3 years ago

We want to know how long one request will take or ten requests will take. Need some benchmarking @amygaoo will code this up for us

kstapelfeldt commented 3 years ago

@jacqueline-chan will run Cheerio @RaiyanRahman will explore how we might manage multiple concurrent puppeteer instances @amygaoo will also explore how we might manage multiple concurrent puppeteer instances

kstapelfeldt commented 3 years ago

@jacqueline-chan got cheerio working but the URLs it's crawling are not the URLs she expects it to crawl, so she is looking into that issue. It is much faster than puppeteer (rendering html only). Will troubleshoot with @RaiyanRahman to help in crawler selection.
@jacqueline-chan will restart puppeteer with setting max and min concurrency. Min concurrency should be set to 50. @RaiyanRahman read up how to add new instances of puppeteer and manage them. There are couple of different ways to do it. We are currently using apify sdk which manages puppeteer. However, we can use puppeteer directly and manage it on our own. We need to test this. Also, if the rendering is very slow we can selectively render certain elements. Will look into selective rendering that blocks media loading (images and video).

kstapelfeldt commented 3 years ago

@jacqueline-chan restarted with max concurrency but no speed benefit. Nat suggested we should get to the crux of the issue for puppeteer so she is writing tests to determine what is going on.

@jacqueline-chan and @RaiyanRahman did get Cheerio working but then it stopped. We need to run tests for both.

Two streams are being pursued: 1 is tests/configs for the Puppeteer crawler (@RaiyanRahman) and 2 is tests for Cheerio (@jacqueline-chan).

@jacqueline-chan will share written tests with @RaiyanRahman and @AlAndr04 as they should be able to be applied to both crawlers. @AlAndr04 will also look at the issue of why the puppeteer crawler is not managing resources and concurrency as it should.

kstapelfeldt commented 3 years ago

@RaiyanRahman & @jacqueline-chan Two issues remain: too slow, and not getting links back (only single links)

hard-code https:// in puppeteer & cheerio and restarting crawl to see if this resolves the problem of bringing back only single links.
fold in selective rendering code for puppeteer (after it's complete) (still need to be able to better block video)
Look into manually queueing - do we need to modify in order to resolve issue? Also: Run tests written by @jacqueline-chan on puppeteer/cheerio to test speed and bring back stats

jacqueline-chan commented 3 years ago

Look into manually queueing - do we need to modify in order to resolve issue? Also: Run tests written by @jacqueline-chan on puppeteer/cheerio to test speed and bring back stats

This third part is completed and is now in testing. While discussing with @RaiyanRahman, we determined that solving task#3 will consequently solve the issue that task#1 was meant to fix and therefore task#1 would be redundant.

Manually enqueuing links has caused an issue of where the crawler does stop periodically crawling and needs to be manually restarted after crawling a couple thousand links. Will need to debug some more + fix this bug

Planning to do a large crawl with cheerio this weekend or first thing monday

kstapelfeldt commented 3 years ago

@RaiyanRahman could not make today's meeting. @jacqueline-chan implemented number three. When manual queuing is implemented, the crawler does not function as it would using self-derived links. There remains an implementation issue. How could we troubleshoot this? Nat suggests two ways forward: 1. contact developers/github repo maintainers/community or 2. determine how to mark the manual queuing so it is identical to the internal queuing process. This is an issue with cheerio.

Get site you know will fail Write tests that focus on the queue.

https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

jacqueline-chan commented 3 years ago

@RaiyanRahman please join us in debugging the manually enqueuing issue

jacqueline-chan commented 3 years ago

Suggestions from Nat:

use the simplest puppeteer crawler to determine fails/edge cases

how do we handle pop ups / privacy issues (Accept or close) (Use user agent). Ways to recognize pop ups for special cases per site. -- avoid javascript ?? pop up? (have users accept). Have a library/extension. Touch base with Raiyan.
paywalls -- probably have to use api
save asynchronously
give batches

kstapelfeldt commented 3 years ago

@RaiyanRahman refactored selective loading and tested the solution. Will this work for pop-ups? Check out: https://www.tubantia.nl/ is an example.

Delay/speed may be related to the process of saving files. We can't run asynchronously because we synchronously save files. Does our database solve this problem?

Notes: "I don't care about Cookies" doesn't work on Chrome, but the bigger issue is that we're using apify puppeteer (which makes it difficult to add extensions) - https://chrome.google.com/webstore/detail/i-dont-care-about-cookies/fihnjjcciajhdojfnbdddfaoknhalnja?hl=en A lot of people are asking puppeteer and apify to implement it, and there is a 'stealth' function that has been developed and could be looked into. @RaiyanRahman might be able to look into this a little further.

For loop might be slowing us down too.

Still focusing on puppeteer first

@jacqueline-chan:

Priority one: Trying to solve pop-up problem - try using @RaiyanRahman's code first (new branch already pushed) Priority two: Speed problem - move all processing to a function for later so the crawler doesn't wait on it.

@RaiyanRahman

Priority one: will stealth mode work in puppeteer (I don't care about cookies extension) Priority two: based on collaboration with Jacqueline, do we need to handle pop-ups/cookies elsewhere in the code?

jacqueline-chan commented 3 years ago

As discussion with @RaiyanRahman, stealth mode and blocking request for extra resources does not work.

Will need to manually click on each accept button for now, @RaiyanRahman will help look for a way to automate that.

Both of us is going to try and write scripts to manually accept the consent forms. I will make a list of problematic links so far for us to test.

kstapelfeldt commented 3 years ago

@RaiyanRahman - looking into other extensions that might resolve the issue of pop-ups. Extensions will work with specific websites, but not generally. @Natkeeran advises that we develop a list of 10-15 sites with this problem and assess the problem for these sites: How to identify? How to close? (put into a .csv). Based on this, we can make decisions about how to address in code - are there any features that seem common to all of these sites?

@jacqueline-chan and @RaiyanRahman will split up target links and put into a sheets doc or something similar (and link in this ticket).

jacqueline-chan commented 3 years ago

Jacqueline

[ ] https://www.tubantia.nl/

Raiyan

[ ] https://derstandard.at/

URLs with pop ups:

[ ] https://www.tubantia.nl/
[ ] https://bismarcktribune.com/
[ ] https://derstandard.at/
[ ] https://www.economist.com/ (kinda works)
[ ] https://www.dailybulletin.com/ (kinda works)
[ ] https://truthout.org/

URLs that exits immediately - problem unknown

[ ] http://bernama.com/en/ -> https://www.baltimoresun.com/
[ ] http://calcalist.co.il/

URLs that sometimes work -[ ] https://truthout.org/

URLS that take a long time to load (but still ultimately crawls)

URL fixes http://aljazeerah.info/ -> https://aljazeerah.info/ http://buffalonews.com/ - > https://buffalonews.com/ http://chinadaily.com.cn/ -> http://global.chinadaily.com.cn/ http://dailynewsegypt.com/ -> https://dailynewsegypt.com/ http://electronicintifada.net/ -> https://electronicintifada.net/

URLs that require sign in / human verification -[ ] https://bismarcktribune.com/

kstapelfeldt commented 3 years ago

@RaiyanRahman - inconsistent behaviour when trying to close pop-ups (solutions seem to work sometimes but not others) https://derstandard.at/ -

@jacqueline-chan has to get database up to get this data. But if she can't get it working will have to restart and run for two days. She did look into one of the links which has a pop-up and it should have been really easy to click away and click accept but for some reason it still cannot find the button to click away. Can't search for name of button.

Todo

@jacqueline-chan trying to retrieve database from preliminary crawl @jacqueline-chan and @RaiyanRahman trying to resolve URL issues @jacqueline-chan and @RaiyanRahman we need a Google Sheet spreadsheet with all the problematic URLs to make collaboration and documentation easier.

jacqueline-chan commented 3 years ago

@jacqueline-chan @RaiyanRahman Some url links that only has one hit, hit much more frequently when it is tested on its own (without any other domain in the queue). Therefore @RaiyanRahman would like to explore batching as an option for now. and also look into the possibility of using just using Puppeteer/Playwright without apify to give us more control on the queue mechanism

kstapelfeldt commented 3 years ago

@jacqueline-chan - yesterday she was able to retrieve the database! She will make a CSV/Google Sheet - copy/ticket @jacqueline-chan and @RaiyanRahman trying to resolve URL issues - @Natkeeran reviewed a website that @jacqueline-chan was debugging and found that they are deliberately hiding pop-up (to prevent crawling). This will be difficult. @RaiyanRahman found his site to be very inconsistent. 50% of the time a button could not be found.

@RaiyanRahman suggests we try running batches of websites to see if this improves behaviour. @jacqueline-chan will try running small batches manually to start to see if this makes a difference prior to any code development. If this works, write code. If this doesn't work, don't use apify and write own queuing mechanism.

jacqueline-chan commented 3 years ago

CSV for the database. How I determine that a link is most likely a pop up issue is if it starts with https and only has a few hits

go to this site and click on download csv

http://199.241.167.146/

kstapelfeldt commented 3 years ago

Imported on the sheet here: https://docs.google.com/spreadsheets/d/1DJfiLT7XGL0XXttp8q0BKRZkn6swWAQ74gdlcaYG3CI/edit#gid=241833616

kstapelfeldt commented 3 years ago

@RaiyanRahman

trying to automate batching in apify
look for content of the a tag instead of title.

@jacqueline-chan to:

[ ] 1. look and see if there are other manual queues already written that we might use (to swap out apify)
[x] 2. send notification if apify crashes (find code and push)
[x] 3. Output list of problematic sites. We need some way to spit out a file of URLs that don't get crawled, as we're going to have some in the end.
[x] 4. bug fix, take in the full scope, but only run the batch according to the batch id
[x] 5. run ny times as a batch and twitter crawler

kstapelfeldt commented 3 years ago

[ ] @RaiyanRahman did work in batching system for puppeteer using apify, but needs to mark batches at done. He starts by splitting the scope into smaller batches and runs them in succession. Once one is done, he begins the next. He has more than 15 hours left and believes this can be done in that time.

Overall, we believe that we should explore running multiple instances on separate VMs concurrently, including Raiyan's improvements to the batching process.

[ ] @jacqueline-chan to find the documentation, update, and pass it to @kstapelfeldt for the new co-op student. Current documentation is here: https://github.com/UTMediaCAT/mediacat-backend/tree/master/commandline
[ ] will need a build script?
- [x] mediacat_hidden
- [x] mediacat_domain_crawler
- [ ] mediacat-frontend
- [ ] mediacat-backend
- [ ] mediacat-twitter-crawler
- [ ] mediacat-twitter-API
[ ] @jacqueline-chan to contact compute-canada about imaging existing machines to replicate them, which is currently not working (CC Alejandro and Kirsta)
- [x] email sent
- [ ] follow up to email
- [ ] debug problem
- [ ] backup current instance
- [ ] try to make 5 min instances
[x] @jacqueline-chan and @RaiyanRahman to have a meeting and merge everything possible to master. prune branches, please.

kstapelfeldt commented 3 years ago

@RaiyanRahman is seeking an optimized point for batches and page crawls per batch. Priority is an exhaustive crawl of individual domains if at all possible.

@jacqueline-chan been pruning branches and working on compute canada instances and documentation (tasks above)

kstapelfeldt commented 3 years ago

@RaiyanRahman - Looping through domains individually is the best approach. @todo - Making the crawl more robust for subsequent crawls. Raiyan has found a sample implementation in documentation that he is working on.

kstapelfeldt commented 3 years ago

@RaiyanRahman has completed a refactor of the queuing system which runs locally (for the most part) but installing in Graham cloud raised specific issues:

After two days of running, infinite loop
Some domains only have a single page crawled and subsequent links are not added to the queue
JSON not being returned for some crawled links
Two specific domains did not have a results folder created

kstapelfeldt commented 3 years ago

Strategy for finding out more:

Re-run crawl with only working domains to see if infinite loop problem persists.
Install two more versions of the code base
Run one new machine with a single problematic URL
Run the second new machine using a new suite of URLs from the scope, and record what works and what is problematic

kstapelfeldt commented 3 years ago

Re-run crawl with only working domains to see if infinite loop problem persists.
Install two more versions of the code base
Run one new machine with a single problematic URL
Run the second new machine using a new suite of URLs from the scope, and record what works and what is problematic ran the whole scope.

Equal importance was given to working domains. Most domains didn't work because of a memory issue - call stack gets too big and takes things down. Tested alongside a local instance. The behaviour was actually the same. NY times was tested on a separate machine.

In 48 hours:

40,421 Crawled Links
9,507 JSON
14 URLs a minute (low average)

Next steps:

Revise hash formula to guarantee unique IDS so that JSON is not overwritten (change naming convention)
Get a notification sent when the machine a) goes down b) thinks it's finished.
Try running in parallel in the same instance (use all machines to maximize use of time)
Keep problem domain spreadsheet updated: https://docs.google.com/spreadsheets/d/19O1l3A_qB9AqVCPHx1VCB7EyBOMgejx4lhbTukRvZJE/edit#gid=1391855376

kstapelfeldt commented 3 years ago

Raiyan implemented a different naming convention (UUID and millisecond timestamp). Pushed changes and updated Graham cloud. Seems to be better but still a misfit between # of JSON being produced and number of links crawled. Logs show when it happens, but still figuring out how to avoid issue.
Next step: Keep working to find out when infinite loop thing happens: monitor crawl? - Raiyan has a couple of ideas.

kstapelfeldt commented 3 years ago

Time notes: 2 days = almost 15,000 links and had over 120,000 links left in the queue, and created in the mid-10,000s for JSON crawls. Tried a couple of different naming conventions.

Raiyan has implemented a timestamp solution to stop JSON files being overwritten, and this has reduced the number of JSON files that are missed, and we anticipate that this solution means that we won't have JSON overwrite problems. For the JSON issues going forward, we'll need to implement a check after the data goes to the post-processor to see which found URLs did not result in JSON files.

The crawler still runs out of stack memory and then stops, and needs to be restarted. Raiyan is working on a mechanism for automating the restarting process. Raiyan will have a meeting next week with the dev team on the log memory issue.

Raiyan will spend time to understand compute canada resources so we understand how much space is available for storing data.

kstapelfeldt commented 3 years ago

Compute Canada: There are lots of folders hidden with space associated with them. We should check these folders before setting up an instance.

12000 after 24 hours. 8.5 links per minute and faster (15 links per minute) when there is no issue. Did some refactoring per Nat's suggestion. Crawls to 24 hour mark and then has trouble opening new pages in browser. After 24 hours it happens frequently. Added timestamps and that made debug file a lot easier. If we prevent new page timeout issue, this will speed up a lot and take care of a lot of our issues. Stealth crawler might cause problems. JSON files are now all being created!! - script counts all pages crawled and it matches.

Experiment to see what puppeteer configuration flag might be causing the new page timeout issue.
Consider stopping and restarting crawl with a separate script scheduled to fire when the issue of new page timeout appears.

UTMediaCAT / mediacat-domain-crawler

Issue: The crawler is crawling too slow, look for solutions to increase performance #19