Wazbat / neos-website-archive

Archive of the neos.com website

Local friendly version of the website scrape takes too long #3

Open Wazbat opened 2 years ago

Wazbat commented 2 years ago

For some reason one of the scrape jobs takes ~17 minutes to complete, while the raw HTML scrape takes 15 seconds. Not sure what's causing this, but it's likely something in the httrack command. Might be a download speed limit, or it's scraping too deep?
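
For reference, this is roughly the kind of thing I'd expect to need tuning. It's only a sketch (the output path and filters are placeholders, not the actual command from the workflow), but httrack does ship with conservative built-in rate and connection limits, and it follows links to whatever depth it's configured for, so either one could explain a ~17 minute run:

```sh
# Rough sketch only, not the actual command from the workflow.
#   -O    output directory (placeholder path)
#   -r3   cap the recursion depth so it doesn't crawl too deep
#   -c8   allow up to 8 simultaneous connections
#   --disable-security-limits   lift httrack's default bandwidth/connection caps
#   "-*.zip" "-*.mp4"           example filters to skip large binary assets
httrack "https://neos.com/" -O ./scrape-local -r3 -c8 \
  --disable-security-limits "-*.zip" "-*.mp4"
```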

lodenrogue commented 2 years ago

How long does the actual download of the pages take when you're doing raw html scraping?

Is it possible to split the scraping into two steps?

  1. Download the pages you're interested in
  2. Scrape the data from these downloaded pages

That way you can optimize them individually.
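
Just to sketch the idea (the paths and the build script here are made up, and the download tool could be whatever you already use for the raw scrape):

```sh
# Sketch of the two-step split; the paths and build-local-version.sh are
# hypothetical, and wget is just a stand-in for whichever downloader you use.

# Step 1: mirror the raw pages once
wget --mirror --no-parent --directory-prefix=./raw-pages "https://neos.com/"

# Step 2: build the local-friendly version from the files already on disk
./build-local-version.sh ./raw-pages ./local-site
```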

Wazbat commented 2 years ago

> How long does the actual download of the pages take when you're doing raw html scraping?
>
> Is it possible to split the scraping into two steps?
>
>   1. Download the pages you're interested in
>   2. Scrape the data from these downloaded pages
>
> That way you can optimize them individually.

Currently the scraping is already handled by two separate jobs. One of them just scrapes the raw HTML, which only takes 15 seconds, but it's the other httrack command that takes much, much longer. You can see how long each run takes in the Actions tab, as well as the commands used here. The scrape-local job is the one that takes a long time.

I think it's a simple misconfiguration with httrack, but I'm not familiar with the tool.