Open Wazbat opened 2 years ago
How long does the actual download of the pages take when you're doing raw HTML scraping?
Is it possible to split the scraping into two steps?
- Download the pages you're interested in
- Scrape the data from these downloaded pages
That way you can optimize them individually.
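To make the two-step idea concrete, here is a minimal sketch in Python. It is not how this repo does it: the function names (`download_pages`, `scrape_pages`), the `page_NNNN.html` file naming, and the choice to extract only the `<title>` are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


def fetch_url(url):
    """Default fetcher: a plain HTTP GET with a timeout."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_pages(urls, out_dir, fetch=fetch_url):
    """Step 1: save each page to disk without parsing anything."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        # Hypothetical naming scheme: page_0000.html, page_0001.html, ...
        path = out_dir / f"page_{index:04d}.html"
        path.write_bytes(fetch(url))
        saved.append(path)
    return saved


class TitleParser(HTMLParser):
    """Collects the text inside <title>, as a stand-in for real scraping."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape_pages(in_dir):
    """Step 2: parse the saved files only; no network access here."""
    results = {}
    for path in sorted(in_dir.glob("*.html")):
        parser = TitleParser()
        parser.feed(path.read_text(encoding="utf-8", errors="replace"))
        results[path.name] = parser.title.strip()
    return results
```

Because step 2 only reads from disk, you can time and rerun it without downloading again, which should make it easier to tell whether the slow part is the download or the parsing.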
Currently the scraping is already handled by two separate jobs. One of them just scrapes the raw HTML, which only takes 15 seconds, but the other one, which runs an httrack command, takes much longer. You can see how long each run takes in the Actions tab, as well as the commands used here. The scrape-local job is the one that takes a long time.
I think it's a simple misconfiguration of httrack, but I'm not familiar with the tool.
For some reason one of the scrape jobs takes ~17 minutes to complete, while the raw HTML scrape takes 15 seconds. Not sure what's causing this, but it's likely part of the httrack command. It might be a download speed limit, or maybe it's crawling too deep?
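One way to test both guesses is to rerun httrack with an explicit depth and rate and compare timings. The sketch below builds such a command in Python. According to the HTTrack manual, `-rN` sets the mirror depth and `-AN` sets the maximum transfer rate in bytes per second. The helper names, the example URL, and the values are placeholders, not the settings this repo uses.

```python
import subprocess
import time


def build_httrack_cmd(url, out_dir, depth=None, max_rate=None):
    """Build an httrack argument list with optional depth and rate settings."""
    cmd = ["httrack", url, "-O", out_dir]
    if depth is not None:
        # -rN: mirror depth (how many link levels to follow)
        cmd.append(f"-r{depth}")
    if max_rate is not None:
        # -AN: maximum transfer rate in bytes per second
        cmd.append(f"-A{max_rate}")
    return cmd


def timed_run(cmd):
    """Run a command and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


# Example (placeholder values, requires httrack to be installed):
# elapsed = timed_run(build_httrack_cmd("https://example.com", "mirror", depth=2))
```

If a shallow depth finishes quickly, the crawl depth is the likely culprit. If the time barely changes but grows when the rate is lowered, the bandwidth cap is worth a closer look.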