Closed EwoutH closed 1 year ago
It has been failing since 2022-10-26; the last successful run was on 2022-10-25.
After testing locally, it seems that it doesn't like running headless anymore. Setting headless to False
fixes the issue locally; let's see if it also works in CI.
CI doesn't like to run headless. Probably a software limitation of the CI environment. So two options:
Both have their challenges, but option 2 can be quite difficult, since headless works, so you can't use it for debugging.
The main problem is that Routescanner starts denying HTTP requests at some point. From that point, we can either wait a very long time (keeping a CI runner busy, and possibly timing out), try to appear as another instance (using a VPN or proxy), or use a recursive action that spins up a new runner when requests start to be denied.
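For the first option (waiting), a minimal sketch of a retry loop with exponential backoff could look like the following. Everything here is an assumption for illustration: `fetch` is a hypothetical callable returning a `(status_code, body)` tuple, and treating 403 and 429 as "denied" is a guess, since the thread does not say how Routescanner signals a denial.

```python
import time

# Assumption: these status codes mean "request denied"; the real ones are unknown.
DENIED_STATUSES = (403, 429)


def fetch_with_backoff(fetch, max_attempts=5, base_delay=60.0,
                       max_delay=3600.0, sleep=time.sleep):
    """Call fetch() until it is no longer denied or attempts run out.

    fetch is a hypothetical callable returning (status_code, body).
    Returns the last (status_code, body) seen.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in DENIED_STATUSES:
            return status, body
        # Wait before the next attempt, doubling the delay up to max_delay.
        if attempt < max_attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return status, body
```

Capping both the delay and the number of attempts keeps the total wait bounded, which matters because a waiting loop keeps the CI runner busy.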
We basically need something that makes sure another machine token is returned (which happens here).
Using proxies could be the simplest implementation, since no new CI runners have to be spun up and thus no recursiveness is needed. @averbraeck, have you worked with proxies or VPNs when webscraping before? If so, any tips?
Since I only thought about using a VPN or proxies while formulating this comment, I had already thought out the recursive architecture. It could look like this:
@imvs95 Today I took another look at the Routescanner scraper and implemented a timeout condition that stops the scraper before the GitHub workflow times out after the maximum duration of 6 hours (commit 227a058).
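As an illustration of such a timeout condition (not the code from commit 227a058, which is not shown here), a minimal sketch might check a deadline before each unit of work. The function name, the 15-minute safety margin, and the `scrape_one` callable are all hypothetical.

```python
import time

# The 6-hour job limit comes from the comment above; the margin is an assumed value.
MAX_JOB_SECONDS = 6 * 60 * 60
SAFETY_MARGIN_SECONDS = 15 * 60


def scrape_until_deadline(tasks, scrape_one,
                          budget_seconds=MAX_JOB_SECONDS - SAFETY_MARGIN_SECONDS,
                          clock=time.monotonic):
    """Run scrape_one over tasks, stopping cleanly once the time budget is spent.

    Returns (results, remaining_tasks) so unfinished work can be resumed later.
    """
    deadline = clock() + budget_seconds
    results = []
    remaining = list(tasks)
    while remaining:
        # Check the deadline before starting each task, not in the middle of one.
        if clock() >= deadline:
            break
        task = remaining.pop(0)
        results.append(scrape_one(task))
    return results, remaining
```

Returning the unprocessed tasks lets a later run pick up where this one stopped, and the margin leaves time to save results before the workflow is killed.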
Unfortunately, the API request to https://www.routescanner.com/home-vars, which we used to get a machine token, doesn't work anymore. Without that token we can't make requests. My Routescanner account is blocked and I might also have a few IP bans, so for me investigating this further is not only very impractical but also well past the ethical line I am willing to cross.
@averbraeck I'm afraid, if we want Routescanner data, we have to go talk to them.
Closing this issue, since the webscraper itself works but API access is blocked. Scraping Routescanner is only feasible with their cooperation.
(CC @ivs)
The Routescanner automated scraper failed once and twice today. Investigate why.