EwoutH / shipping-data

A public collection of shipping data from South Africa to The Netherlands
GNU General Public License v3.0

Investigate Routescanner CI failure #52

Closed EwoutH closed 1 year ago

EwoutH commented 1 year ago

The Routescanner automated scraper failed once, and then twice more today. Investigate why.

EwoutH commented 1 year ago

It has been failing since 2022-10-26, the last successful run was 2022-10-25.

After testing locally, it seems the scraper no longer likes running headless. Setting headless to False fixes the issue locally; let's see if it also works in CI.
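For reference, the headless toggle can be as simple as a flag on the browser arguments. A minimal sketch, assuming a Chrome-based Selenium setup; the helper name and arguments here are illustrative, not the scraper's actual code:

```python
# Hedged sketch: building Chrome arguments with a headless toggle.
# build_chrome_args is a hypothetical helper, not the scraper's actual code.
def build_chrome_args(headless: bool) -> list:
    args = ["--disable-gpu", "--window-size=1920,1080"]
    if headless:
        # Newer Chrome versions use "--headless=new"; older ones plain "--headless".
        args.append("--headless=new")
    return args
```

Flipping the flag to False makes the browser window visible, which is what fixed the run locally.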

EwoutH commented 1 year ago

CI doesn't like to run non-headless, probably a limitation of the CI environment (there is no display). So there are two options:

  1. Figure out how to run non-headless in CI
  2. Fix the headless option

Both have their challenges, but option 2 can be quite difficult to debug: the failure only occurs in headless mode, where there is no visible browser to inspect.

EwoutH commented 1 year ago

The main problem is that Routescanner starts denying HTTP requests at some point. From that point on, we can either:

  1. wait a very long time (keeping a CI runner busy, and possibly hitting a timeout),
  2. try to appear as another client (using a VPN or proxy), or
  3. use a recursive action that spins up a new runner when requests start to be denied.

We basically need something that makes sure another machine token is returned (which happens here):

https://github.com/EwoutH/shipping-data/blob/69812633959f4f080efa7087d606dc81c2e7d479/webscrapers/routescanner_automated_v2.py#L44-L58
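The token retrieval in the linked snippet might look roughly like this. A sketch only: it assumes the endpoint returns JSON, and the field name `machineToken` is purely an assumption for illustration:

```python
import json
from urllib.request import Request, urlopen

def extract_machine_token(payload: dict) -> str:
    # "machineToken" is an assumed field name, not confirmed from the API.
    return payload["machineToken"]

def fetch_machine_token(url: str = "https://www.routescanner.com/home-vars") -> str:
    # A plain browser-like User-Agent header; the real scraper drives Selenium instead.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=30) as resp:
        return extract_machine_token(json.loads(resp.read().decode()))
```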

Using proxies could be the simplest implementation, since no new CI runners would have to be spun up and thus no recursion is needed. @averbraeck, have you worked with proxies or VPNs when webscraping before? If so, any tips?
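If proxies turn out to be the way to go, a simple rotation scheme could cycle to the next proxy whenever requests start being denied. A sketch with placeholder proxy addresses:

```python
import itertools

class ProxyRotator:
    """Cycle through a pool of proxies; rotate when requests get denied."""

    def __init__(self, proxies):
        self._pool = itertools.cycle(proxies)
        self.current = next(self._pool)

    def rotate(self):
        # Called after a denied request; returns the next proxy to use.
        self.current = next(self._pool)
        return self.current
```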

Since I only thought of using a VPN/proxies while formulating this comment, I had already thought out the recursive architecture. It could look like this:

EwoutH commented 1 year ago

@imvs95 Today I took another look at the Routescanner scraper and implemented a timeout condition that stops the scraper before the GitHub workflow times out at its maximum duration of 6 hours (commit 227a058).
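The timeout condition could look something like this. A sketch: the 30-minute safety margin is an assumption, not necessarily what commit 227a058 uses:

```python
import time

# Stop well before GitHub Actions kills the job at the 6-hour limit.
MAX_RUNTIME = 5.5 * 3600  # seconds; assumed 30-minute safety margin

def should_stop(start_time, now=None):
    """True once the scraper has run long enough that it should wind down."""
    if now is None:
        now = time.monotonic()
    return now - start_time >= MAX_RUNTIME
```

The scraper's main loop would check `should_stop(start_time)` between batches and exit cleanly instead of being killed mid-run.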

Unfortunately, the API request to https://www.routescanner.com/home-vars no longer works; we used it to get a machine token, and without one we can't make requests. My Routescanner account is blocked and I may also have a few IP bans, so investigating this further is not only impractical for me but also well past the ethical line I'm willing to cross.

@averbraeck I'm afraid that if we want Routescanner data, we'll have to go talk to them.

EwoutH commented 1 year ago

Closing this issue: the webscraper itself works, but API access is blocked. Getting Routescanner data is only feasible with their cooperation.

(CC @ivs)