EwoutH / shipping-data

A public collection of shipping data from South Africa to The Netherlands
GNU General Public License v3.0

Investigate Routescanner CI failure #52

Closed EwoutH closed 1 year ago

EwoutH commented 1 year ago

The Routescanner automated scraper failed once, and then twice more today. Investigate why.

EwoutH commented 1 year ago

It has been failing since 2022-10-26, the last successful run was 2022-10-25.

After testing locally, it seems the scraper no longer likes running headless. Setting headless to False fixes the issue locally; let's see if it also works in CI.
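For reference, the headless toggle can be as simple as a flag on the browser arguments. A minimal sketch, assuming a Chrome-based Selenium setup; the helper name and arguments here are illustrative, not the scraper's actual code:

```python
# Hedged sketch: building Chrome arguments with a headless toggle.
# build_chrome_args is a hypothetical helper, not the scraper's actual code.
def build_chrome_args(headless: bool) -> list:
    args = ["--disable-gpu", "--window-size=1920,1080"]
    if headless:
        # Newer Chrome versions use "--headless=new"; older ones plain "--headless".
        args.append("--headless=new")
    return args
```

Flipping the flag to False makes the browser window visible, which is what fixed the run locally.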

EwoutH commented 1 year ago

CI doesn't like to run non-headless, probably a limitation of the CI environment (there is no display). So there are two options:

  1. Figure out how to run non-headless in CI
  2. Fix the headless option

Both have their challenges, but option 2 can be quite difficult to debug: the failure only occurs in headless mode, where there is no visible browser to inspect.

EwoutH commented 1 year ago

The main problem is that Routescanner starts denying HTTP requests at some point. From that point on, we can either:

  1. wait a very long time (keeping a CI runner busy, and possibly hitting a timeout),
  2. try to appear as another client (using a VPN or proxy), or
  3. use a recursive action that spins up a new runner when requests start to be denied.

We basically need something that makes sure another machine token is returned (which happens here):

https://github.com/EwoutH/shipping-data/blob/69812633959f4f080efa7087d606dc81c2e7d479/webscrapers/routescanner_automated_v2.py#L44-L58
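The token retrieval in the linked snippet might look roughly like this. A sketch only: it assumes the endpoint returns JSON, and the field name `machineToken` is purely an assumption for illustration:

```python
import json
from urllib.request import Request, urlopen

def extract_machine_token(payload: dict) -> str:
    # "machineToken" is an assumed field name, not confirmed from the API.
    return payload["machineToken"]

def fetch_machine_token(url: str = "https://www.routescanner.com/home-vars") -> str:
    # A plain browser-like User-Agent header; the real scraper drives Selenium instead.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=30) as resp:
        return extract_machine_token(json.loads(resp.read().decode()))
```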

Using proxies could be the simplest implementation, since no new CI runners would have to be spun up and thus no recursion is needed. @averbraeck, have you worked with proxies or VPNs when webscraping before? If so, any tips?
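If proxies turn out to be the way to go, a simple rotation scheme could cycle to the next proxy whenever requests start being denied. A sketch with placeholder proxy addresses:

```python
import itertools

class ProxyRotator:
    """Cycle through a pool of proxies; rotate when requests get denied."""

    def __init__(self, proxies):
        self._pool = itertools.cycle(proxies)
        self.current = next(self._pool)

    def rotate(self):
        # Called after a denied request; returns the next proxy to use.
        self.current = next(self._pool)
        return self.current
```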

Since I only thought of using a VPN/proxies while formulating this comment, I had already thought out the recursive architecture. It could look like this:

EwoutH commented 1 year ago

@imvs95 Today I took another look at the Routescanner scraper and implemented a timeout condition that stops the scraper before the GitHub workflow times out at its maximum duration of 6 hours (commit 227a058).
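The timeout condition could look something like this. A sketch: the 30-minute safety margin is an assumption, not necessarily what commit 227a058 uses:

```python
import time

# Stop well before GitHub Actions kills the job at the 6-hour limit.
MAX_RUNTIME = 5.5 * 3600  # seconds; assumed 30-minute safety margin

def should_stop(start_time, now=None):
    """True once the scraper has run long enough that it should wind down."""
    if now is None:
        now = time.monotonic()
    return now - start_time >= MAX_RUNTIME
```

The scraper's main loop would check `should_stop(start_time)` between batches and exit cleanly instead of being killed mid-run.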

Unfortunately, the API request to https://www.routescanner.com/home-vars no longer works; we used it to get a machine token, and without one we can't make requests. My Routescanner account is blocked and I may also have a few IP bans, so investigating this further is not only impractical for me but also well past the ethical line I'm willing to cross.

@averbraeck I'm afraid that if we want Routescanner data, we'll have to go talk to them.

EwoutH commented 1 year ago

Closing this issue: the webscraper itself works, but API access is blocked. Getting Routescanner data is only feasible with their cooperation.

(CC @ivs)