calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

fix url on splash responses #49

Closed en-GB closed 2 years ago

en-GB commented 2 years ago

If Splash gets redirected while rendering a page, response.url is not updated. This breaks the Zalando scrapers, which use response.url to detect redirection. To fix this, the Splash Lua script needs to return the updated URL.
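
A minimal sketch of the idea, not the actual script from this PR: the Lua source below returns the post-redirect URL via splash:url() next to the rendered HTML, and a small Python helper prefers that URL over the requested one. The names LUA_SOURCE, final_url and was_redirected are hypothetical, and so are the example URLs. With scrapy-splash, a "url" key in the result of an "execute" request is, as far as I know, what ends up in response.url.

```python
# Hypothetical Lua script for Splash's "execute" endpoint.
# It returns the URL Splash ended up on, not only the rendered HTML.
LUA_SOURCE = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait or 0.5))
  return {
    url = splash:url(),   -- URL after any redirects
    html = splash:html(),
  }
end
"""


def final_url(requested_url: str, splash_result: dict) -> str:
    """Prefer the URL reported by the script; fall back to the requested one."""
    return splash_result.get("url") or requested_url


def was_redirected(requested_url: str, splash_result: dict) -> bool:
    """True if the script reports a URL different from the requested one."""
    return final_url(requested_url, splash_result) != requested_url
```

Without the url key in the returned table, final_url falls back to the requested URL, which is the behaviour that hides redirects.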

se-jaeger commented 2 years ago

This does not work for me. I only get SERP pages and no PRODUCT HTMLs. My wild guess is that it breaks the response.css method.

en-GB commented 2 years ago

I can't replicate that. It might be a bug in the zalando.de scraper. Accept-Language needs to be set to de. Otherwise the site redirects to en.zalando.de, which triggers the redirect check and aborts.
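
To illustrate the failure mode, here is a rough sketch of a host-based redirect check. It assumes the scraper compares the host of the requested URL with the host of response.url; the real check in the Zalando scraper may look different. DEFAULT_HEADERS and redirected_to_other_host are hypothetical names, and the URLs are made up.

```python
from urllib.parse import urlsplit

# Assumed default header; with "de" the site should stay on the German host.
DEFAULT_HEADERS = {"Accept-Language": "de"}


def redirected_to_other_host(requested_url: str, response_url: str) -> bool:
    """True if the response was served from a different host than requested."""
    return urlsplit(requested_url).hostname != urlsplit(response_url).hostname


# A request that lands on en.zalando.de counts as redirected.
redirected = redirected_to_other_host(
    "https://www.zalando.de/some-product.html",
    "https://en.zalando.de/some-product.html",
)
```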

se-jaeger commented 2 years ago

> Accept-Language needs to be set to de.

This should be the case, see: https://github.com/calgo-lab/green-db/blob/kvdd-splash-lua-fix/scraping/scraping/settings.py#L41-L43

However, this might interfere if a spider is interested in getting other languages.
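
One possible way around that, sketched under assumptions: keep de as the project-wide default and let a spider override Accept-Language for its own requests. The helper headers_for is hypothetical; in Scrapy the same effect could probably be had with a spider's custom_settings or per-request headers.

```python
from typing import Dict, Optional


def headers_for(default_headers: Dict[str, str],
                language: Optional[str] = None) -> Dict[str, str]:
    """Copy the default headers and optionally override Accept-Language."""
    headers = dict(default_headers)  # copy, so the shared defaults stay untouched
    if language is not None:
        headers["Accept-Language"] = language
    return headers


defaults = {"Accept-Language": "de"}
german = headers_for(defaults)        # spiders that want the German site
french = headers_for(defaults, "fr")  # a spider that wants another language
```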

en-GB commented 2 years ago

sure no worries