TeamHG-Memex / aquarium

Splash + HAProxy + Docker Compose
MIT License

Splash max-timeout doesn't seem to be used. #13

Closed: nehakansal closed this issue 6 years ago

nehakansal commented 6 years ago

Hi,

I checked that docker-compose.yml sets max-timeout to 3600 for each Splash instance, but I still keep getting the following error when I crawl using Undercrawler:

{"description":"Timeout exceeded rendering page", "error": 504, "type": "GlobalTimeoutError", "info":{"timeout": 30}}

From what I understand, this shouldn't happen, because the timeout value used here should be 3600. Is that correct?

Thanks, Neha.

lopuhin commented 6 years ago

max-timeout is the maximum allowed timeout value, but the default timeout stays at 30 s, I'm afraid. I don't see a timeout set when we make the Splash request: https://github.com/TeamHG-Memex/undercrawler/blob/master/undercrawler/spiders.py. I think raising the timeout in undercrawler to 90 seconds makes sense (as it's the default max timeout for recent Splash versions), and it might also be useful to expose it via settings so that it can be made larger.
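
To illustrate the difference between the two values, here is a minimal Python sketch of a request that passes `timeout` explicitly. It assumes Splash is reachable at `http://localhost:8050` and uses the `render.html` endpoint for simplicity; the helper name `splash_render_url` and the local check against `max_timeout` are only for illustration and are not part of aquarium or undercrawler.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, timeout=90, max_timeout=3600,
                      splash_base="http://localhost:8050"):
    """Build a Splash render.html URL with an explicit per-request timeout.

    `max_timeout` mirrors the server-side --max-timeout value (3600 in the
    docker-compose.yml discussed here). It is only an upper bound: when
    `timeout` is left out of the request, Splash falls back to its 30 s default.
    """
    if timeout > max_timeout:
        # Catch the mistake locally instead of letting the server reject it.
        raise ValueError(
            f"timeout {timeout} exceeds max-timeout {max_timeout}")
    query = urlencode({"url": page_url, "timeout": timeout})
    return f"{splash_base}/render.html?{query}"


print(splash_render_url("http://example.com", timeout=90))
```

Without the `timeout` parameter in the query string, the request would be cut off after 30 seconds no matter how large max-timeout is.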

Closing this issue here as it's related to undercrawler.

nehakansal commented 6 years ago

Okay, thanks for clarifying that. Adding it as a setting to Undercrawler would definitely be helpful. Do you plan to open an issue for it in Undercrawler? In the meantime, is adding timeout to the splash_args in the Undercrawler code the only fix if I want to run some tests with a higher timeout value? Thank you.

lopuhin commented 6 years ago

@nehakansal I think that adding this to the undercrawler code is the only way to fix this. I'm not working on it at the moment, but I'll be happy to merge a pull request and help with the implementation. The timeout value in the Splash API is documented here: http://splash.readthedocs.io/en/stable/api.html#execute, and this is the place to add it: https://github.com/TeamHG-Memex/undercrawler/blob/7d4f21520a9770eb94641420625b927d92537a29/undercrawler/spiders.py#L63-L69
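
A rough sketch of what such a change could look like, in Python. The setting name `SPLASH_TIMEOUT`, the helper `with_timeout`, and the example argument dict are hypothetical and are not taken from the undercrawler code. The `settings` object only needs a `.get(name, default)` method, so a plain dict works for trying it out.

```python
DEFAULT_SPLASH_TIMEOUT = 90  # seconds; the value suggested earlier in this thread


def with_timeout(splash_args, settings):
    """Return a copy of the Splash arguments with a `timeout` entry added.

    `SPLASH_TIMEOUT` is a hypothetical setting name. The value still has to
    stay at or below the server's max-timeout (3600 in docker-compose.yml).
    """
    args = dict(splash_args)  # copy, so the caller's dict is left untouched
    args["timeout"] = int(settings.get("SPLASH_TIMEOUT", DEFAULT_SPLASH_TIMEOUT))
    return args


# Placeholder arguments; the real ones are built in spiders.py.
base_args = {"lua_source": "-- lua script goes here"}
print(with_timeout(base_args, {"SPLASH_TIMEOUT": 600}))
```

Keeping the default in one constant means the hard-coded value and the setting cannot drift apart.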

nehakansal commented 6 years ago

Thank you. That's what I figured. I've added the timeout argument to the code locally for now.