andreburgaud / robotspy

Alternative robots parser module for Python
https://pypi.org/project/robotspy/
MIT License

Timeout? #211

Closed ivbeg closed 1 week ago

ivbeg commented 1 month ago

Hi! from_uri sometimes hangs forever because there is no timeout. I tried robots.RobotsParser.from_uri("https://earthworks.stanford.edu/robots.txt") and it hangs. A default timeout of 10 seconds should be good enough, or it would be great if a timeout parameter could be passed to the from_uri function.

andreburgaud commented 1 month ago

Hi @ivbeg, let me check and address this problem as soon as possible. Thanks a lot for raising this issue!

andreburgaud commented 1 month ago

Hi @ivbeg, please confirm which Robotspy version you are using. Thank you 😊

ivbeg commented 1 month ago

@andreburgaud Hi! Version 0.10.0

andreburgaud commented 1 month ago

Thank you @ivbeg! I'm on it. I will make sure to keep you posted.

borisiskra commented 1 month ago

Something similar happens with robots.RobotsParser.from_uri("http://22-lr.forumactif.com/robots.txt"). I can download the robots.txt file, but if I pass it to robots.RobotsParser.from_string(robots_downloaded_file), it also hangs forever. Note: the file size is 20624 chars.
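
The download-then-parse approach described above can be sketched as follows. This is only an illustration, not robotspy code: `download_robots_txt` and its `max_bytes` parameter are hypothetical names, and the download uses the standard library's `urllib.request.urlopen`, whose `timeout` argument applies to blocking network operations. It bounds the network step only; it does nothing about a hang inside the parser itself.

```python
import urllib.request


def download_robots_txt(url, timeout=10.0, max_bytes=512 * 1024):
    """Fetch a robots.txt file as text (hypothetical helper, not part of robotspy).

    timeout is passed to urlopen and applies to blocking socket operations.
    max_bytes caps how much of the response body is read.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read(max_bytes)
    # Decode leniently so a stray invalid byte does not abort the whole fetch.
    return raw.decode("utf-8", errors="replace")


# Hypothetical usage with robotspy (requires the package to be installed):
# import robots
# text = download_robots_txt("http://22-lr.forumactif.com/robots.txt")
# parser = robots.RobotsParser.from_string(text)
```

Separating the two steps also makes it easier to tell a network hang from a parser hang, since only the first step touches the network.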

andreburgaud commented 2 weeks ago

@ivbeg Sorry for the time it took me to release 0.11 (https://pypi.org/project/robotspy/). This release should address the timeout issue, although the logic may require more scrutiny, especially in higher-level functions like can_fetch. The function from_uri now takes a timeout parameter, set to 5 by default. Note that it is not a clock timeout per se, so a call may take longer than you would intuitively expect. As you suggested in your first comment, you can pass a specific timeout value. For example, you could do:

robots.RobotsParser.from_uri("https://earthworks.stanford.edu/robots.txt", 2)

To test it, you can use the following example with a dummy port (timeout set to 1):

robots.RobotsParser.from_uri("https://robotspy.org:555/robots.txt", 1)

You can find examples in the tests directory, file test_network.py.
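
Since the timeout above is described as not being a clock timeout, a caller who needs a hard upper bound on elapsed time could wrap the call in a helper such as the one below. This is a minimal sketch under stated assumptions, not part of robotspy: `call_with_deadline` and its `deadline` parameter are hypothetical names. The worker thread is abandoned rather than cancelled, so a truly stuck call keeps running in the background until the process exits.

```python
import threading


def call_with_deadline(func, *args, deadline=5.0, **kwargs):
    """Run func(*args, **kwargs) and give up after `deadline` seconds.

    Hypothetical helper: raises TimeoutError if the call has not finished
    in time, re-raises any exception from func, otherwise returns its result.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the error back to the calling thread
            outcome["error"] = exc

    # A daemon thread does not keep the interpreter alive if the call hangs.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(deadline)

    if thread.is_alive():
        raise TimeoutError(f"call did not finish within {deadline} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# Hypothetical usage with robotspy 0.11 or later (requires the package):
# import robots
# parser = call_with_deadline(
#     robots.RobotsParser.from_uri,
#     "https://earthworks.stanford.edu/robots.txt", 2,
#     deadline=10.0,
# )
```

The inner timeout passed to from_uri still does most of the work; the outer deadline is only a backstop for the cases where the total time exceeds what the caller can tolerate.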

@borisiskra, I believe the issue you raised is a bug and is not timeout-related. I suspect a parser issue and need to debug it. I will open a separate issue after confirming this is unrelated to the timeout problem @ivbeg opened.

Thank you for finding this problem. I will do my best to resolve it as soon as possible.

andreburgaud commented 1 week ago

Network timeout issue addressed in robotspy 0.11

andreburgaud commented 1 week ago

Something similar happens with robots.RobotsParser.from_uri("http://22-lr.forumactif.com/robots.txt"). I can download the robots.txt file, but if I pass it to robots.RobotsParser.from_string(robots_downloaded_file), it also hangs forever. Note: the file size is 20624 chars.

Fixed in robotspy version 0.12 (see issue #212)

Thank you, @borisiskra, for raising this problem ✨