Hi @m-appel
I'm willing to work on this issue. I've looked into the crawler files, and as you mentioned, they don't handle exceptions; instead they call `sys.exit()` if `status_code` isn't 200.
Am I supposed to catch `HTTPError`? And if I catch the error, what needs to be done beyond exiting with an error message? Can we please discuss this?
Thanks for your interest in IYP!
The simplest way is to not catch anything and replace `sys.exit()` calls that are not in an exception handler (example) with a custom exception.
We do not want the crawler to continue in cases where we currently use `sys.exit`, so the crawler should actually not handle the exception, but let it pass through to the `create_db` script, where it will be handled and printed to our log where we see it.
I think most of these are HTTP request errors, so you could just use `raise_for_status()` on the response object. Alternatively, you can take inspiration from Mohamed's recent additions and define custom exceptions, but either way is fine.
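For illustration, here is a minimal sketch of both options; the function names, URLs, and the `RequestStatusError` name below are placeholders, not existing IYP code:

```python
import requests


# Option 1: let requests raise for non-200 responses.
def fetch_data(url):
    req = requests.get(url)
    # Raises requests.exceptions.HTTPError for 4xx/5xx responses, which
    # propagates up to the create_db script instead of killing it.
    req.raise_for_status()
    return req.json()


# Option 2: raise a custom exception instead of calling sys.exit().
class RequestStatusError(Exception):
    pass


def fetch_data_custom(url):
    req = requests.get(url)
    if req.status_code != 200:
        raise RequestStatusError(f'Error while fetching data from {url}')
    return req.json()
```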
Hi @m-appel -
Thanks for getting back to me. I just need some clarification.
So we rather raise the exception in the crawler file so that it passes through to the `create_db` script, where the exception is handled, and omit the usage of `sys.exit`. I can raise the exception either by using `raise_for_status()` or by defining custom exceptions, is that what you meant? And given the alternative ways to raise exceptions, shall I stick to just one way across all crawler files for consistency purposes?
Also, there are multiple files that need to be modified, so I think it's better to raise one pull request for each file. What do you think?
> So we rather raise the exception in the crawler file so that it passes through to the `create_db` script, where the exception is handled, and omit the usage of `sys.exit`. I can raise the exception either by using `raise_for_status()` or by defining custom exceptions, is that what you meant?
Yes, exactly.
> And given the alternative ways to raise exceptions, shall I stick to just one way across all crawler files for consistency purposes?
It's probably better to be consistent, so let's just use custom exceptions I guess.
I think it's a good time to define the exception types only once in `iyp/__init__.py` and import them in each crawler. You can take inspiration from here and move the existing exceptions as well.
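To make this concrete, here is a rough sketch of what the shared definition and its use in a crawler might look like; the class name happens to match the `RequestStatusError` used later in this thread, but the docstring and surrounding code are only illustrative:

```python
# Sketch of an addition to iyp/__init__.py

class RequestStatusError(Exception):
    """Raised when an HTTP request returns an unexpected status code."""
```

Each crawler would then import the shared exception instead of defining its own or calling `sys.exit()`:

```python
# Illustrative snippet inside a crawler module
import requests

from iyp import RequestStatusError


def fetch(url):
    req = requests.get(url)
    if req.status_code != 200:
        raise RequestStatusError(f'Error while fetching data from {url}')
    return req
```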
> Also, there are multiple files that need to be modified, so I think it's better to raise one pull request for each file. What do you think?
That would be a lot of pull requests :) Since the changes in each file will be very basic, please create just one PR. The existing crawlers will not be changed anyways, so the chance of creating conflicts is very low. I will assign this issue to you for now, but there is no pressure.
Finally, we are going into the year-end celebration here in Japan, so my replies might be slow for the next week!
Hi @m-appel -
There are quite a few import issues in the crawler files; I'm resolving those as well.
This is an example of the change I made:
before -> `from iyp.crawlers.alice_lg import Crawler`
after -> `from . import Crawler`
since we are importing from the `__init__.py` of the same directory (`alice_lg`).
I'm using relative imports to resolve these, shall I go ahead with it?
If yes, should I include these in the same pull request?
And here's how I changed the error conditional statement in `apnic.eyeball`. Please take a look and tell me if this is how I am supposed to proceed.
Before:
```python
if req.status_code != 200:
    sys.exit('Error while fetching data for ' + cc)
```
After:
```python
if req.status_code != 200:
    raise RequestStatusError(f'Error while fetching data for {cc}')
```
Hi, we use absolute imports on purpose since they are more explicit (and actually recommended by PEP 8), so please do not change them.
Your example fix looks good to me, thanks.
Hi @m-appel -
Here in `pch.show_bgp_parser`, would it be optimal to use `AddressValueError` in this case, or should I leave it as it is, since this file wasn't mentioned in the issue's description?
```python
if af not in (4, 6):
    logging.error(f'Invalid address family specified: {af}')
    sys.exit('Invalid address family specified.')
```
Ah, I think I did not mention it in the list since it's used statically by the crawler, i.e., this error can only occur during development. But let's be consistent and change it to an exception as well. Good catch, thanks!
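A minimal sketch of what that change could look like, assuming an `AddressValueError` is defined in `iyp/__init__.py` alongside the other custom exceptions (the name is just the one floated above, not settled):

```python
import logging

from iyp import AddressValueError  # assumed shared exception, see sketch above


def check_af(af):
    if af not in (4, 6):
        logging.error(f'Invalid address family specified: {af}')
        # Raising lets create_db log the failure instead of being killed by sys.exit().
        raise AddressValueError(f'Invalid address family specified: {af}')
```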
Got it @m-appel. In that case, I've added a custom error for `ConnectionError` as well.
Crawlers should not use `sys.exit()` in case of a fatal error, since this will crash the entire `create_db.py` script. Instead, an exception should be raised, which can be properly caught and processed. Although some crawlers depend on others, and if they crash the dump is basically useless, this should not be the default assumption, and we check the logs after each dump anyways.
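For context, a rough sketch of how the catching side in `create_db.py` could work; the module list, entry point, and loop below are assumptions for illustration, not the actual script:

```python
import importlib
import logging

# Hypothetical list of crawler modules; the real script builds this differently.
CRAWLERS = ['iyp.crawlers.apnic.eyeball']

for name in CRAWLERS:
    try:
        module = importlib.import_module(name)
        module.Crawler().run()  # assumed entry point, for illustration only
    except Exception as e:
        # Because the crawler raised instead of calling sys.exit(), only this
        # crawler fails; the error goes to the log and the loop continues.
        logging.error(f'Crawler {name} failed: {e}')
```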
Affected crawlers: