InternetHealthReport / internet-yellow-pages

A knowledge graph for the Internet
https://iyp.iijlab.net
GNU General Public License v3.0

Remove sys.exit calls from crawlers #70

Closed: m-appel closed this issue 10 months ago

m-appel commented 1 year ago

Crawlers should not use sys.exit() in case of a fatal error, since this will crash the entire create_db.py script. Instead, an exception should be raised, which can be properly caught and processed.

Although some crawlers depend on others, and the dump is basically useless if those crash, this should not be the default assumption, and we check the logs after each dump anyway.
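For context, here is a minimal sketch of the intended behavior (the module list and the main() entry point are hypothetical; the real create_db.py loop differs):

import importlib
import logging

# Hypothetical list of crawler modules to run.
crawlers = ['iyp.crawlers.example_org.example']

for name in crawlers:
    try:
        # Each crawler module is assumed to expose a main() entry point.
        importlib.import_module(name).main()
    except Exception as e:
        # A raised exception is logged and the remaining crawlers still run.
        # sys.exit() raises SystemExit instead, which this handler does not
        # catch, so it would stop the entire run.
        logging.error(f'Crawler {name} failed: {e}')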

Affected crawlers:

KiranSatyaRaj commented 10 months ago

Hi @m-appel, I'm willing to work on this issue. I've looked into the crawler files and, as you mentioned, they don't handle exceptions; instead they call sys.exit() if status_code isn't 200. Am I supposed to catch HTTPError, and if I catch it, what needs to be done other than exiting with an error message? Can we please discuss this?

m-appel commented 10 months ago

Thanks for your interest in IYP! The simplest way is to not catch anything and replace sys.exit() calls that are not in an exception handler (example) with a custom exception.

We do not want the crawler to continue in the cases where we currently use sys.exit, so the crawler should not handle the exception itself, but let it pass through to the create_db script, where it will be handled and printed to our log, where we will see it.

I think most of these are HTTP request errors, so you could just use raise_for_status() on the response object. Alternatively, you can take inspiration from Mohamed's recent additions and define custom exceptions, but either way is fine.
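For illustration, a minimal sketch of both options side by side; in practice you would pick one. RequestStatusError matches the exception name used later in this thread, and the fetch() function and its URL parameter are made up for the example:

import requests

class RequestStatusError(requests.HTTPError):
    """Custom exception for unexpected HTTP status codes."""

def fetch(url):
    req = requests.get(url)
    # Option 1: let requests raise an HTTPError for any 4xx/5xx response.
    req.raise_for_status()
    # Option 2: raise a custom exception instead of calling sys.exit().
    if req.status_code != 200:
        raise RequestStatusError(f'Error while fetching {url}')
    return req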

KiranSatyaRaj commented 10 months ago

Hi @m-appel - Thanks for responding. I just need some clarification. So we rather raise the exception in the crawler file, so that it passes through to the create_db script where the exception is handled, and omit the usage of sys.exit; I can raise the exception either by using raise_for_status() or by defining custom exceptions. Is that what you meant? And given the alternative ways to raise exceptions, shall I stick to just one way across all crawler files for consistency? Also, there are multiple files that need to be modified, so I think it's better to raise one pull request for each file. What do you think?

m-appel commented 10 months ago

So we rather raise the exception in the crawler file, so that it passes through to the create_db script where the exception is handled, and omit the usage of sys.exit; I can raise the exception either by using raise_for_status() or by defining custom exceptions. Is that what you meant?

Yes, exactly.

And given the alternative ways to raise exceptions, shall I stick to just one way across all crawler files for consistency?

It's probably better to be consistent, so let's just use custom exceptions I guess.

I think it's a good time to define the exception types only once in iyp/__init__.py and import them in each crawler. You can take inspiration from here and move the existing exceptions as well.
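A sketch of that layout; both exception names appear elsewhere in this thread, but the exact set of classes is illustrative:

# iyp/__init__.py -- shared exception types, defined once
class RequestStatusError(Exception):
    """An HTTP request returned an unexpected status code."""

class AddressValueError(Exception):
    """An invalid address family or address value was encountered."""

# In a crawler module (hypothetical path iyp/crawlers/example_org/example.py):
from iyp import RequestStatusError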

Also, there are multiple files that need to be modified, so I think it's better to raise one pull request for each file. What do you think?

That would be a lot of pull requests :) Since the changes in each file will be very basic, please create just one PR. The existing crawlers will not be changed anyway, so the chance of creating conflicts is very low. I will assign this issue to you for now, but there is no pressure.

Finally, we are going into the year-end celebration here in Japan, so my replies might be slow for the next week!

KiranSatyaRaj commented 10 months ago

Hi @m-appel - There are quite a few import issues in the crawler files, and I'm resolving those as well. This is an example of the change I made: before -> from iyp.crawlers.alice_lg import Crawler, after -> from . import Crawler. Since we are importing from the __init__.py of the same directory (alice_lg), I'm using relative imports to resolve these. Shall I go ahead with it? If yes, should I include these changes in the same pull request?

And here's how I changed the error conditional in apnic.eyeball; please take a look and tell me if this is how I am supposed to proceed. Before:

if req.status_code != 200:
    sys.exit('Error while fetching data for ' + cc)

After:

if req.status_code != 200:
    raise RequestStatusError(f'Error while fetching data for {cc}')

m-appel commented 10 months ago

Hi, we use absolute imports on purpose since they are more explicit (and actually recommended by PEP 8), so please do not change them.
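For reference, the two styles from the example above, side by side:

# Preferred: absolute import, explicit about where Crawler comes from.
from iyp.crawlers.alice_lg import Crawler

# Discouraged in this codebase: relative import of the same class.
# from . import Crawler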

Your example fix looks good to me, thanks.

KiranSatyaRaj commented 10 months ago

Hi @m-appel - Here in pch.show_bgp_parser, would it be optimal to use AddressValueError in this case, or should I leave it as it is, since this file wasn't mentioned in the issue's description?


if af not in (4, 6):
    logging.error(f'Invalid address family specified: {af}')
    sys.exit('Invalid address family specified.')

m-appel commented 10 months ago

Ah, I think I did not mention it in the list since it's used statically by the crawler, i.e., this error can only occur during development. But let's be consistent and change it to an exception as well. Good catch, thanks!
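The raised version might look like this (a sketch of the fragment above, assuming AddressValueError is one of the custom exceptions defined in iyp/__init__.py as discussed earlier):

import logging

from iyp import AddressValueError  # assumed to live in iyp/__init__.py

if af not in (4, 6):
    logging.error(f'Invalid address family specified: {af}')
    raise AddressValueError(f'Invalid address family specified: {af}')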

KiranSatyaRaj commented 10 months ago

Ah, I think I did not mention it in the list since it's used statically by the crawler, i.e., this error can only occur during development. But let's be consistent and change it to an exception as well. Good catch, thanks!

Got it @m-appel. In that case, I've added a custom error for ConnectionError as well.
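A sketch of what that might look like (assuming the custom class lives in iyp/__init__.py; note it shadows Python's built-in ConnectionError, so the explicit import makes the origin clear):

import requests

from iyp import ConnectionError  # custom class, shadows the built-in name

def fetch(url):
    try:
        return requests.get(url, timeout=30)
    except requests.exceptions.ConnectionError as e:
        # Re-raise as the shared custom exception so create_db.py logs it.
        raise ConnectionError(f'Error while connecting to {url}: {e}')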