daijro / hrequests

🚀 Web scraping for humans
https://daijro.gitbook.io/hrequests/
MIT License
654 stars 39 forks

Handling AttributeErrors when parsing many different URLs #16

Closed ThinksFast closed 1 year ago

ThinksFast commented 1 year ago

Hi, nice work on this library. I'm trying to parse a bunch of pages with it. But I'm running into issues where fetching content that doesn't exist throws an attribute error. Here's an example:

import hrequests

resp = hrequests.get("some_url")
data = {}

try:
    data['url'] = resp.url
    data["canonical"] = resp.html.find("link[@rel='canonical']").url
    data["title"] = resp.html.find("title").text
    data["meta_description"] = resp.html.find("meta[name='description']").text

except AttributeError:
    pass

Because I'm calling .text and .url on these elements, any element missing from the HTML throws AttributeError: 'NoneType' object has no attribute 'text', and the data dict only keeps what was parsed before the error, dropping every valid element after it. So for example, if there is no <title> element but the other three elements do exist, the data dict will only contain the url and canonical values; it won't have the meta_description even though that tag is present.

The attribute error makes sense, but when scraping content at scale there are going to be errors, edge cases, and missing content, and I don't see a way to handle this gracefully. I'm fine getting an empty string or None when a value is missing. Is there a better way to handle this? I could drop the .url and .text property accesses, but then I'd still have to handle missing elements downstream with a bunch of if/else statements, and I'd prefer to parse out the content early in the pipeline.
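In the meantime, a small per-field guard keeps one missing node from aborting the rest of the dict. A minimal sketch: `safe` and the `Node` stub below are illustrative stand-ins, not hrequests API:

```python
def safe(getter, default=None):
    """Run a zero-argument extractor; fall back when the element is missing."""
    try:
        return getter()
    except AttributeError:
        return default

class Node:
    """Stand-in for a parsed element; real code would use resp.html.find()."""
    def __init__(self, text):
        self.text = text

# Simulate a page where <title> exists but the meta description does not
# (i.e. find() returned None for it).
title_node = Node("Example")
meta_node = None

data = {
    "title": safe(lambda: title_node.text),
    "meta_description": safe(lambda: meta_node.text, default=""),
}
print(data)  # {'title': 'Example', 'meta_description': ''}
```

Each field fails independently, so one absent element no longer discards the others.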

daijro commented 1 year ago

Thanks for pointing this out. Would it be best to handle exceptions through a passed exception_handler function (similar to imap)? Just a thought.

I plan on redesigning the html parser this weekend to use selectolax instead of pyquery. I'll be sure to consider exception handling. Thanks!

ThinksFast commented 1 year ago

According to the docs, it looks like the custom exception_handler is for requests, not responses. Will it extend into parsing responses if the connection is closed?

daijro commented 1 year ago

Hello!

As of v0.8.0, HTML attributes should now return a None value instead of an AttributeError when an unknown attribute is called.
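For downstream code, that means a missing node can be collapsed with a plain None check. A minimal sketch of the calling pattern, with an `Element` stub in place of the real parser (the stub and `extract_text` are illustrative, not hrequests API):

```python
class Element:
    """Stub standing in for a parsed HTML element."""
    def __init__(self, text=None):
        self.text = text

def extract_text(element):
    """Return the element's text, or None when find() came back empty."""
    return element.text if element is not None else None

# Simulate find() results: <title> present, meta description absent.
found = {"title": Element("Example Page"), "meta": None}

data = {
    "title": extract_text(found["title"]),
    "meta_description": extract_text(found["meta"]),
}
print(data)  # {'title': 'Example Page', 'meta_description': None}
```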

daijro commented 1 year ago

I think I may have misinterpreted your comment. Would it be an appropriate enhancement to add a custom exception_handler (independent of the requests exception_handler) for handling missing HTML attributes, for better edge-case handling at larger scale? Or would you say leaving it as returning None is fine?

Just an idea, feel free to give suggestions

ThinksFast commented 1 year ago

Thanks! Returning None is fine for me; at least the script will keep parsing, and I can build logic downstream to handle it.

But an exception handler would be a really nice addition, particularly for broad crawls where there is more variability in the responses. We could assign values for specific types of failures and keep parsing: no body in the response, assign None to that element; status_code == 999, retry after 30 seconds; a ZeroDivisionError, assign 0 to that element; and so on.
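The idea above could look roughly like this: a per-field extractor that maps exception types to fallback actions. Everything here is a hypothetical sketch, not an existing hrequests API:

```python
def extract(getter, handlers):
    """Run an extractor; on a known failure, apply that failure's fallback.

    handlers maps exception types to callables; exact-type lookup keeps
    the sketch small (no subclass resolution).
    """
    try:
        return getter()
    except tuple(handlers) as exc:
        return handlers[type(exc)](exc)

handlers = {
    AttributeError: lambda exc: None,    # missing node -> None
    ZeroDivisionError: lambda exc: 0,    # bad arithmetic -> 0
}

print(extract(lambda: 1 / 0, handlers))    # 0
print(extract(lambda: None.text, handlers))  # None
```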

Something the scrapy package includes is container classes for scraped items (Items and ItemLoaders), which allow functions to be applied to data on input and output. This is useful for processing "dirty" DOM elements, setting custom serialization values, etc. There are probably some good lessons in that pattern. I'm not sure whether Pydantic or msgspec would be good candidates for this, but it's worth exploring; integrating a data validation library seems like a natural fit for a package like hrequests.
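For reference, the ItemLoader-style pattern can be sketched in a few lines: a loader that runs a per-field input processor before storing the value. `Loader` and `strip_or_none` are illustrative, not scrapy or hrequests API:

```python
class Loader:
    """Minimal loader in the spirit of scrapy's ItemLoader (illustrative)."""
    def __init__(self, processors):
        self.processors = processors  # field name -> input processor
        self.item = {}

    def add_value(self, field, raw):
        process = self.processors.get(field, lambda v: v)
        self.item[field] = process(raw)

def strip_or_none(value):
    """Clean a 'dirty' DOM string; pass missing values through as None."""
    return value.strip() if isinstance(value, str) else None

loader = Loader({"title": strip_or_none, "meta_description": strip_or_none})
loader.add_value("title", "  Example Page \n")
loader.add_value("meta_description", None)  # missing node stays None
print(loader.item)  # {'title': 'Example Page', 'meta_description': None}
```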