ThinksFast closed this issue 1 year ago
Thanks for pointing this out. Would it be best to handle exceptions through a passed `exception_handler` function (similar to `imap`)? Just a thought.
I plan on redesigning the HTML parser this weekend to use selectolax instead of pyquery. I'll be sure to consider exception handling. Thanks!
According to the docs, it looks like the custom `exception_handler` is for requests, not responses. Will it extend into parsing responses if the connection is closed?
Hello!
As of v0.8.0, HTML attributes should now return a `None` value instead of raising an `AttributeError` when an unknown attribute is accessed.
I think I may have misinterpreted your comment. Would it be an appropriate enhancement to add a custom `exception_handler` (independent of the requests `exception_handler`) for handling missing HTML attributes, for better edge-case handling at larger scale? Or would you say leaving it as returning `None` is fine?

Just an idea, feel free to give suggestions!
Thanks! For me, `None` getting returned is fine; at least the script will keep parsing, and I can build logic downstream to handle it.
But an exception handler would be a really nice addition, particularly for broad crawls where there is more variability in the responses. We could assign values for specific types of failures and keep parsing: no `body` in the response, assign `None` to that element; `status_code == 999`, retry after 30 seconds; a `ZeroDivisionError`, assign 0 to that element, etc.
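The failure-to-fallback mapping described above could look something like the sketch below. This is purely hypothetical API design, not anything hrequests currently ships; `make_exception_handler` and its dispatch table are invented names for illustration.

```python
# Hypothetical sketch: map specific exception types to fallback values
# so a long parsing run keeps going. Not part of the hrequests API.
from typing import Any, Callable

def make_exception_handler(
    fallbacks: dict[type, Any],
) -> Callable[[Callable[[], Any]], Any]:
    """Return a wrapper that swaps known exception types for fallback values."""
    def handle(extract: Callable[[], Any]) -> Any:
        try:
            return extract()
        except tuple(fallbacks) as exc:
            # Walk the mapping so subclasses also match a base-class entry.
            for exc_type, fallback in fallbacks.items():
                if isinstance(exc, exc_type):
                    return fallback
            raise  # unreachable, but keeps the intent explicit
    return handle

handler = make_exception_handler({
    AttributeError: None,   # missing element -> assign None
    ZeroDivisionError: 0,   # bad arithmetic on a value -> assign 0
})

title = handler(lambda: None.text)  # simulates a missing element
ratio = handler(lambda: 1 / 0)
print(title, ratio)  # -> None 0
```

Unlisted exception types still propagate, so genuinely unexpected failures are not silently swallowed.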
Something the `scrapy` package includes is container classes for scraped items (`Item` and `ItemLoader`), which allow functions to be applied to data on input and output. This is useful for processing "dirty" DOM elements, setting custom serialization values, etc. There are probably some good lessons in that pattern. I'm not sure if Pydantic or msgspec would be good candidates for this, but it's worth exploring; integrating a data validation library seems like a natural fit for a package like `hrequests`.
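To make the pattern concrete, here is a stripped-down, plain-Python sketch of the loader idea: each field gets an input processor (applied as values arrive) and an output processor (applied when the item is built). The class and processor names below are illustrative, not scrapy's actual API.

```python
# Plain-Python sketch of the Item/ItemLoader pattern from scrapy;
# names and signatures here are illustrative, not scrapy's own.
from typing import Any, Callable

class ItemLoader:
    def __init__(self, processors: dict[str, tuple[Callable, Callable]]):
        # processors maps field name -> (input_processor, output_processor)
        self.processors = processors
        self._values: dict[str, list[Any]] = {}

    def add_value(self, field: str, raw: Any) -> None:
        in_proc, _ = self.processors[field]
        self._values.setdefault(field, []).append(in_proc(raw))

    def load_item(self) -> dict[str, Any]:
        return {
            field: out_proc(self._values.get(field, []))
            for field, (_, out_proc) in self.processors.items()
        }

strip = lambda v: v.strip() if isinstance(v, str) else v
take_first = lambda vs: vs[0] if vs else None  # missing field -> None

loader = ItemLoader({
    "title": (strip, take_first),
    "url": (strip, take_first),
})
loader.add_value("title", "  Example Page \n")
print(loader.load_item())  # -> {'title': 'Example Page', 'url': None}
```

The nice property for scraping at scale is that "dirty" cleanup and missing-field defaults live in one declarative place instead of being scattered through downstream if/else logic.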
Hi, nice work on this library. I'm trying to parse a bunch of pages with it, but I'm running into issues where fetching content that doesn't exist throws an attribute error. Here's an example:
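(The original code sample did not survive here; below is a minimal reconstruction of the failure mode, using `None` stand-ins in place of real `hrequests` element lookups, since a lookup for a missing node returns `None`.)

```python
# Stand-in for the failure: looking up a missing <title> yields None,
# and calling .text on None raises AttributeError, aborting the loop.
class Element:
    def __init__(self, text):
        self.text = text

# Simulated lookups against a page that has no <title> element.
elements = {
    "title": None,                               # <title> missing
    "canonical": Element("https://example.com/"),
}

data = {}
try:
    data["title"] = elements["title"].text       # AttributeError here
    data["canonical"] = elements["canonical"].text
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'text'

print(data)  # -> {} : everything after the failing line is lost
```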
Because I'm calling `.text` and `.url` on these elements, if any elements don't exist in the HTML response, the code throws an `AttributeError: 'NoneType' object has no attribute 'text'`, and the `data` object will only have content parsed prior to the error, missing any other valid elements. So, for example, if there is no `<title>` element but the other three elements do exist, the `data` dict will only contain the `url` and `canonical` values; it won't have the `meta_description`.

The attribute error makes sense, but when scraping content at scale, there are going to be errors, edge cases, and missing content. I don't see a way to handle this gracefully. I'm fine having an empty string if the value is missing, or a `None` value. Is there a better way to handle this? I can remove the `.url` and `.text` properties, but I'd still have to handle it downstream with a bunch of if/else statements, and I'd prefer to just parse out the content early in the pipeline.