feature: supports delaying url date extraction

adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line

https://htmldate.readthedocs.io

Apache License 2.0


feature: supports delaying url date extraction #66

Closed getorca closed 1 year ago

getorca commented 1 year ago

Adds a feature to improve the precision of extracted dates by delaying date extraction from the URL. See https://github.com/adbar/htmldate/issues/55.

Adds the boolean parameter url_delayed to the find_date function.

This is slightly hacky, but it is a quick fix. A better long-term solution would be to allow the order of the extractors to be configured.
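
To illustrate the idea behind the flag, here is a minimal standalone sketch. It does not use htmldate's internals: `date_from_url`, `date_from_markup`, `find_date_sketch` and both regular expressions are hypothetical names made up for this example, and only the `url_delayed` name is taken from the description above. The assumption is that a date found in the markup is more precise than one guessed from the URL, so the URL result is only used as a fallback when the flag is set.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; not htmldate's actual rules.
URL_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})(?:/|$)")
META_DATE = re.compile(
    r'<meta[^>]+property="article:published_time"[^>]+content="(\d{4}-\d{2}-\d{2})'
)


def date_from_url(url: Optional[str]) -> Optional[str]:
    """Return YYYY-MM-DD if the URL contains /YYYY/MM/DD, else None."""
    if not url:
        return None
    match = URL_DATE.search(url)
    return "-".join(match.groups()) if match else None


def date_from_markup(html: str) -> Optional[str]:
    """Return the date from an article:published_time meta tag, else None."""
    match = META_DATE.search(html)
    return match.group(1) if match else None


def find_date_sketch(
    html: str, url: Optional[str] = None, url_delayed: bool = False
) -> Optional[str]:
    """Try the URL first by default; with url_delayed, use it only as a fallback."""
    url_date = date_from_url(url)
    if url_date and not url_delayed:
        return url_date
    # Deferred mode: prefer the markup, fall back to the URL date.
    return date_from_markup(html) or url_date


html = '<meta property="article:published_time" content="2021-03-05T10:00:00Z">'
url = "https://example.org/2021/03/01/post"
print(find_date_sketch(html, url))                    # URL date wins
print(find_date_sketch(html, url, url_delayed=True))  # markup date wins
```

With the flag off, the URL date short-circuits everything else. With it on, the URL is still consulted, but only after the markup has had its chance.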

adbar commented 1 year ago

Hi @getorca, could you please format the code with black?

getorca commented 1 year ago

> Hi @getorca, could you please format the code with black?

should be good to go @adbar

adbar commented 1 year ago

Thanks @getorca, it looks good, I'll give it some thought and integrate the PR next week.

adbar commented 1 year ago

Hi @getorca, I just made sure the changes are easier to understand.

I also realized that the deferred URL extraction could be moved further down in the code but I have nothing to benchmark it on, do you think it would be beneficial or do we first leave the code as it is?

getorca commented 1 year ago

> Hi @getorca, I just made sure the changes are easier to understand.
>
> I also realized that the deferred URL extraction could be moved further down in the code but I have nothing to benchmark it on, do you think it would be beneficial or do we first leave the code as it is?

Yes, it could. I moved it down as far as the steps that I know extract more precise dates.

getorca commented 1 year ago

@adbar, I also need to benchmark this when used in Trafilatura, because when I pulled it into my project, it was about 30% slower than goose3. But I'm running in parallel, so I'm not sure whether it's related to that, to possible memory leaks in Trafilatura, or to a difference in extractions slowing down some of my other pipeline steps. Or the date change slowed it down that much. I wrote a new benchmarking library over the weekend, but I still need to add a function that lets the extractions run in parallel to see if something else is causing the slowdown. I'll let you know more later.
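
For reference, a comparison of this kind could be sketched with the standard library alone. This is not the benchmarking library mentioned above. `extract`, `time_serial` and `time_parallel` are hypothetical names, and `extract` is a trivial stand-in for whatever extraction step is being measured.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def extract(document: str) -> int:
    """Hypothetical stand-in for one extraction step (counts tokens)."""
    return len(document.split())


def time_serial(func, items):
    """Run func over items one by one; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    return results, time.perf_counter() - start


def time_parallel(func, items, workers=4):
    """Run func over items in a thread pool; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(func, items))  # map preserves input order
    return results, time.perf_counter() - start


documents = ["<html>some sample text</html>"] * 200
serial_results, serial_seconds = time_serial(extract, documents)
parallel_results, parallel_seconds = time_parallel(extract, documents)
print(f"serial: {serial_seconds:.4f}s, parallel: {parallel_seconds:.4f}s")
```

Timing the same function both ways helps separate the cost of the extraction itself from overhead introduced by the parallel setup. Note that threads rarely speed up CPU-bound Python code because of the global interpreter lock, so a process pool may be the more relevant comparison.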

adbar commented 1 year ago

OK, so I'm leaving the PR open for now?

I usually benchmark Trafilatura without metadata extraction, it could be that portions of the code are slower but typically I'd expect it to extract more metadata than goose3. In any case, date extraction with htmldate is much faster and more accurate on my benchmark, you should be able to reproduce it (see tests/comparison.py).

BTW if you need to profile code I can really recommend pyinstrument (among others).

adbar commented 1 year ago

@getorca The PR looks ready to merge, do you confirm?

getorca commented 1 year ago

@adbar Yes, absolutely. Thanks.