`find_date` doesn't extract `%D %b %Y` formatted dates in free text

adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line

https://htmldate.readthedocs.io

Apache License 2.0

`find_date` doesn't extract `%D %b %Y` formatted dates in free text #67

Closed k-sareen closed 1 year ago

k-sareen commented 1 year ago

For the following MWE:

from htmldate import find_date

print(find_date("<html><body>Wed, 19 Oct 2022 14:24:05 +0000</body></html>"))

htmldate outputs 2022-01-01 instead of the expected 2022-10-19.

I've traced the execution of the above call and I believe it is the search_page function that has the bug. It doesn't seem to catch the above date pattern as a valid date and only grabs onto the 2022 part of the date string (which autocompletes the rest to 1st Jan).

I haven't found time to understand why the bug happens in detail so I don't have a solution right now. I'll try and see if I can fix the bug and will make a PR if I can.

adbar commented 1 year ago

Hi @k-sareen, thanks for your feedback.

It's not a bug, since htmldate cannot extract dates from free text. In this case it looks simple, but try it on a 10,000-character string where you don't know where the date is... For this reason, the package targets metadata or HTML fields and uses free text only as a last resort. All it can do here is return the year as an approximation. But your example shows that it may be useful to look around the year info and maybe pass that string to the pipeline. I'm going to think about it.
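
To illustrate the year-only approximation described above, here is a minimal sketch. It is not htmldate's actual code: the pattern `YEAR_PATTERN` and the function `approximate_year_date` are hypothetical names, and the accepted year range is an arbitrary assumption.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical pattern: a standalone four-digit year between 1990 and 2039.
YEAR_PATTERN = re.compile(r"\b(199[0-9]|20[0-3][0-9])\b")


def approximate_year_date(text: str) -> Optional[str]:
    """Return 'YYYY-01-01' for the first plausible year in the text, else None."""
    match = YEAR_PATTERN.search(text)
    if match is None:
        return None
    # Only the year is known, so month and day default to 1 January.
    return date(int(match.group(1)), 1, 1).isoformat()


html = "<html><body>Wed, 19 Oct 2022 14:24:05 +0000</body></html>"
print(approximate_year_date(html))
```

With only the year matched, the day and month around it are ignored, which is how a string containing 19 Oct 2022 can end up reported as 1 January 2022.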

adbar commented 1 year ago

Note about a quick fix: the issue can be resolved as follows, but the code gets slower and would have to be tested carefully. It can lead to false positives by extracting any date mentioned in the text without disambiguation or further clue about its relevance:

Changes in core.py:

imports: from .extractors import regex_parse

beginning of search_page() function:

# try a regex-based parse on the whole HTML string first
dateobject = regex_parse(htmlstring)
if (
    date_validator(dateobject, outputformat, earliest=min_date, latest=max_date)
    is True
):
    try:
        LOGGER.debug("custom parse result: %s", dateobject)
        return dateobject.strftime(outputformat)  # type: ignore
    except ValueError as err:
        # log the failure and continue with the rest of search_page()
        LOGGER.error("value error during conversion: %s %s", dateobject, err)
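
For readers who want to try the idea outside the library, the following is a self-contained sketch of a free-text search for English day-month-year dates, using only the standard library. It does not reproduce `regex_parse` or any other htmldate internals: `DMY_PATTERN`, `MONTHS` and `first_dmy_date` are hypothetical names, and the pattern only covers English month names.

```python
import re
from datetime import date
from typing import Optional

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Hypothetical pattern for "19 Oct 2022" or "19 October 2022".
DMY_PATTERN = re.compile(
    r"\b([0-3]?[0-9]) (" + "|".join(MONTH_NAMES) + r")[a-z]* ([0-9]{4})\b"
)


def first_dmy_date(text: str, outputformat: str = "%Y-%m-%d") -> Optional[str]:
    """Return the first valid day-month-year date found in the text, else None."""
    for match in DMY_PATTERN.finditer(text):
        day, month, year = match.groups()
        try:
            parsed = date(int(year), MONTHS[month], int(day))
        except ValueError:
            continue  # impossible dates such as "31 Feb 2022" are skipped
        return parsed.strftime(outputformat)
    return None


html = "<html><body>Wed, 19 Oct 2022 14:24:05 +0000</body></html>"
print(first_dmy_date(html))
```

The sketch returns the first match regardless of its relevance, so a text such as "Updated 3 October 2021, first published 19 Oct 2020" yields the update date. This is the false-positive risk mentioned above.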

k-sareen commented 1 year ago

Ah right. I apologize, I seem to have misunderstood what kinds of dates htmldate can handle. Thank you for your quick response. Would you have a recommendation for a library that can handle dates in free-form text? Unfortunately I can't control what kind of date format I receive from articles (why can't everyone just use ISO :cry:).

adbar commented 1 year ago

No problem, I could add this functionality to the library but I need some time to test it. Just out of curiosity: Which languages are you interested in?

k-sareen commented 1 year ago

I'm working with English text/articles only. Though I think you're right that this is a bit of a slippery slope as it may potentially catch dates that are mentioned in the prose but are not the actual article date. I think it might be best to keep your library simple and I'll try and get around this edge case myself. Thank you again for your great work and for your insight!

P.S. Should I close the issue?

adbar commented 1 year ago

Thanks for your feedback. You can leave the issue open; I'll think about it and close it if it goes beyond the scope of the library.

adbar commented 1 year ago

Full text search is now supported and your example above works.