k-sareen closed this issue 1 year ago
Hi @k-sareen, thanks for your feedback.
It's not a bug, since htmldate cannot extract dates from free text. In this case it looks simple, but try this on a 10,000-character string where you don't know where the date is... For this reason, the package targets metadata or HTML fields and uses free text as a last resort.
All it can do here is return the year as an approximation. But your example shows that it may be useful to look around the year info and maybe pass this string to the pipeline. I'm going to think about it.
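To make the "look around the year" idea concrete, here is a minimal sketch using only the standard library. It is not htmldate's code: the helper name `parse_around_year`, the `window` size and the two English date layouts are all assumptions chosen for illustration.

```python
import re
from datetime import date
from typing import Optional

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
# Two assumed English layouts: "19 October 2022" and "October 19, 2022"
DAY_FIRST = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]+),?\s+(\d{4})\b")
MONTH_FIRST = re.compile(r"\b([A-Za-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b")


def parse_around_year(text: str, window: int = 30) -> Optional[date]:
    """Find the first year, then try to read a full date just before it."""
    found = YEAR.search(text)
    if found is None:
        return None
    # Inspect only a short slice ending at the year, not the whole document.
    snippet = text[max(0, found.start() - window):found.end()]
    for pattern, day_first in ((DAY_FIRST, True), (MONTH_FIRST, False)):
        match = pattern.search(snippet)
        if match is None:
            continue
        if day_first:
            day, month_name, year = match.groups()
        else:
            month_name, day, year = match.groups()
        month = MONTHS.get(month_name.lower())
        if month is None:
            continue  # the word next to the number is not a month name
        try:
            return date(int(year), month, int(day))
        except ValueError:
            continue  # impossible date such as "31 February 2022"
    # No full date near the year: the caller can fall back to the year alone.
    return None
```

With these assumptions, `parse_around_year("Posted on 19 October 2022 by the editors")` yields `date(2022, 10, 19)`, while a string that only contains a bare year returns `None` and leaves the year-only approximation to the caller.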
Note about a quick fix: the issue can be resolved as follows, but the code gets slower and would have to be tested carefully. It can also lead to false positives, since it extracts any date mentioned in the text without disambiguation or any further clue about its relevance:
Changes in core.py:

    from .extractors import regex_parse

In the search_page() function:

    # Last resort: run the free-text regex parser on the whole document
    dateobject = regex_parse(htmlstring)
    # Only accept the result if it lies within the configured bounds
    if (
        date_validator(dateobject, outputformat, earliest=min_date, latest=max_date)
        is True
    ):
        try:
            LOGGER.debug("custom parse result: %s", dateobject)
            return dateobject.strftime(outputformat)  # type: ignore
        except ValueError as err:
            # Log the parsed object; no variable named "string" exists here
            LOGGER.error("value error during conversion: %s %s", dateobject, err)
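The false-positive risk can be illustrated with a small standalone sketch. It does not use htmldate's `regex_parse` or `date_validator`; the helper `candidates_in_bounds` is hypothetical, and the search is restricted to ISO-style dates to keep the example short.

```python
import re
from datetime import date
from typing import List

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def candidates_in_bounds(text: str, earliest: date, latest: date) -> List[date]:
    """Collect every ISO-style date in the text that falls within the bounds."""
    results = []
    for year, month, day in ISO_DATE.findall(text):
        try:
            candidate = date(int(year), int(month), int(day))
        except ValueError:
            continue  # skip impossible dates such as 2022-13-40
        if earliest <= candidate <= latest:
            results.append(candidate)
    return results


text = (
    "The festival first ran on 2019-06-01. "
    "This report was published on 2022-10-19."
)
found = candidates_in_bounds(text, date(1995, 1, 1), date(2030, 1, 1))
# Both dates pass the bounds check, so taking the first match would
# return the festival date rather than the publication date.
first_match = found[0]
```

Bounds checking alone cannot tell a date mentioned in the prose from the publication date, which is why some form of disambiguation would be needed on top of a fix like this.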
Ah right. I apologize, I seem to have misunderstood what kinds of dates htmldate can handle. Thank you for your quick response. Would you have a recommendation for a library that can handle dates in free-form text? Unfortunately I can't control what kind of date format I receive from articles (why can't everyone just use ISO :cry:).
No problem. I could add this functionality to the library, but I need some time to test it. Just out of curiosity: which languages are you interested in?
I'm working with English text/articles only. Though I think you're right that this is a bit of a slippery slope as it may potentially catch dates that are mentioned in the prose but are not the actual article date. I think it might be best to keep your library simple and I'll try and get around this edge case myself. Thank you again for your great work and for your insight!
P.S. Should I close the issue?
Thanks for your feedback. You can leave the issue open; I'll think about it and close it if it goes beyond the scope of the library.
Full text search is now supported and your example above works.
For the following MWE:
htmldate outputs 2022-01-01 instead of the expected 2022-10-19.

I've traced the execution of the above call and I believe it is the search_page function that has the bug. It doesn't seem to catch the above date pattern as a valid date and only grabs onto the 2022 part of the date string (which autocompletes the rest to 1st Jan).

I haven't found time to understand why the bug happens in detail so I don't have a solution right now. I'll try and see if I can fix the bug and will make a PR if I can.
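The "autocompletes to 1st Jan" behaviour is what one would expect from any year-only fallback. The sketch below is an illustration under that assumption, not htmldate's actual implementation, and `year_fallback` is a hypothetical name.

```python
import re
from datetime import datetime
from typing import Optional

YEAR_ONLY = re.compile(r"\b((?:19|20)\d{2})\b")


def year_fallback(text: str, outputformat: str = "%Y-%m-%d") -> Optional[str]:
    """Return a formatted date built from the first bare year in the text."""
    match = YEAR_ONLY.search(text)
    if match is None:
        return None
    # Month and day are unknown, so both default to 1 (i.e. 1 January).
    return datetime(int(match.group(1)), 1, 1).strftime(outputformat)
```

Any text whose only recognised token is the year 2022 would therefore come out as 2022-01-01, whatever day and month surround it.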