adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
https://htmldate.readthedocs.io
Apache License 2.0
118 stars 26 forks source link

Keep track of where the date has been found #124

Open adbar opened 8 months ago

adbar commented 8 months ago

So far only the logs provide info on this. It would be nicer to be able to pinpoint the type (header, element, or text) or even the exact location of the result.

To do so the location info has to be propagated back along with the result.

PetroffSky commented 3 months ago

Hello! Problem: your wonderful scanner finds dates, but sometimes they are not the ones you need, for example: on the page: https://sushi-prk.ru/kak-sdelat-chtoby-pogony-ne-gnulis/ there is no date of creation of the article and the scanner selects the date from the footer of the page: https://sushi-prk.ru/kak-sdelat-chtoby-pogony-ne-gnulis/#:~:text=2024 How to avoid this wrong date?

adbar commented 3 months ago

This enhancement is not implemented yet. If you set the extensive_search option to False you'll restrict the search to less error-prone patterns.