adbar closed this issue 1 year ago
By default, dates before 1995 are considered implausible. However, lowering the minimum date does not fix the issue: the 1991 publication date is still rejected.
CLI:
htmldate -u "https://web.archive.org/web/20201205182452/https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083" -vv -min "1990-01-01"
Python:
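The original Python snippet did not survive on this page. A roughly equivalent call might look like the sketch below. It assumes that `find_date` accepts a URL as its first argument and a `min_date` string, and that the debug output comes from standard `logging` at DEBUG level. The `reproduce` wrapper is a hypothetical helper added for illustration, not part of the report.

```python
import logging
import sys

# URL and minimum date taken from the CLI call above.
URL = (
    "https://web.archive.org/web/20201205182452/"
    "https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083"
)
MIN_DATE = "1990-01-01"


def reproduce(url=URL, min_date=MIN_DATE):
    """Hypothetical helper: run htmldate with DEBUG logging enabled."""
    # Imported lazily so this file can be loaded without htmldate installed.
    from htmldate import find_date

    # Send DEBUG messages to stdout, mirroring the -vv flag (assumption).
    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
    return find_date(url, min_date=min_date)


if __name__ == "__main__":
    print(reproduce())
```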
Here is the debugging output without min_date:
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
With min_date at "1990-01-01":
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.extractors:custom parse test: 1991-01-02T01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 00:00:00
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.validators:date not valid: 1991-01-02
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:custom parse test: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:send to external parser: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.extractors:custom parse test: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.extractors:send to external parser: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
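For comparison, the expected outcome of the plausibility check can be sketched with the standard library alone. This is not htmldate's actual validator: `is_plausible` and `DEFAULT_MIN_DATE` are hypothetical names, and the 1995 default is taken from the description above. The sketch only shows that the `article:published_time` value from the log falls inside the window once the lower bound is moved to 1990-01-01.

```python
from datetime import date, datetime
from typing import Optional

# Assumed default lower bound, per the report: dates before 1995 are rejected.
DEFAULT_MIN_DATE = date(1995, 1, 1)


def is_plausible(candidate: str,
                 min_date: date = DEFAULT_MIN_DATE,
                 max_date: Optional[date] = None) -> bool:
    """Hypothetical check: is an ISO 8601 timestamp within [min_date, max_date]?"""
    if max_date is None:
        max_date = date.today()
    # Compare calendar dates only, so timezone-aware and naive
    # values are never mixed in a comparison.
    parsed = datetime.fromisoformat(candidate).date()
    return min_date <= parsed <= max_date


published = "1991-01-02T01:01:00+01:00"  # meta value from the log above
print(is_plausible(published))                             # default bound
print(is_plausible(published, min_date=date(1990, 1, 1)))  # lowered bound
```

With the default bound the date is rejected, and with the lowered bound it is accepted, which is the behaviour the report expects from `-min "1990-01-01"`.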
Bug originally posted by @kinoute in https://github.com/adbar/htmldate/issues/8#issuecomment-1204211104