0xFlo closed this issue 6 months ago
When opening the source of the page you can see that there are two dates:
<meta property="article:published_time" content="2020-07-21T00:17:28+00:00" />
<meta property="article:modified_time" content="2021-04-06T06:32:14+00:00" />
However, the script extracts only the original date and ignores the modified date:
Original publication date for https://insighttimer.com/blog/what-happens-when-you-open-your-third-eye/: 2020-07-21
Most recent update date for https://insighttimer.com/blog/what-happens-when-you-open-your-third-eye/: 2020-07-21
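For reference, here is a minimal standard-library sketch confirming that the two meta tags quoted above carry distinct dates. It is not how htmldate works internally. The names SAMPLE_HTML, ArticleTimeParser and extract_article_times are made up for this illustration, and the HTML is a trimmed stand-in for the real page.

```python
from html.parser import HTMLParser

# Trimmed stand-in for the page source (assumption: only the two meta tags matter here)
SAMPLE_HTML = """
<html><head>
<meta property="article:published_time" content="2020-07-21T00:17:28+00:00" />
<meta property="article:modified_time" content="2021-04-06T06:32:14+00:00" />
</head><body></body></html>
"""


class ArticleTimeParser(HTMLParser):
    """Collects the article:published_time and article:modified_time meta values."""

    WANTED = ("article:published_time", "article:modified_time")

    def __init__(self):
        super().__init__()
        self.times = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        content = attributes.get("content")
        if prop in self.WANTED and content:
            # Keep only the YYYY-MM-DD part of the ISO 8601 timestamp
            self.times[prop] = content[:10]


def extract_article_times(html):
    """Return a dict mapping each meta property found to its date string."""
    parser = ArticleTimeParser()
    parser.feed(html)
    return parser.times


times = extract_article_times(SAMPLE_HTML)
print(times)
```

With this input the two properties map to different days, so a correct extractor should be able to report them separately.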
Hi @0xFlo, thanks for your feedback. There does indeed seem to be a problem with article:modified_time.
My guess is that the property needs to be added to the existing list: https://github.com/adbar/htmldate/blob/b29677683f1a311367114a636dfcd76725b663b7/htmldate/core.py#L84
The cause turned out to be something else, and it will be fixed by the PR above.
Description of the Issue: When using the htmldate library to extract both the original publication date and the most recent update date from web pages, the function find_date returns the same date for both, even though the HTML source of the pages clearly contains different dates for the original publication and last modification.
Steps to Reproduce:
Expected Behavior: The function should return distinct dates for the original publication and the most recent update (if available), based on the webpage's metadata.
Actual Behavior: The function returns the same date for both the original publication and the most recent update.
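To make the expected behavior concrete, here is a small sketch of the selection rule the report has in mind: given several candidate dates found on a page, the earliest is treated as the original publication date and the latest as the most recent update. The helper pick_date and its original_date flag are hypothetical and only illustrate the expectation. This is not a description of htmldate's actual heuristics.

```python
from datetime import date


def pick_date(candidates, original_date):
    """Pick the earliest (original) or latest (update) date among ISO date strings.

    Hypothetical helper for illustration only.
    """
    parsed = sorted(date.fromisoformat(c) for c in candidates)
    if not parsed:
        return None
    chosen = parsed[0] if original_date else parsed[-1]
    return chosen.isoformat()


# The two dates from the meta tags reported in this thread
candidates = ["2020-07-21", "2021-04-06"]
published = pick_date(candidates, original_date=True)
updated = pick_date(candidates, original_date=False)
print(published, updated)
```

Under this rule the two calls return different values whenever the page exposes more than one distinct date, which is the behavior the report expects.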
Example Code:
Possible Causes: The library might not be parsing certain HTML meta tags correctly. There could also be an issue with the heuristic approach used to differentiate between the dates.
I hope this information helps in diagnosing and resolving the issue. Thank you for your assistance!