adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
https://htmldate.readthedocs.io
Apache License 2.0

Inaccurate Extraction of Original and Updated Publication Dates #119

0xFlo closed this issue 6 months ago

0xFlo commented 6 months ago

Description of the Issue: When htmldate's find_date function is used to extract both the original publication date and the most recent update date from a web page, it returns the same date for both, even though the page's HTML source clearly contains different dates for original publication and last modification.

Steps to Reproduce:

  1. Use the find_date function from the htmldate library to extract the publication date from a URL.
  2. Call find_date twice for each URL, first with original_date=True to get the original publication date, and then with original_date=False to get the updated date.
  3. Compare the results.

Expected Behavior: The function should return distinct dates for the original publication and the most recent update (if available), based on the webpage's metadata.

Actual Behavior: The function returns the same date for both the original publication and the most recent update.

Example Code:

from htmldate import find_date

# Example URL
url = 'https://insighttimer.com/blog/what-happens-when-you-open-your-third-eye/'

# Attempt to extract original publication date
original_date = find_date(url, original_date=True)

# Attempt to extract most recent update date
updated_date = find_date(url, original_date=False)

print(f'Original Date: {original_date}')
print(f'Updated Date: {updated_date}')

Possible Causes: The library might not be parsing certain HTML meta tags correctly. There could be an issue with the heuristic approach used to differentiate between the dates.

I hope this information helps in diagnosing and resolving the issue. Thank you for your assistance!

0xFlo commented 6 months ago

When opening the source of the page you can see that there are two dates:

<meta property="article:published_time" content="2020-07-21T00:17:28+00:00" />
<meta property="article:modified_time" content="2021-04-06T06:32:14+00:00" />

However, the script only extracts the original date and ignores the modified date:

Original publication date for https://insighttimer.com/blog/what-happens-when-you-open-your-third-eye/: 2020-07-21
Most recent update date for https://insighttimer.com/blog/what-happens-when-you-open-your-third-eye/: 2020-07-21
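To confirm what the page metadata itself says, independently of htmldate, the two tags quoted above can be read with the standard library alone. This is only a minimal sketch: it assumes the HTML is exactly the snippet quoted above, and MetaDateParser is a hypothetical helper written for this check, not part of htmldate.

```python
from datetime import datetime
from html.parser import HTMLParser

# HTML snippet copied from the page source quoted above.
HTML = '''<meta property="article:published_time" content="2020-07-21T00:17:28+00:00" />
<meta property="article:modified_time" content="2021-04-06T06:32:14+00:00" />'''


class MetaDateParser(HTMLParser):
    """Hypothetical helper (not part of htmldate): collects article:* time tags."""

    WANTED = ("article:published_time", "article:modified_time")

    def __init__(self):
        super().__init__()
        self.dates = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        content = attributes.get("content")
        if prop in self.WANTED and content:
            # Keep only the calendar date (YYYY-MM-DD) for comparison.
            self.dates[prop] = datetime.fromisoformat(content).date().isoformat()


parser = MetaDateParser()
parser.feed(HTML)
published = parser.dates.get("article:published_time")
modified = parser.dates.get("article:modified_time")
print(f"Published: {published}")
print(f"Modified: {modified}")
```

For the snippet above this prints 2020-07-21 as the published date and 2021-04-06 as the modified date, so the two values in the metadata do differ, and a call with original_date=False would be expected to return the later one.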

adbar commented 6 months ago

Hi @0xFlo, thanks for your feedback. There does indeed seem to be a problem with article:modified_time.

My guess is that the property needs to be added to the existing list: https://github.com/adbar/htmldate/blob/b29677683f1a311367114a636dfcd76725b663b7/htmldate/core.py#L84

adbar commented 6 months ago

It was something else and will be fixed by the PR above.