adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
https://htmldate.readthedocs.io
Apache License 2.0

error: redefinition of group name 'm' as group 5; was group 2 at position 116 #54

Closed kinoute closed 2 years ago

kinoute commented 2 years ago

Hello there,

Thanks for this great project! I encountered a problem while crawling different websites and trying to extract dates with this package, in particular on this URL: https://osmh.dev

Here is the error using IPython and Python 3.8.12:

# works
In [3]: from htmldate import find_date

In [4]: find_date("https://osmh.dev")
Out[4]: '2020-11-29'

# doesn't work
In [6]: find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')

The last example throws an error:

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-6-9988648ad55b> in <module>
----> 1 find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
    653
    654     # try time elements
--> 655     time_result = examine_time_elements(
    656         search_tree, outputformat, extensive_search, original_date, min_date, max_date
    657     )

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in examine_time_elements(tree, outputformat, extensive_search, original_date, min_date, max_date)
    389                         return attempt
    390                 else:
--> 391                     reference = compare_reference(reference, elem.get('datetime'), outputformat, extensive_search, original_date, min_date, max_date)
    392                     if reference > 0:
    393                         break

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in compare_reference(reference, expression, outputformat, extensive_search, original_date, min_date, max_date)
    300     attempt = try_expression(expression, outputformat, extensive_search, min_date, max_date)
    301     if attempt is not None:
--> 302         return compare_values(reference, attempt, outputformat, original_date)
    303     return reference
    304

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/validators.py in compare_values(reference, attempt, outputformat, original_date)
    110 def compare_values(reference, attempt, outputformat, original_date):
    111     """Compare the date expression to a reference"""
--> 112     timestamp = time.mktime(datetime.datetime.strptime(attempt, outputformat).timetuple())
    113     if original_date is True and (reference == 0 or timestamp < reference):
    114         reference = timestamp

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime_datetime(cls, data_string, format)
    566     """Return a class cls instance based on the input string and the
    567     format string."""
--> 568     tt, fraction, gmtoff_fraction = _strptime(data_string, format)
    569     tzname, gmtoff = tt[-2:]
    570     args = tt[:6] + (fraction,)

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime(data_string, format)
    331         if not format_regex:
    332             try:
--> 333                 format_regex = _TimeRE_cache.compile(format)
    334             # KeyError raised when a bad format is found; can be specified as
    335             # \\, in which case it was a stray % but with a space after it

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in compile(self, format)
    261     def compile(self, format):
    262         """Return a compiled re object for the format string."""
--> 263         return re_compile(self.pattern(format), IGNORECASE)
    264
    265 _cache_lock = _thread_allocate_lock()

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in compile(pattern, flags)
    250 def compile(pattern, flags=0):
    251     "Compile a regular expression pattern, returning a Pattern object."
--> 252     return _compile(pattern, flags)
    253
    254 def purge():

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in _compile(pattern, flags)
    302     if not sre_compile.isstring(pattern):
    303         raise TypeError("first argument must be string or compiled pattern")
--> 304     p = sre_compile.compile(pattern, flags)
    305     if not (flags & DEBUG):
    306         if len(_cache) >= _MAXCACHE:

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_compile.py in compile(p, flags)
    762     if isstring(p):
    763         pattern = p
--> 764         p = sre_parse.parse(p, flags)
    765     else:
    766         pattern = None

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in parse(str, flags, state)
    946
    947     try:
--> 948         p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
    949     except Verbose:
    950         # the VERBOSE flag was switched on inside the pattern.  to be

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse_sub(source, state, verbose, nested)
    441     start = source.tell()
    442     while True:
--> 443         itemsappend(_parse(source, state, verbose, nested + 1,
    444                            not nested and not items))
    445         if not sourcematch("|"):

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse(source, state, verbose, nested, first)
    829                     group = state.opengroup(name)
    830                 except error as err:
--> 831                     raise source.error(err.msg, len(name) + 1) from None
    832             sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
    833                            not (del_flags & SRE_FLAG_VERBOSE))

error: redefinition of group name 'm' as group 5; was group 2 at position 116

adbar commented 2 years ago

Hi @kinoute, I cannot reproduce the bug; I think it has to do with your setup. The error log hints at another function, also named strptime, that interferes with datetime's strptime function.

kinoute commented 2 years ago

Here is a one-liner to reproduce the error using the vanilla official Python Docker image:

docker run --rm python:3.8.12 /bin/bash -c "pip3 install htmldate; python3 -c \"from htmldate import find_date; find_date('https://osmh.dev', extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')\""

adbar commented 2 years ago

Thanks, I can see the problem now.
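
For readers following the traceback: the failing call is the standard library's datetime.datetime.strptime, which turns each format directive into a named regex group. The reported outputformat uses %m (month) twice, once in the date part and once where %M (minutes) was presumably intended, which would explain the "redefinition of group name 'm'" message. Below is a minimal sketch of a check for this, independent of htmldate. The helper name format_round_trips is hypothetical and not part of the library, and the sample date is arbitrary.

```python
import datetime
import re


def format_round_trips(fmt: str) -> bool:
    """Return True if a datetime rendered with fmt can be parsed back with fmt.

    Hypothetical helper for illustration only; not part of htmldate.
    """
    sample = datetime.datetime(2020, 11, 29, 12, 34, 56)
    text = sample.strftime(fmt)
    try:
        datetime.datetime.strptime(text, fmt)
    except (re.error, ValueError):
        # re.error: strptime could not compile its internal regex,
        #           e.g. because a directive such as %m appears twice.
        # ValueError: the rendered text does not match the format.
        return False
    return True


# Format with %M for minutes.
print(format_round_trips("%Y-%m-%d %H:%M:%S"))
# Format from the report, with %m used for both month and "minutes".
print(format_round_trips("%Y-%m-%d %H:%m:%S"))
```

With the Python 3.8 interpreter from the report, the second call would be expected to fail inside strptime and therefore return False. Validating a user-supplied output format this way before using it for parsing is one possible guard; whether htmldate's fix works like this is not stated in the thread.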

adbar commented 2 years ago

@kinoute it's fixed, I will ship a new release very soon.

Please note that changing extraction granularity affects the result for the case you mention: