fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Cast get_max_url_file_name_length result to int #267

Closed Medno closed 3 months ago

Medno commented 3 months ago

The result of the get_max_url_file_name_length function is missing a cast to int. The division max_size_per_occurrence = max_size / number_occurrences is a true division, so the computed size ends up as a float.

Without the cast, this may break the append_md5_if_too_long function, which uses that value to slice a str. I spotted this during a run with a long absolute path, which failed with the following error:

  File "[...]/.venv/lib/python3.10/site-packages/newsplease/helper_classes/savepath_parser.py", line 99, in append_md5_if_too_long
    return "%s_%s" % (component[:component_size], hashlib.md5(component.encode("utf-8")).hexdigest())

Here component_size is a float, which breaks the slicing, since slice indices must be integers.
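
A minimal sketch of the failure and of the proposed cast, for illustration only. The wrapper `truncate_with_md5` and the values of `max_size` and `number_occurrences` are hypothetical; only the slicing expression is taken from the traceback above, and this is not the library's actual code.

```python
import hashlib


def truncate_with_md5(component, component_size):
    # Same expression as the line shown in the traceback; the wrapper itself is hypothetical.
    return "%s_%s" % (
        component[:component_size],
        hashlib.md5(component.encode("utf-8")).hexdigest(),
    )


# Hypothetical values, chosen only to illustrate the problem.
max_size = 255
number_occurrences = 2

# In Python 3, "/" is true division and always returns a float.
as_float = max_size / number_occurrences
# Casting the result to int gives a value that is valid as a slice index.
as_int = int(max_size / number_occurrences)

component = "a" * 300

try:
    truncate_with_md5(component, as_float)
    float_slice_failed = False
except TypeError:
    # Slicing a str with a float index raises TypeError.
    float_slice_failed = True

# With the int value, the slice works and the MD5 hex digest is appended.
shortened = truncate_with_md5(component, as_int)
```

Note that `/` returns a float even when the operands divide evenly, so the cast is needed regardless of the actual sizes involved.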