adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0
3.67k stars, 263 forks

ValueError in xml #681

Closed: Honesty-of-the-Cavernous-Tissue closed this issue 3 months ago

Honesty-of-the-Cavernous-Tissue commented 3 months ago

trafilatura: 1.12.1

Raised by: https://raw.githubusercontent.com/Honesty-of-the-Cavernous-Tissue/trafilatura/master/tests/test.html

Error: ValueError: invalid literal for int() with base 10: ''
Raised from: https://github.com/adbar/trafilatura/blob/14c79c062bc331632de7a164477b45522b2150d0/trafilatura/xml.py#L321
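
For context, this is the message Python's built-in `int()` produces when it is handed an empty string. The snippet below is a minimal sketch that is independent of Trafilatura's actual code: `safe_int` and its default value are hypothetical and only illustrate one defensive parsing pattern.

```python
def safe_int(value, default=1):
    """Hypothetical helper: parse an integer, falling back on empty or invalid input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


# Reproduce the message from the report with a bare int() call.
try:
    int("")
except ValueError as err:
    message = str(err)

print(message)
print(safe_int(""), safe_int("3"))
```

Whether a fallback value is appropriate depends on what the parsed number means in the real code, so this only shows the shape of the failure.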

adbar commented 3 months ago

I just edited your comment to replace the URL with the raw data, but I still cannot reproduce the bug with XML output. Do you use any particular options?

Honesty-of-the-Cavernous-Tissue commented 3 months ago

> I just edited your comment to replace the URL with the raw data, but I still cannot reproduce the bug with XML output. Do you use any particular options?

Sorry, I found out that it seems to be related to the Python version: my environment is 3.12.0, and there is no error in 3.9.18.

adbar commented 3 months ago

My bad, the bug occurs when Trafilatura is used from Python; the CLI suppresses the error.
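
The difference between the two entry points can be pictured roughly as follows. This is a sketch under stated assumptions, not Trafilatura's code: `extract_stub`, `cli_like_call`, and `library_call_raises` are hypothetical names, and how the real CLI handles exceptions internally is assumed here rather than taken from the source.

```python
import logging


def extract_stub(attr_value):
    """Hypothetical stand-in for a library call that fails on an empty attribute."""
    return int(attr_value)


def cli_like_call(attr_value):
    """Hypothetical CLI-style wrapper: log the failure and return None instead of raising."""
    try:
        return extract_stub(attr_value)
    except ValueError as err:
        logging.getLogger(__name__).warning("extraction failed: %s", err)
        return None


def library_call_raises(attr_value):
    """Return True if calling the stub directly lets the ValueError propagate."""
    try:
        extract_stub(attr_value)
    except ValueError:
        return True
    return False


print(cli_like_call(""))        # the wrapper swallows the error
print(library_call_raises(""))  # the direct call surfaces it
```

A wrapper of this kind would explain why the same input looks fine on the command line but fails when the function is called directly.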