clach04 / whatabagacack

❓👜💩❗ Experimental (incomplete) Python Wallabag API Server
GNU Affero General Public License v3.0

handle errors during scrape #10

Open clach04 opened 1 year ago

clach04 commented 1 year ago

The current implementation(s) stop with a traceback when a scrape fails.

Add an option to continue but log the problem pages?

I have seen some cases where the cause was an issue in trafilatura, which fails on some pages:
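One way the "continue but log" option could look is sketched below. It is only an illustration: `scrape_all`, its `dump` callable and the `continue_on_error` flag are hypothetical names, not part of whatabagacack or w2d.

```python
import logging

log = logging.getLogger("scrape")


def scrape_all(urls, dump, continue_on_error=True):
    """Call dump(url) for each URL and return (results, failures).

    `dump` is any callable that takes a URL and returns metadata
    (hypothetical parameter; pass in whatever does the real scrape).
    """
    results = {}
    failures = []
    for url in urls:
        try:
            results[url] = dump(url)
        except Exception as exc:  # broad on purpose: the extractor may raise anything
            if not continue_on_error:
                raise  # keep the current stop-with-traceback behaviour
            log.exception("scrape failed for %s", url)  # logs the full traceback
            failures.append((url, repr(exc)))
    return results, failures
```

With this shape, the caller would wrap the existing call, for example `dump=lambda u: w2d.dump_url(u, output_format=w2d.FORMAT_EPUB)`, and report `failures` at the end of the run.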

Traceback

Traceback (most recent call last):
  File "C:\code\py\w2d\py310venv\lib\site-packages\trafilatura\xml.py", line 194, in replace_element_text
    element.text = ''.join(['[', element.text, ']', '(', element.get('target'), ')'])
TypeError: sequence item 1: expected str instance, NoneType found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\code\py\whatabagacack\web2epub.py", line 141, in <module>
    sys.exit(main())
  File "C:\code\py\whatabagacack\web2epub.py", line 94, in main
    result_metadata = w2d.dump_url(url, output_format=w2d.FORMAT_EPUB)  # TODO more options (e.g. skip readability, etc.)
  File "c:\code\py\w2d\w2d\__init__.py", line 317, in dump_url
    result_metadata = process_page(html_text, url=url, output_format=output_format)
  File "c:\code\py\w2d\w2d\__init__.py", line 220, in process_page
    doc_metadata = trafilatura.bare_extraction(content, include_links=True, include_formatting=True, include_images=True, include_tables=True, with_metadata=True, url=url)
  File "C:\code\py\w2d\py310venv\lib\site-packages\trafilatura\core.py", line 743, in bare_extraction
    docmeta['text'] = xmltotxt(postbody, include_formatting, include_links)
  File "C:\code\py\w2d\py310venv\lib\site-packages\trafilatura\xml.py", line 236, in xmltotxt
    merge_with_parent(element, include_formatting, include_links)
  File "C:\code\py\w2d\py310venv\lib\site-packages\trafilatura\xml.py", line 213, in merge_with_parent
    full_text = replace_element_text(element, include_formatting, include_links)
  File "C:\code\py\w2d\py310venv\lib\site-packages\trafilatura\xml.py", line 197, in replace_element_text
    element.text = ''.join(['[', element.text, ']'])
TypeError: sequence item 1: expected str instance, NoneType found
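
The `TypeError` in the traceback above can be reproduced with the standard library alone: an XML element with no text has `text` set to `None`, and `str.join` rejects `None`. The sketch below assumes a link element tagged `ref` with a `target` attribute (the tag name is a guess; only `target` appears in the traceback), and `link_markdown` is a hypothetical helper, not trafilatura's actual fix.

```python
import xml.etree.ElementTree as ET

# A link element with a target but no text, e.g. a link wrapping only an image.
# The tag name "ref" is an assumption made for this illustration.
element = ET.fromstring('<ref target="https://example.com/"/>')


def link_markdown(el):
    """Format el as a Markdown link, treating missing text or target as empty."""
    text = el.text or ""
    target = el.get("target") or ""
    return "".join(["[", text, "](", target, ")"])


try:
    "".join(["[", element.text, "]"])  # same shape as the failing call
except TypeError as exc:
    error_message = str(exc)

safe = link_markdown(element)
```

Here `error_message` matches the message in the traceback, while `safe` falls back to an empty link label instead of raising.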

clach04 commented 2 months ago

Alternative idea (todo: new issue?): separate add and generate, and have workers for the scrape, using something like https://github.com/coleifer/huey or https://github.com/rq/rq (rather than an internal-only queue), ideally not using Redis...
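
To make the add/generate split concrete, here is a minimal sketch using only the standard library `sqlite3` module as a persistent job table. It does not use huey or rq; the `jobs` table and the `open_queue`, `add_url` and `run_pending` names are all hypothetical, and it assumes a single worker (several workers would need an atomic way to claim a job).

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',  -- pending | running | done | failed
    error TEXT
)
"""


def open_queue(path=":memory:"):
    """Open (and if needed create) the job table; use a file path to persist."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def add_url(conn, url):
    """The 'add' step: record the URL and return without scraping."""
    cur = conn.execute("INSERT INTO jobs (url) VALUES (?)", (url,))
    conn.commit()
    return cur.lastrowid


def run_pending(conn, scrape):
    """The 'generate' step: scrape each pending job and record failures."""
    processed = 0
    while True:
        row = conn.execute(
            "SELECT id, url FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return processed
        job_id, url = row
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
        conn.commit()
        try:
            scrape(url)
        except Exception as exc:  # a bad page must not stop the worker
            conn.execute(
                "UPDATE jobs SET status = 'failed', error = ? WHERE id = ?",
                (repr(exc), job_id),
            )
        else:
            conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        conn.commit()
        processed += 1
```

A failed scrape stays in the table with its error text, which also covers the "log problem pages" idea from the first comment.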