Open clach04 opened 1 year ago
Alternative idea (todo: new issue?): separate add and generate, and have workers for scrape, using something like https://github.com/coleifer/huey or https://github.com/rq/rq (rather than an internal-only queue), ideally not using Redis...
Current implementation(s) stop with a traceback on the first page that fails.
Add an option to continue but log the problem pages?
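Rough sketch of what the "continue but log" option could look like. This is not the project's actual code: `scrape_all`, `scrape_one` and `continue_on_error` are hypothetical names, and `scrape_one` stands in for whatever function currently fetches and extracts a single page.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("scrape")


def scrape_all(
    urls: Iterable[str],
    scrape_one: Callable[[str], str],  # hypothetical per-page scrape function
    continue_on_error: bool = True,    # hypothetical option name
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Scrape each URL; optionally keep going when a page fails.

    Returns (results, failures), where failures is a list of
    (url, repr(exception)) pairs for the problem pages.
    """
    results: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for url in urls:
        try:
            results[url] = scrape_one(url)
        except Exception as exc:
            if not continue_on_error:
                # Current behaviour: stop with the traceback.
                raise
            # Log the full traceback, remember the page, move on.
            logger.exception("Failed to scrape %s", url)
            failures.append((url, repr(exc)))
    return results, failures
```

The failures list could be written out at the end so the problem pages can be retried or reported upstream, while the default could stay as "stop on first error" if that is preferred.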
Seen some cases where the issue was in trafilatura, which has problems with some pages:
Traceback