Closed dmoklaf closed 2 years ago
Thanks, I can reproduce the bug.
@felipehertzer It seems we didn't test your PR thoroughly this summer. Could you have a look at it?
@dmoklaf @felipehertzer I just made sure such errors are caught, although it would be more elegant to fix them.
Hi @adbar @dmoklaf, thanks for alerting me. I've made the fix. The problem was that the page uses a different structure for the array when there is only one item, while the code expected the structure used for multiple items.
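To illustrate the kind of mismatch described above, here is a minimal sketch, not Trafilatura's actual code. It assumes the metadata is a JSON object whose `author` field may hold a single object, a bare string, or a list of objects; the helper names `as_list` and `author_names` and the sample data are hypothetical.

```python
import json


def as_list(value):
    """Wrap a lone item in a list so callers can always iterate."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def author_names(schema):
    """Collect author names whether 'author' is a dict, a string, or a list."""
    names = []
    for item in as_list(schema.get("author")):
        if isinstance(item, dict):
            name = item.get("name")
        elif isinstance(item, str):
            # A bare string instead of a dictionary: use it directly
            name = item
        else:
            name = None
        if name:
            names.append(name)
    return names


# Hypothetical inputs: one item, several items, and a plain string
single = json.loads('{"author": {"name": "Ada"}}')
many = json.loads('{"author": [{"name": "Ada"}, {"name": "Grace"}]}')
plain = json.loads('{"author": "Ada"}')

for sample in (single, many, plain):
    print(author_names(sample))
```

Normalizing to a list first keeps the rest of the code on a single path, and the explicit type check avoids calling dictionary methods on a string.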
My code extracts (using my own spider framework) the HTML content of this page:
https://paperswithcode.com/paper/revisiting-deep-learning-models-for-tabular/review/
My code then parses it with LXML and calls Trafilatura with the XML content tree (I do not let Trafilatura handle these tasks because my framework handles parallelism, content-encoding edge cases, and, most importantly, disk caching):
This crashes ONLY ON THIS WEB PAGE with this stack trace:
which indicates that, in this specific case, the "content" variable contains a string rather than a dictionary.