Open naktinis opened 3 months ago
@naktinis Yes, there is a reason: text that sits directly within a div (and in nothing else) is generally undesirable. It is always a tradeoff between precision and recall.
The easiest way I see is to add "div" manually in settings.TAG_CATALOG and re-install the package locally; it should then be propagated to the extractors. Does that solve your problem?
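To illustrate the idea of a tag catalog, here is a minimal standard-library sketch. It is not the library's actual implementation: `DEFAULT_CATALOG`, `CatalogTextExtractor`, `extract` and the sample HTML are all hypothetical names and data made up for this example. It only shows how text directly inside a tag that is missing from the catalog gets dropped, and how adding "div" brings it back (at the likely cost of more noise on real pages).

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a tag catalog; NOT the library's actual contents.
DEFAULT_CATALOG = frozenset({"p", "pre", "code", "blockquote"})

# Void elements never get a closing tag, so they are not tracked as open.
VOID_TAGS = frozenset({"br", "hr", "img", "input", "link", "meta"})


class CatalogTextExtractor(HTMLParser):
    """Keep text only when its nearest enclosing element is in the catalog."""

    def __init__(self, catalog):
        super().__init__()
        self.catalog = catalog
        self.stack = []   # currently open elements, innermost last
        self.chunks = []  # text fragments that were kept

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in self.catalog:
            self.chunks.append(text)


def extract(html, catalog=DEFAULT_CATALOG):
    """Return the kept text fragments joined by newlines."""
    parser = CatalogTextExtractor(catalog)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample, not the HTML from the report above.
SAMPLE = "<html><body><p>Intro paragraph.</p><div>Bare div text.</div></body></html>"

print(extract(SAMPLE))                             # div text is dropped
print(extract(SAMPLE, DEFAULT_CATALOG | {"div"}))  # div text is kept
```

With the default catalog only the paragraph survives; once "div" is included, the bare div text is returned as well. On real pages many layout containers are divs, which is presumably why leaving "div" out favors precision.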
I first noticed this when trying to extract text from the 1Password documentation and realized that code blocks were not being extracted.
I then reduced it to a minimal reproducible example.
Version tested:
1.10.0
HTML:
Extract call:
Output (does not include "This is a very important..."):
Is there a reason why the text within a <div> block is being ignored, and is there any way to change this behavior? Ideally, it wouldn't even be necessary to favor recall.