Closed: haolin96 closed this issue 2 months ago
One approach is to copy the HTML to a field before the HTML gets parsed (i.e. as a pre-parse handler). Something like this could do it (not tested):
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
  <restrictTo field="document.contentType">text/html</restrictTo>
  <pattern field="doc_html">.*</pattern>
</tagger>
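A side note on the `.*` pattern above, assuming the tagger applies standard `java.util.regex` semantics (this thread does not confirm that): by default `.` does not match line terminators, so `.*` would capture only the first line of a multi-line HTML document. The standalone sketch below shows the difference the `(?s)` (DOTALL) flag makes. The class name `DotAllDemo` and the helper `firstMatch` are made up for illustration and are not part of Norconex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllDemo {

    // Hypothetical helper: returns the first match of the regex in the text,
    // or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body><p>Hello</p></body>\n</html>";

        // Without DOTALL, '.' stops at the first line break.
        System.out.println(firstMatch(".*", html));

        // With the inline (?s) flag, '.' also matches line breaks,
        // so the whole document is captured.
        System.out.println(firstMatch("(?s).*", html));
    }
}
```

If the tagger does behave this way, a pattern such as `(?s).*` might be needed in the config to capture the full HTML, but that is untested here as well.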
I am also curious, so I will bump this. I tested the approach suggested above, but it did not work: it failed with several errors caused by deprecations. I am currently trying to implement a similar solution using the RegexTagger.
Perhaps the DOMPreserveTransformer will be helpful.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello,
I would like to extract all the HTML elements from the crawled websites and store them, but I can only get the text content inside them in the content field. I can't find where the elements such as