Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

I want to extract all the html information. #950

Closed: haolin96 closed this issue 2 months ago

haolin96 commented 6 months ago

Hello,

I would like to extract and store all the HTML elements from the websites I crawl, but I can only get the text content inside them, in the content field. I can't find where the HTML tags are being stripped out. Can you please help me achieve this?

ohtwadi commented 6 months ago

One approach is to copy the HTML to a field before the HTML gets parsed (i.e. as a pre-parse handler). Something like this could do it (not tested):

  <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <restrictTo field="document.contentType">text/html</restrictTo>
      <pattern field="doc_html">.*</pattern>
  </tagger>

hadupa commented 5 months ago

I am also curious, so I will bump this. I tested the approach suggested above, but it did not work: it failed with several errors caused by deprecations. I am currently trying to implement a similar solution using the RegexTagger.

https://opensource.norconex.com/importer/v3/apidocs/com/norconex/importer/handler/tagger/impl/RegexTagger.html
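
For anyone following along, a RegexTagger version of the earlier snippet might look roughly like the one below. This is an untested sketch, not a confirmed fix. The `handler` element, the `restrictTo` block with `fieldMatcher`/`valueMatcher`, and the `toField` attribute are assumptions about the v3 configuration format and should be checked against the Javadoc linked above. `doc_html` is just the field name reused from the earlier suggestion. The pattern uses `(?s).+` rather than `.*` so that `.` also matches newlines and no empty match is produced. If the handler reads large documents in chunks, the field may end up with more than one value.

```xml
<!-- Untested sketch: element and attribute names are assumptions to verify against the RegexTagger Javadoc. -->
<importer>
  <preParseHandlers>
    <handler class="com.norconex.importer.handler.tagger.impl.RegexTagger">
      <!-- Only apply to HTML documents. -->
      <restrictTo>
        <fieldMatcher>document.contentType</fieldMatcher>
        <valueMatcher>text/html</valueMatcher>
      </restrictTo>
      <!-- (?s) lets "." match newlines so the raw markup is copied before parsing. -->
      <pattern toField="doc_html">(?s).+</pattern>
    </handler>
  </preParseHandlers>
</importer>
```

The handler is placed under `preParseHandlers` because, as noted earlier in the thread, the copy has to happen before the HTML gets parsed into plain text.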

ohtwadi commented 5 months ago

Perhaps the DOMPreserveTransformer will be helpful.

stale[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.