Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Extract tags from a field using DOM tagger #571

Closed niozasg closed 5 years ago

niozasg commented 5 years ago

Hi, I am splitting an HTML document using DOMSplitter with an "img" selector to extract the img tags. After that, I am trying to get some attributes such as "alt" and "src" from the "content" field (which holds the extracted img tag) and store them in separate fields.

DOMTagger with "fromField" seems to do that, so I used it, but I get this message: "class com.norconex.importer.handler.tagger.impl.DOMTagger handler does not apply to: "..image_id". Before that, the debug log also shows this message: "DEBUG [ContentTypeDetector] Detected "text/plain" content-type for: ... "

I suspect that DOMSplitter creates a document with the "text/plain" content type, and that DOMTagger then cannot run on it because of the default restrictions, since the content type does not match "CommonRestrictions.domContentTypes()". I also noticed that when I use the "img" selector with DOMSplitter, it returns the tag in the "content" field without the closing "/>", which may be why DOMSplitter assigns the "text/plain" content type.

Can you suggest a solution to this? Thanks

essiembre commented 5 years ago

When not explicitly provided (e.g., from the HTTP response), the content type is detected. What you are using it on is no longer valid HTML at that stage (no <html>, <body>, etc.) so that is probably why.

The easiest would be to set your own restrictions. You may want to add "text/plain". Another option is to use the ReplaceTagger instead.

If none of these work for you, please share your config along with a sample document.

niozasg commented 5 years ago

Thanks for your quick response. For example, I use DOMSplitter to capture all images in the HTML. Here is an example img tag:

<img alt="Something" src="/templates/somtething/images/something.gif" width="105" height="110" hspace="15" vspace="5" border="0"/>

DOMSplitter creates many documents. The document for the example above has the value "content": "<img alt="Something" src="/templates/somtething/images/something.gif" width="105" height="110" hspace="15" vspace="5" border="0" " (without the closing "/>")

and the content type text/plain. I want to capture the "src" and "alt" attributes with DOMTagger, and I am using this XML configuration for the preParseHandlers:

<preParseHandlers>
    <splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
              selector="img"/>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger"
            fromField="content"
            sourceCharset="UTF-8">
        <dom selector="img" toField="alt" extract="[attr(alt)]" />
        <dom selector="img" toField="src" extract="[attr(src)]" />
    </tagger>
</preParseHandlers>

Can you tell me how to add text/plain to my own restrictions? Or could you give me a solution using ReplaceTagger, which I also find difficult to understand? Thanks again

essiembre commented 5 years ago

When getting attributes, drop the square brackets. Here is an example that should work with text/plain:

  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" fromField="content">
      <dom selector="img" toField="alt" extract="attr(alt)" />
      <dom selector="img" toField="src" extract="attr(src)" />
      <restrictTo field="document.contentType">text/plain</restrictTo>
  </tagger>

You may also try adding parser="xml" to the tagger tag if the default (html) does not work.

For the Replace tagger, it could look like this:

  <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
      <replace fromField="content" toField="alt" regex="true" wholeMatch="true">
          <fromValue>.*alt="(.*?)".*</fromValue>
          <toValue>$1</toValue>
      </replace>
      <replace fromField="content" toField="src" regex="true" wholeMatch="true">
          <fromValue>.*src="(.*?)".*</fromValue>
          <toValue>$1</toValue>
      </replace>
  </tagger>
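
To see what the two regular expressions above capture, here is a minimal standalone sketch in plain Java. It does not use any Norconex class: the class name `ImgAttrRegexSketch` and the helper `extract` are made up for illustration, and it assumes that `wholeMatch="true"` behaves like `Matcher.matches()` (the pattern must cover the entire field value). The sample value is the truncated img tag quoted earlier in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgAttrRegexSketch {

    // Hypothetical helper (not a Norconex API): returns capture group 1
    // when the regex matches the WHOLE value, or null when it does not.
    static String extract(String regex, String value) {
        Matcher m = Pattern.compile(regex).matcher(value);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Field value as reported in the thread, without the closing "/>".
        String content = "<img alt=\"Something\" "
                + "src=\"/templates/somtething/images/something.gif\" "
                + "width=\"105\" height=\"110\" hspace=\"15\" vspace=\"5\" "
                + "border=\"0\"";

        // Same expressions as the <fromValue> entries above.
        System.out.println(extract(".*alt=\"(.*?)\".*", content));
        // prints: Something
        System.out.println(extract(".*src=\"(.*?)\".*", content));
        // prints: /templates/somtething/images/something.gif
    }
}
```

Because the leading `.*` is greedy, a value containing several `alt="..."` occurrences would yield the last one. Also, `.` does not match line breaks by default, so a tag that spans multiple lines would not match as a whole.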

niozasg commented 5 years ago

Hi Pascal,

Thanks for your answer. Actually, I managed to use DOMTagger successfully only after I removed the "fromField" attribute. Norconex always gave me an error saying that the content field was empty.

essiembre commented 5 years ago

Is it possible you are mixing up the content of a document and a field called "content"? The document content/body is not stored in a field (unless you make it so). So by default, the DOMTagger will read the document content/body and not a field. You should only use fromField if you know your XML is stored in a field.

I am closing since you have a solution.