Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

DOMSplitter that also extracts title #54

Closed by OkkeKlein 7 years ago

OkkeKlein commented 7 years ago

A selector for both content and title would be nice to have.

essiembre commented 7 years ago

Can you elaborate on what you mean by providing an example?

OkkeKlein commented 7 years ago

<div id="diva21"><a href="javascript:void(0);" class="faqlink" onClick="reply_click('21')">This is a question?</a></div>
<div id="divb21" class="slidingDiv"><strong>This is a question?</strong><br>Yes. And this is the answer.</div>

So in this case the selectors would be "faqlink" for the title and "slidingDiv" for the content.

essiembre commented 7 years ago

I see. Do you have a tag surrounding these two divs (diva21 and divb21)? If so, one way to do it would be to split on that parent tag. Then, for each document obtained, you can extract the title and content separately.

essiembre commented 7 years ago

From the sample you sent me by email, it looks like the div pairs are not surrounded by a parent tag. Do you control the page content? If so, can you surround each pair with a div? E.g.:

<div class="faq">
  <div id="diva21" class="faqlink">... title ...</div>
  <div id="divb21" class="slidingDiv">... content ...</div>
</div>

If you can modify it to match the above, you can split on the "faq" div. You will then obtain a new document for each pair, and you can use the DOMTagger again (or another handler) to obtain the title and content separately.
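
A minimal sketch of what that could look like in the importer configuration (not tested). It assumes the page already has the "faq" wrapper shown above, and that the handlers are declared as pre-parse handlers so they still see the raw HTML. The class names and attributes are given as recalled from the Importer 2.x documentation, so check them against your version. The target field names "faqTitle" and "faqAnswer" are made up for illustration.

```xml
<!-- Sketch only (not tested). The field names "faqTitle" and "faqAnswer"
     are hypothetical; use whatever your application expects. -->
<preParseHandlers>
  <!-- Create one child document per <div class="faq"> wrapper. -->
  <splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
      selector="div.faq" />
  <!-- Copy the title and the answer of each pair into their own fields. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="div.faqlink" toField="faqTitle" />
    <dom selector="div.slidingDiv" toField="faqAnswer" />
  </tagger>
</preParseHandlers>
```

The tagger is listed after the splitter on the assumption that it is then applied to each child document rather than to the whole page.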

essiembre commented 7 years ago

If you cannot modify the page at the source, you can do so with a transformer, such as ReplaceTransformer as a pre-parse handler, before your DOMSplitter. Example (not tested):

<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
  <restrictTo field="document.contentType">text/html</restrictTo>
  <replace>
    <fromValue><![CDATA[(<div.*?class="faqlink".*?</div>.*?</div>)]]></fromValue>
    <toValue><![CDATA[<div class="faq">$1</div>]]></toValue>
  </replace>
</transformer>

This example will wrap each pair with <div class="faq">...</div> for easy splitting by the DOMSplitter.

Could this work for you?

OkkeKlein commented 7 years ago

I resolved the issue using the TextPatternTagger, but these workarounds are quite interesting.