Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How can I get the Raw HTML source in the committer phase? #279

Closed popthink closed 8 years ago

popthink commented 8 years ago

I'm trying to save 'Raw HTML Source' in the Committer Class.

But only the extracted text is passed as the InputStream argument of the committer method (queueAddition).

How can I get the raw HTML source code in the committer?

Thank you :)

essiembre commented 8 years ago

By default the Importer module will try to parse and extract the text from all content types. You can disable parsing for HTML files this way:

<importer>  
  <documentParserFactory>
      <ignoredContentTypes>text/html</ignoredContentTypes>
  </documentParserFactory>
</importer>

One possible drawback: disabling parsing will also disable extraction of the metadata fields found in your HTML documents. If you do not need those, then you are just fine. If you do want them extracted as separate fields, you can keep parsing HTML files but copy their content yourself into a field before parsing occurs, with a pre-parse handler, like this:

<importer>  
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <pattern field="myTargetField">.*</pattern>
      <restrictTo field="document.contentType">text/html</restrictTo>
    </tagger>
  </preParseHandlers>
</importer>

You will then find the raw HTML in the newly created metadata field myTargetField.
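If the goal is to have the committer send that raw HTML as the document content (rather than as an extra field), committers built on AbstractMappedCommitter can, if memory serves, redirect content from a field. A rough sketch, using the Solr committer as an example and the myTargetField name from above (verify the exact element against your committer's documentation):

```xml
<committer class="com.norconex.committer.solr.SolrCommitter">
  <!-- Hypothetical sketch: use the raw HTML stored in myTargetField
       as the committed document content instead of the parsed text.
       keep="false" drops the field itself after the copy. -->
  <sourceContentField keep="false">myTargetField</sourceContentField>
</committer>
```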

Let me know if one of these two approaches does it for you.

popthink commented 8 years ago

Thank you very much!

The second one is the solution I was looking for.

essiembre commented 8 years ago

Great!

dtcyad1 commented 5 years ago

Hi Pascal,

I would like to ask two questions:

  1. In this case, will the HTML content be sent twice? Once as the extracted text only, and then also as the full raw HTML?
  2. If we need all the raw content to be sent, not just HTML as in your example, do we just leave out the restrictTo element?

Keep in mind that we do need the metadata to be extracted, but after doing that we need just the raw content.

Thanks

essiembre commented 5 years ago

If you do not parse a document, the metadata it contains is normally not extracted. If you want the metadata extracted but want to keep the HTML intact, here is a way you can do it.

dtcyad1 commented 5 years ago

Hi Pascal,

Thanks, I have gotten it working as you suggested. How can we do the same with binary content like PDF files? I need the pure PDF file, and I also need any metadata associated with it, like the file properties.

Thanks -yogesh

On Jun 23, 2019, at 1:30 AM, Pascal Essiembre notifications@github.com wrote:

Use the TextPatternTagger as a pre-parse handler as described earlier in this thread. That will give you the HTML you want in the field of your choice. Parse the document as usual (nothing special there). That will give you all metadata extracted. Then either ignore the extracted text, or use a class like the SubstringTransformer as a post-parse handler to make the content a zero-length substring.
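Putting the steps from that reply together, a sketch of the Importer configuration could look like this (handler class names are those mentioned in the thread; the SubstringTransformer attributes are assumptions to check against its documentation):

```xml
<importer>
  <preParseHandlers>
    <!-- Step 1: copy the raw document content into a field
         before parsing occurs. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <pattern field="myTargetField">.*</pattern>
    </tagger>
  </preParseHandlers>
  <!-- Step 2: parsing happens as usual, extracting metadata fields. -->
  <postParseHandlers>
    <!-- Step 3: blank out the extracted text so only the raw copy
         in myTargetField remains (end="0" is an assumption). -->
    <transformer class="com.norconex.importer.handler.transformer.impl.SubstringTransformer" end="0"/>
  </postParseHandlers>
</importer>
```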

essiembre commented 5 years ago

This gets more tricky. The parsing is done by the Importer module, and its purpose is to extract text. What will you do with the PDF in the end? You can use the keepDownloads feature, but the saved files will not be linked to your documents.
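For context, keepDownloads is a crawler-level flag in the HTTP Collector that saves a copy of each downloaded file to disk. A minimal sketch (the crawler id is hypothetical):

```xml
<crawler id="myCrawler">
  <!-- Saves raw downloaded files under the crawler working directory.
       As noted above, these copies are not linked to the documents
       sent to your committer. -->
  <keepDownloads>true</keepDownloads>
</crawler>
```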

You can obviously write your own parser or parser wrapper to parse a copy and keep the original.

If you would like something like that natively supported by the HTTP Collector (or Importer), please open a new ticket and we will make it a feature request (given this one was about HTML and it is now closed).

In the meantime, you could look at using an ExternalTagger for PDF files, invoking a separate instance of the Norconex Importer to parse them and obtain metadata. This is more of a hack than a real solution, but it may work.
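A very rough sketch of that hack is below. The command path and token placeholders are assumptions, not verified syntax; consult the ExternalTagger documentation for the exact schema before using it:

```xml
<importer>
  <preParseHandlers>
    <!-- Hypothetical: for PDFs only, shell out to a script that runs a
         second Importer instance and returns extracted metadata. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.ExternalTagger">
      <restrictTo field="document.contentType">application/pdf</restrictTo>
      <command>/path/to/run-importer.sh ${INPUT} ${OUTPUT_META}</command>
    </tagger>
  </preParseHandlers>
</importer>
```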