Closed: popthink closed this issue 8 years ago.
By default the Importer module will try to parse and extract the text from all content types. You can disable parsing for HTML files this way:
<importer>
  <documentParserFactory>
    <ignoredContentTypes>text/html</ignoredContentTypes>
  </documentParserFactory>
</importer>
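If you need to skip parsing for more than one content type, the `ignoredContentTypes` value is treated as a regular expression (worth confirming against the GenericDocumentParserFactory documentation for your Importer version), so several types can be matched at once. A sketch:

```xml
<importer>
  <documentParserFactory>
    <!-- Regex matching every content type to skip parsing for
         (regex support assumed; verify against your version's docs) -->
    <ignoredContentTypes>text/html|application/xhtml\+xml</ignoredContentTypes>
  </documentParserFactory>
</importer>
```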
One possible drawback: disabling parsing will also disable the extraction of metadata fields found in your HTML documents. If you do not need those, you are fine. If you do want them extracted as separate fields, you can keep parsing HTML files but copy their content into a field yourself before parsing occurs, using a pre-parse handler like this:
<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <pattern field="myTargetField">.*</pattern>
      <restrictTo field="document.contentType">text/html</restrictTo>
    </tagger>
  </preParseHandlers>
</importer>
You will then find the raw HTML in the newly created metadata field myTargetField.
Let me know if one of these two approaches does it for you.
Thank you very much!
The second one is the solution I was looking for.
Great!
Hi Pascal,
I would like to ask two questions:
Keeping in mind that we do need the metadata to be extracted, but after doing that we need just the RAW content.
Thanks
If you do not parse a document, the metadata it contains is normally not extracted. If you want the metadata extracted but want to keep the HTML intact, here is a way you can do it:

1. Use the TextPatternTagger as a pre-parse handler, as described earlier in this thread. That will give you the HTML you want in the field of your choice.
2. Parse the document as usual (nothing special there). That will give you all metadata extracted.
3. Then, either ignore the extracted text or use a class like the SubstringTransformer
as a post-parse handler to make the content a zero-length substring.

Hi Pascal,
Thanks, I have gotten it working as you suggested. How can we do the same with binary content like PDF files? I need the pure PDF file, and I also need any metadata associated with it, like the file properties.
Thanks -yogesh
This gets more tricky. Parsing is done by the Importer module, and its purpose is to extract text. What will you do with the PDF in the end? You can use the keepDownloads feature, but it will not be linked to your documents.
You can obviously write your own parser or parser wrapper to parse a copy and keep the original.
If you would like something like that natively supported by the HTTP Collector (or Importer), please open a new ticket and we will make it a feature request (given this one was about HTML and it is now closed).
In the meantime, you could maybe look at using an ExternalTagger for PDF files, invoking a separate instance of Norconex Importer to parse and obtain metadata. This is more of a hack than a real solution but may possibly work.
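For illustration only, such an ExternalTagger invocation might be configured roughly as below. Everything here is an assumption to check against the Importer documentation for your version: the element names and the `${INPUT}` token follow the pattern of Norconex's external handlers, and the `extract-meta.sh` script is a hypothetical wrapper around a second Importer instance:

```xml
<importer>
  <preParseHandlers>
    <!-- Hypothetical sketch: run an external program against the raw file
         to obtain metadata while leaving the original PDF untouched.
         Element names and tokens are assumptions, not verified API. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.ExternalTagger">
      <restrictTo field="document.contentType">application/pdf</restrictTo>
      <command>/path/to/extract-meta.sh ${INPUT}</command>
    </tagger>
  </preParseHandlers>
</importer>
```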
I'm trying to save the raw HTML source in the Committer class.
But only the parsed 'text' is passed as the InputStream argument of the Committer method (queueAddition).
How can I get the raw HTML source code in the committer?
Thank you :)