Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Is it possible to keep html tag in .cntnt ? #22

Closed fensifan closed 8 years ago

fensifan commented 8 years ago

Hi Pascal,

I'm doing a small project with the Norconex HTTP Collector that fetches news from a big news website, keeping only the pages that have my city in the keywords metadata field.

The fetching works well, but the generated content files are raw text with all the links in it, which makes it hard to write a program that grabs the main article. Is there any way to keep the HTML tags in the content file?

I have already read issue #15. Sadly, I failed to code a custom HTML parser (I'll keep trying), and skipping the parsing of HTML files stops regexMetadataFilter from working.

Thanks!

essiembre commented 8 years ago

Normally you would rely on the Importer module to extract, transform, rename, or otherwise manipulate the content or metadata/fields of documents. When using a Collector, you would then use or create a Committer to save the result in the repository or format of your choice. In other words, you should ideally not have to reprocess files to extract content after a collector has imported them.

You say you have the information you want in the HTML metadata. That information should then be extracted and be part of the metadata fields of each document. Can you simply use that? If what you want is rather part of the content (the "article"), you can also have the Importer extract that and store it in a field of your choice for later consumption. Look at the Importer "taggers" and "transformers", all listed in this configuration page.

Once you have exactly the content/fields you want, I think your focus should be on writing a custom committer. The files saved by the Filesystem Committer should almost be considered an internal format for how the crawler stores its files, and it is not the most friendly. Rather than trying to read/parse these files for your own use, have a Committer store the documents just how and where you want them.

I think this would work better for you. But if you really want to do your own processing and have the HTML parsed but at the same time kept somewhere, there are things you can do. I would recommend simply copying the content into a field with a pre-parse handler. It could be done with a TextPatternTagger like this (not tested):

  <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <!-- replace .* with regex that will just extract your article if you want. -->
      <pattern field="myRawHtmlField">.*</pattern>
      <restrictTo field="document.contentType">text/html</restrictTo>
  </tagger>

But again, I would validate whether you really need to process the files as a separate step. That approach forces you to keep all files around and adds a separate process to your crawling solution that you should not need. When doing it in the Importer and using your own committer, there is no need to keep files around. It will also work better for incremental crawls (active by default), where only changes/additions/deletions are sent to your committer. In other words, you would not have to track which files your own process has already handled. I think it would make your life easier, especially once you start dealing with a large number of documents.

Make sense?

fensifan commented 8 years ago

First, I want to apologise for my bad English (I'm Chinese). I get your idea: I should use the Importer to extract the useful information from the HTML files, rather than keep everything and do my own parsing. That helps me a lot, thank you.

> You say you have the information you want in the HTML metadata. That information should be extracted and be part of the metadata fields for each documents then. Can you simply use that?

Actually, I am using it: I use the keywords in the metadata to filter which pages should be collected. My worry was that keeping all the HTML tags would mean the metadata is not parsed, and hence the filter would stop working. But since, as you said, the Importer can get things done, there is no point in skipping the parsing of the HTML files, so I don't have to worry about that anymore.

> If it is rather part of the content you want (the "article"), you can also have the importer extract that and store it in a field of your choice for later consumption. Look at the Importer "taggers" and "transformers" all listed in this configuration page.

What I am trying to achieve is to keep the content in the <p> tags of an HTML page while getting rid of all the other tags. Are there any suggestions as to which tagger or transformer is specifically suitable for this situation? Thanks a lot for your help.

essiembre commented 8 years ago

If you want to modify the content itself, you could try using the ReplaceTransformer and do some magic with regular expressions.
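As a rough, untested sketch of the ReplaceTransformer route: the snippet below assumes the Importer 2.x `<replace>` / `<fromValue>` / `<toValue>` syntax, that it is registered as a pre-parse handler, and that the replacement value honors regex group references such as `$1` (worth verifying against the documentation for your version). The regular expression is only an illustration; it unwraps links while keeping their text.

```xml
<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
    <replace>
        <!-- Illustrative regex: matches an <a> element and captures its inner text. -->
        <fromValue><![CDATA[<a\b[^>]*>(.*?)</a>]]></fromValue>
        <!-- Assumption: group references are supported in the replacement value. -->
        <toValue>$1</toValue>
    </replace>
    <restrictTo field="document.contentType">text/html</restrictTo>
</transformer>
```

Regex-based rewriting of HTML is fragile with nested or malformed markup, so this is best kept to simple, predictable patterns.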

If you want to extract content and store it in a specific field, you can use TextBetweenTagger in a similar way as a pre-parse handler (not tested):

  <tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger" inclusive="false">
      <textBetween name="myParagraphsField">
          <start><![CDATA[<p>]]></start>
          <end><![CDATA[</p>]]></end>
      </textBetween>
      <restrictTo field="document.contentType">text/html</restrictTo>
  </tagger>

This assumes your <p> tags are always opened and closed properly, which is often not the case in web pages. The DOMTagger may work better, since it attempts to fix bad HTML before you deal with it. It may work this way (not tested):

  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="p" toField="myParagraphsField" extract="text" />
      <restrictTo field="document.contentType">text/html</restrictTo>
  </tagger>

Since you will likely have multiple values for your new field, you can "flatten" it to become a single value with ForceSingleValueTagger.
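A minimal, untested sketch of that flattening step could look like this. It assumes the Importer 2.x `singleValue` syntax with a `mergeWith:` action, reuses the `myParagraphsField` name from the examples above, and uses an arbitrary " | " separator. It would need to be declared after the tagger that populates the field.

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
    <!-- Merge all values of the field into one, joined by the chosen separator. -->
    <!-- Other actions, such as keepFirst or keepLast, keep a single value instead. -->
    <singleValue field="myParagraphsField" action="mergeWith: | "/>
</tagger>
```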

Another approach, if you own the content, is to add special tags or comments to your pages to identify your header, footer, etc., and strip those. Alternatively, you can mark your content with a special tag (or comment) and extract just that.
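For the stripping variant, an untested sketch using StripBetweenTransformer as a pre-parse handler might look like the following. The `HEADER_START` / `HEADER_END` comment markers are hypothetical; they stand for whatever markers you would add to your own pages. The `stripBetween` syntax is assumed from Importer 2.x.

```xml
<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
    <stripBetween>
        <!-- Hypothetical markers placed around the page header in your own templates. -->
        <start><![CDATA[<!-- HEADER_START -->]]></start>
        <end><![CDATA[<!-- HEADER_END -->]]></end>
    </stripBetween>
    <restrictTo field="document.contentType">text/html</restrictTo>
</transformer>
```

Additional `stripBetween` blocks with their own markers could cover the footer, navigation, and so on.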

Let me know if this works for you.

fensifan commented 8 years ago

DOMTagger works great for me. Thank you for your help!

essiembre commented 8 years ago

No problem!