Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

question - itemscope and itemtype #47

Closed danizen closed 7 years ago

danizen commented 7 years ago

Using the technique in #44, I discovered I didn't need to do anything to extract schema.org metadata, because either Norconex importer or Tika will create metadata for objects within an itemscope.

My question is just whether this is Norconex or Tika. This affects my evaluation for my boss: if Tika, we had it either way; if Norconex, it is another strong win for Norconex over Scrapy, and #44 is providing an environment much like the Scrapy shell.

essiembre commented 7 years ago

It is a mix of both, but likely Tika for what you mention. Metadata and fields contained within the document itself are normally extracted by Tika parsers (for the vast majority of files). You can find out which fields a parser extracts for each document by adding two DebugTaggers that print all fields: one before and one after parsing occurs (last in the pre-parse handlers and first in the post-parse handlers).

danizen commented 7 years ago

DebugTagger before and after is too much looking at logs for me - I'd probably just use jjs to run them. Python with JPype also works great, but requires more infrastructure for users than just Java 8.

danizen commented 7 years ago

This doesn't yet work very well for preParseHandlers, but maybe I can make it do so by providing examples.

essiembre commented 7 years ago

What does not work well as preParseHandlers? DebugTagger? It should work just as fine, except you probably do not have many fields gathered at that point.

danizen commented 7 years ago

Here is what I'm having trouble with:

<preParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="*[itemtype^='http://schema.org']" toField="schemaorg_itemtype" extract="attr(itemtype)" overwrite="true"/>
  </tagger>
</preParseHandlers>

I've even tried to make it very simple to test, and defined it both preParse and postParse:

<preParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="h2" toField="schemaorg_itemtype" overwrite="true" defaultValue="notfound"/>
  </tagger>
</preParseHandlers>

I'm testing these on https://davidwalsh.name/twitter-cards, saved as a file.

danizen commented 7 years ago

Follow-up - I've proved with JavaScript on top of the libraries that if I create a FileInputStream and parse the document with Jsoup, and then select on *[itemscope], it works, and DOMUtil.getElementValue() has no trouble extracting it. However, it doesn't seem to work in my importer.

I have gone ahead and used the DebugTagger finally, but it doesn't shed any light on the problem.
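
For readers who want a cross-check that does not depend on Jsoup or the Importer, the selector from the first config snippet (any element whose itemtype starts with http://schema.org, extracting the itemtype attribute) can be approximated with Python's standard library. This is only a rough sketch: the class and function names and the sample HTML are made up for illustration, and it does not reproduce DOMTagger's actual behavior.

```python
from html.parser import HTMLParser


class ItemtypeCollector(HTMLParser):
    """Collects itemtype attribute values that start with a given prefix."""

    def __init__(self, prefix="http://schema.org"):
        super().__init__()
        self.prefix = prefix
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # bare attributes such as itemscope.
        for name, value in attrs:
            if name == "itemtype" and value and value.startswith(self.prefix):
                self.values.append(value)


def extract_itemtypes(html, prefix="http://schema.org"):
    """Return every matching itemtype value, in document order."""
    collector = ItemtypeCollector(prefix)
    collector.feed(html)
    collector.close()
    return collector.values


# Hypothetical sample markup, not the actual page tested in this thread.
sample = """
<html><body>
  <article itemscope itemtype="http://schema.org/Article">
    <h2 itemprop="name">Twitter Cards</h2>
  </article>
  <div itemscope itemtype="https://example.org/Other"></div>
</body></html>
"""

print(extract_itemtypes(sample))
```

If the saved page yields values here but not in the Importer output, the difference is more likely in the Importer configuration or command-line flags than in the HTML itself.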

essiembre commented 7 years ago

Not sure what your problem is, as I saved your sample file and ran the Importer with your first config snippet above as a test, and it worked as expected. The generated *.meta file had this in it:

schemaorg_itemtype=http\://schema.org/Article^|~http\://schema.org/Article^|~http\://schema.org/Article^|~http\://schema.org/Article^|~http\://schema.org/Article

Are you not getting this? Can it be something else in your config modifying the content before it reaches the DOMTagger?
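
To compare results like the one above programmatically, the meta line can be split into individual values. The sketch below assumes, based only on the sample shown, that `^|~` separates multiple values and that `\:` is an escaped colon; neither is confirmed in this thread, and the function name is hypothetical.

```python
def parse_meta_line(line, separator="^|~"):
    """Split a 'key=value' meta line into its key and list of values.

    Assumes (from the sample output only) that multiple values are joined
    with '^|~' and that colons are escaped as backslash-colon.
    """
    key, _, raw = line.partition("=")
    # Undo the backslash escaping of ':' seen in the sample output.
    value = raw.replace("\\:", ":")
    return key.strip(), value.split(separator)


line = r"schemaorg_itemtype=http\://schema.org/Article^|~http\://schema.org/Article"
key, values = parse_meta_line(line)
print(key, values)
```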

danizen commented 7 years ago

It must be, or a version difference of some sort. I'll try commenting everything except this out.

danizen commented 7 years ago

I just found that this was caused by setting the content-type explicitly, like this -t "text/html; charset=UTF-8". Did I specify the Content-Type incorrectly?

Specifically, it worked with this command-line:

./importer.sh -c config/test-importer.xml -i testdata/davidwalsh.name-twitter-cards.htm -o importer.txt

Its behavior was unexpected with this command-line:

./importer.sh -c config/test-importer.xml -i testdata/davidwalsh.name-twitter-cards.htm -t 'text/html; charset=utf-8' -o importer.txt

This could well be a case of misunderstanding the CommonRestrictions stuff, but I need to understand why it didn't work.

essiembre commented 7 years ago

The character encoding does not go with the content type. Character encoding can be specified with the -e flag, as described in the command usage instructions (which you get when you do not specify any parameters, or here).

This being said, the Importer detects the content-type and charset. You normally use these flags only if you suspect the detection was wrong.
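
If the value comes from an HTTP-style Content-Type header, it can be split into the media type and the charset before being passed to the two flags separately. The following is a minimal sketch using Python's standard library; the function names are hypothetical, and only the -t and -e flags mentioned in this thread are assumed to exist.

```python
from email.message import Message


def split_content_type(header_value):
    """Split 'text/html; charset=UTF-8' into ('text/html', 'UTF-8')."""
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()   # lower-cased type/subtype
    charset = msg.get_param("charset")    # None when no charset is given
    return media_type, charset


def importer_flags(header_value):
    """Build separate -t and -e arguments from one header value."""
    media_type, charset = split_content_type(header_value)
    flags = ["-t", media_type]
    if charset:
        flags += ["-e", charset]
    return flags


print(importer_flags("text/html; charset=utf-8"))
```

As noted above, both flags are normally unnecessary because the Importer detects the content type and charset on its own.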

danizen commented 7 years ago

What about changing the charset and the MIME type? Suppose I have a Transformer that changes the charset and/or the MIME-type - should that transformer also update document.contentEncoding and/or document.contentType when it does that?

I'm almost done with a BoilerpipeTagger and BoilerpipeTransformer. The former I plan to use with the ArticleSentencesExtractor from boilerpipe to extract text into a separate field. The latter is just to complete the picture, but I want to do the right thing. Conceptually, using the BoilerpipeTransformer would lead to greater precision at the expense of recall, as it will remove text like "Skip to main content" by recognizing that this is not the main content.

essiembre commented 7 years ago

The document.contentType and document.contentEncoding fields hold the values initially detected, and they are put there for you to use if you need them. You can delete or change them if you wish. If you know you are changing the content's charset to a specific one, you can use the ReplaceTagger or ConstantTagger to update the field. But if you are not using these fields, there is no point. So it depends on your specific requirements.

Note that the parsers normally convert extracted text to UTF-8.

danizen commented 7 years ago

OK - thanks.

angelo337 commented 7 years ago

@danizen Hi there, could you please share with us the boilerpipe configuration for Norconex? I could not figure it out. Thanks a lot.

essiembre commented 7 years ago

@danizen if you want to answer @angelo337, I invite you to do it here: #48.