Closed by danizen 7 years ago
It is a mix of both, but likely Tika for what you mention. Metadata/fields contained within the document itself are normally extracted by Tika parsers (for the vast majority of files). You can find out what fields a parser extracts for each document by adding two DebugTaggers that print all fields: one before and one after parsing occurs (last in the pre-parse handlers and first in the post-parse handlers).
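For example, a minimal sketch of that placement (attributes omitted; by default DebugTagger logs all fields, though exact options depend on your Importer version):
<preParseHandlers>
<!-- ...your other pre-parse handlers here... -->
<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"/>
</preParseHandlers>
<postParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"/>
<!-- ...your other post-parse handlers here... -->
</postParseHandlers>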
Using DebugTagger before and after means too much log-reading for me; I'd probably just use jjs to run them. Python's JPype also works well, but it requires more infrastructure for users than just Java 8.
This doesn't yet work very well for preParseHandlers, but maybe I can make it do so by providing examples.
What does not work well as a preParseHandler? DebugTagger? It should work just as well, except you probably do not have many fields gathered at that point.
Here is what I'm having trouble with:
<preParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
<dom selector="*[itemtype^='http://schema.org']" toField="schemaorg_itemtype" extract="attr(itemtype)" overwrite="true"/>
</tagger>
</preParseHandlers>
I've even tried to make it very simple to test, defining it as both a pre-parse and a post-parse handler:
<preParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
<dom selector="h2" toField="schemaorg_itemtype" overwrite="true" defaultValue="notfound"/>
</tagger>
</preParseHandlers>
I'm testing these on https://davidwalsh.name/twitter-cards, saved as a file.
Follow-up: I've proved with JavaScript on top of the libraries that if I create a FileInputStream, parse the document with Jsoup, and then select on *[itemscope], it works, and DOMUtil.getElementValue() has no trouble extracting it. However, it doesn't seem to work in my importer.
I have finally gone ahead and used the DebugTagger, but it doesn't shed any light on the problem.
Not sure what your problem is, as I saved your sample file and ran the Importer with your first config snippet above as a test, and it worked as expected. The generated *.meta file had this in it:
schemaorg_itemtype=http\://schema.org/Article^|~http\://schema.org/Article^|~http\://schema.org/Article^|~http\://schema.org/Article^|~http\://schema.org/Article
Are you not getting this? Could it be something else in your config modifying the content before it reaches the DOMTagger?
It must be that, or a version difference of some sort. I'll try commenting out everything except this.
I just found that this was caused by setting the content type explicitly, like this: -t "text/html; charset=UTF-8". Did I specify the Content-Type incorrectly?
Specifically, it worked with this command-line:
./importer.sh -c config/test-importer.xml -i testdata/davidwalsh.name-twitter-cards.htm -o importer.txt
Its behavior was unexpected with this command-line:
./importer.sh -c config/test-importer.xml -i testdata/davidwalsh.name-twitter-cards.htm -t 'text/html; charset=utf-8' -o importer.txt
This could well be a case of misunderstanding the CommonRestrictions stuff, but I need to understand why it didn't work.
The character encoding does not go with the content type. The character encoding can be specified with the -e flag, as described in the command usage instructions (which you get when you do not specify any parameters, or here).
This being said, the Importer detects the content-type and charset. You normally use these flags only if you suspect the detection was wrong.
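For example, the failing invocation above would pass the charset separately, along these lines (a sketch assuming -e takes the charset name as its value):
./importer.sh -c config/test-importer.xml -i testdata/davidwalsh.name-twitter-cards.htm -t text/html -e UTF-8 -o importer.txt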
What about changing the charset and the MIME type? Suppose I have a Transformer that changes the charset and/or the MIME type: should that transformer also update document.contentEncoding and/or document.contentType when it does that?
I'm almost done with a BoilerpipeTagger and a BoilerpipeTransformer. The former I plan to use with the ArticleSentencesExtractor from boilerpipe, writing to a separate field. The latter is just to complete the picture, but I want to do the right thing. Conceptually, using the BoilerpipeTransformer would lead to greater precision at the expense of recall, as it will remove text like "Skip to main content" by recognizing that it is not the main content.
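Purely as a sketch of what I have in mind (hypothetical: these classes are still being written, so the package, class, and attribute names below are provisional):
<postParseHandlers>
<!-- hypothetical tagger; class and attribute names are provisional -->
<tagger class="my.package.BoilerpipeTagger" extractor="ArticleSentencesExtractor" toField="article_sentences"/>
</postParseHandlers>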
The document.contentType and document.contentEncoding fields are those initially detected, put there for you to use if you need them. You can delete or change them if you wish. If you know you are changing the content charset to a specific one, you can use the ReplaceTagger or ConstantTagger to update the field accordingly. But if you are not using these fields, there is no point. So it depends on your specific requirements.
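For example, a minimal sketch using ConstantTagger to record the new charset, assuming a transformer earlier in the chain re-encoded the content to UTF-8 (field name taken from above; verify the exact syntax against the ConstantTagger documentation):
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
<constant name="document.contentEncoding">UTF-8</constant>
</tagger>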
Note that the parsers normally convert extracted text to UTF-8.
OK - thanks.
@danizen Hi there, could you please share with us the boilerpipe configuration on Norconex? I could not figure it out. Thanks a lot.
@danizen if you want to answer @angelo337, I invite you to do it here: #48.
Using the technique in #44, I discovered I didn't need to do anything to extract schema.org metadata, because either the Norconex Importer or Tika will create metadata for objects within an itemscope.
My question is just whether this is Norconex or Tika. This affects my evaluation for my boss: if Tika, we had it either way; if Norconex, it is another strong win for Norconex over Scrapy, and #44 provides an environment much like the Scrapy shell.