restrictTo in second DOMTagger not working?

Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

https://opensource.norconex.com/crawlers

Apache License 2.0


restrictTo in second DOMTagger not working? #381

Closed: liar666 closed this issue 7 years ago

liar666 commented 7 years ago

Hi,

I've written a fairly complicated crawler that shows some strange behaviour. I've tried to reduce its size to make it more readable. You'll find its configuration attached.

The problem is the following:

If I deactivate the <restrictTo> in the 2nd DOMTagger, this tagger kicks in and works (it extracts the wanted fields), but it runs on too many pages, including the non-split pages, which is not wanted.
If I activate the <restrictTo> in the 2nd DOMTagger, this tagger never runs and only the common fields are extracted for each split.

I've checked the <restrictTo> regex countless times, so that is not the problem. Where could I be going wrong?

I would be very thankful for any help, since I've been stuck on this for 2 days and I'm now convinced it's a small mistake I cannot see by myself.

ResearchPerspectives.xml.txt

essiembre commented 7 years ago

You are not going crazy, that's a tricky one for sure. It is a known problem with the library used to do this (JSoup). When dealing with HTML, it normalizes it first. In doing so, it sometimes does some "cleanup" on bad HTML. When a fragment contains only a portion of a table, it is treated as invalid HTML and that fragment is ignored, so the selectors have nothing to match. One way around it is to tell JSoup to use an XML parser in such cases. Unfortunately, the DOMTagger does not presently support that.

There should be a new snapshot release soon with the ability to specify the parser to use. That should fix your problem. Stay tuned.

essiembre commented 7 years ago

Try the new snapshot release, which should have a solution for you. You can add parser="xml" to your second DOMTagger (with the <restrictTo> in place):

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="xml">
   ...

Let me know if that fixes it.
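
For illustration only, here is a fuller sketch of how the second tagger might look with both the restriction and the XML parser in place. The field name, regular expression, selector, and target field are hypothetical placeholders, not values from the attached configuration, so adapt them to your own setup:

```xml
<!-- Sketch only: the field name, regex, selector, and target field below are
     hypothetical placeholders, not values from the attached configuration. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="xml">
  <!-- Run this tagger only on documents whose field value matches the regex. -->
  <restrictTo caseSensitive="false" field="document.reference">.*example.*</restrictTo>
  <!-- With parser="xml", a bare table fragment is not "cleaned up" by the HTML
       normalization, so a selector on table cells can still match. -->
  <dom selector="td.title" toField="exampleTitle" />
</tagger>
```

The only change relative to an HTML-parsed tagger is the parser attribute; the <restrictTo> and <dom> entries stay as they were.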

liar666 commented 7 years ago

Indeed, that's a tricky one.

Wow, what a fast answer! Thanks a lot, I'll give it a try right now.

liar666 commented 7 years ago

That works like a charm, thanks a million times again!

essiembre commented 7 years ago

Np! Thanks for confirming.