Closed (liar666 closed 7 years ago)
You are not going crazy; that's a tricky one for sure. It is a known problem with the library used to do this (JSoup). When dealing with HTML, JSoup normalizes it first, and in doing so it sometimes "cleans up" bad HTML. A fragment that contains only a portion of a table is considered invalid HTML, so it is ignored. One way around this is to tell JSoup to use an XML parser in such cases. Unfortunately, the DOMTagger does not presently support that.
There should be a new snapshot release soon with the ability to specify the parser to use. That should fix your problem. Stay tuned.
Try this new snapshot release, which should have a solution for you. You can add parser="xml" to your second DOMTagger (with the <restrictTo> in place):
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="xml">
...
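For illustration only, here is a fuller sketch of what such a tagger could look like. It assumes the DOMTagger configuration layout documented for Norconex Importer 2.x; the restricted field, the regex, the selector, and the target field name are all hypothetical placeholders and are not taken from the attached configuration.

```xml
<!-- Hypothetical sketch: every value below except the class name and
     parser="xml" is a placeholder to be replaced with your own. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="xml">
  <!-- Only apply this tagger to documents whose reference matches the
       (placeholder) regular expression, e.g. the split documents. -->
  <restrictTo caseSensitive="false" field="document.reference">.*placeholder-split-pattern.*</restrictTo>
  <!-- Placeholder selector and field: extract the text of matching
       table cells into a metadata field. -->
  <dom selector="td.placeholder" toField="placeholderField" extract="text" />
</tagger>
```

With parser="xml", a table fragment produced by a split should be kept as-is instead of being dropped during HTML normalization, so the selector gets a chance to match.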
Let me know if that fixes it.
Indeed, that's a tricky one.
Wow, what a fast answer! Thanks a lot, I'll give it a try right now.
That works like a charm, thanks a million times again!
Np! Thanks for confirming.
Hi,
I've written a fairly complicated crawler that shows some strange behaviour. I've tried to reduce its size to make it more readable. You'll find its code attached.
The problem is the following:
- Without <restrictTo> in the 2nd DOMTagger, this tagger kicks in and works (it extracts the wanted fields), but it runs on too many pages, including the non-split pages, which is not wanted.
- With <restrictTo> in the 2nd DOMTagger, this tagger is never run and only the common fields are extracted for each split.
I've checked the <restrictTo> regex millions of times; this is not the problem. But where could I be mistaken?
I would be very thankful for any help, since I've been stuck on this for 2 days and I'm now convinced it's a small mistake I cannot see by myself.
ResearchPerspectives.xml.txt