Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and store it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Crawling only the first of multiple identical <section> elements #374

Closed: sveba closed this issue 7 years ago

sveba commented 7 years ago

Hi Pascal,

I have a site with multiple identical <section> tags on the same level. The content that I want to parse is in the first one. How can I do that?

I tried combinations of StripAfterTransformer, StripBeforeTransformer, and StripBetweenTransformer, but nothing did it. Am I missing something in the documentation? Is there some way to use a non-greedy regex?

Thanx

essiembre commented 7 years ago

Both StripBeforeTransformer and StripAfterTransformer will discard content before/after the first match. So if your <section> is the first one, it should work. Can you share your config?
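
For reference, a StripAfterTransformer config would look something along these lines (not tested; the regex is only a placeholder and may need adjusting for your markup):

<transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer" inclusive="true">
    <!-- Discards the first closing </section> tag and everything after it,
         keeping only the content up to the end of the first section
         (placeholder regex; adjust to your markup). -->
    <stripAfterRegex><![CDATA[</section>]]></stripAfterRegex>
</transformer>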

As an alternative, you can try ReplaceTransformer with a regex of your choice. For instance, something like this (not tested):

<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
    <replace>
        <fromValue><![CDATA[<section>(.*?)</section>]]></fromValue>
        <toValue>$1</toValue>
    </replace>
</transformer>

If you want to store the content directly to a field instead, you can also use a tagger, like TextPatternTagger.
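
For example, something along these lines (again not tested; "section.content" is just a sample field name, and the (?s) flag lets the dot match across line breaks):

<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
    <!-- Copies text matching the pattern into a metadata field
         ("section.content" is only an example name). -->
    <pattern field="section.content"><![CDATA[(?s)<section>.*?</section>]]></pattern>
</tagger>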

sveba commented 7 years ago

Sorry for the late response. I was offline for the weekend.

Here is the "problem": the StripAfterTransformer implementation takes the first 64K chunk and does the deletion after the matched part. Then the second chunk is processed the same way, but it does not know that there was already a match in the first chunk and treats it as brand new content. So if the second chunk has no match, it is simply concatenated to the first, stripped one, and so on.

I don't know if this is by design, but IMO it is misleading, and in practice the only thing you can do is increase maxReadSize. That is also not great, because you cannot know upfront how big your biggest document will be.

Proposal: maybe give the transformStringContent method a return value and additionally check it in the while loop of AbstractStringTransformer.

Setting maxReadSize to 500K solved my problem. Memory usage goes up, but who cares about that nowadays :)
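
In config terms, that is just the maxReadSize attribute on the transformer element, roughly like this (the regex is only a placeholder, not my actual one):

<transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer"
        inclusive="true" maxReadSize="500000">
    <!-- A larger maxReadSize means the whole document fits in a single chunk,
         so the strip happens in one pass (placeholder regex). -->
    <stripAfterRegex><![CDATA[</section>]]></stripAfterRegex>
</transformer>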

Cheers

essiembre commented 7 years ago

Many people, actually. :-) It is not rare to have setups with many crawlers and tons of threads on the same server, processing many documents at once. Memory can get eaten fast.

64K is a default that handles the majority of scenarios. We'll look at increasing it if it becomes an issue for too many people. The simplest solution is to change the default, as you did with maxReadSize. Another approach would be to build a transformer that reads the file as a stream instead of a String; unfortunately, regular expressions do not work well with streams.

We can make it a feature request if you would like a version that works with streams and does exact matches (as opposed to regex). Then the size would not be a concern.

For now, since this is by design and you found the solution that works for you, I am closing this issue.