Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

When using ParseHandlers what are the rules for escaping HTML characters inside of StripBetween transformer? #407

Closed. inpsolr closed this issue 6 years ago.

inpsolr commented 6 years ago

Hi there,

In the example you have given here, the < and > signs are escaped:

                    <!-- Strip content you do not want -->
                    <transformer class="com.norconex.importer.transformer.impl.StripBetweenTransformer"
                          inclusive="true" caseSensitive="false" >
                        <stripBetween>
                            <start>&lt;!-- whatever start text, like a comment --&gt;</start>
                            <end>&lt;!-- whatever end text, like a comment --&gt;</end>
                        </stripBetween>
                        <!-- multiple stripBetween tags allowed -->
                    </transformer> 

For example, would this work?

<start>&lt;div id="helloWorldStart"&gt;</start>

or should it be:

<start>&lt;div id=&quot;helloWorldStart&quot;&gt;</start>
inpsolr commented 6 years ago

I see in the example here that there is a newer way of specifying the HTML tags to match, using regex. I will try this: https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/transformer/impl/StripBetweenTransformer.html

essiembre commented 6 years ago

Are you referring to <![CDATA[ ]]>? If so, know that this is part of the XML standard, not something introduced by the HTTP Collector. It is a way to specify arbitrary content without worrying about escaping. You can find more information here: https://en.wikipedia.org/wiki/CDATA
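For instance, the escaped-entity example from the original question could be written with a CDATA section instead. A minimal sketch (the end marker shown is just an illustrative placeholder):

    <stripBetween>
        <!-- CDATA lets you paste raw HTML without escaping <, >, or " -->
        <start><![CDATA[<div id="helloWorldStart">]]></start>
        <end><![CDATA[<div id="helloWorldEnd">]]></end>
    </stripBetween>

Both forms are equivalent once the XML is parsed; CDATA simply avoids the need for entity escaping.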

Was your last attempt at specifying HTML tags successful?

inpsolr commented 6 years ago

I have been trying to get this working for the past couple of days and I think I am doing something fundamentally wrong. I have the following config snippet in the crawler's config.xml:

        <preParseHandlers>
            <transformer class="$transformerStripBetween" inclusive="true" >
                <restrictTo field="document.contentType">text/html</restrictTo>
                <stripBetween>
                    <start><![CDATA[<div class="content-area">]]></start>
                    <end><![CDATA[<div class="bottom">]]></end>
                </stripBetween>
            </transformer>
        </preParseHandlers>

However, the search results still match text that appears only between <div class="content-area"> and <div class="bottom"> on the page. Is there something I am missing?

essiembre commented 6 years ago

Do you want to skip the content-area? It looks like you are doing the opposite. StripBetweenTransformer is for removing content between tags. So I suggest you have one stripBetween matching your header and another one for your footer. Alternatively, you can look at ReplaceTransformer.
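A sketch of that two-entry approach (the <body> start and end markers are placeholder assumptions for whatever delimits the page's boilerplate):

    <transformer class="$transformerStripBetween" inclusive="true" >
        <restrictTo field="document.contentType">text/html</restrictTo>
        <!-- strip the page header, up to where the content area begins -->
        <stripBetween>
            <start><![CDATA[<body]]></start>
            <end><![CDATA[<div class="content-area">]]></end>
        </stripBetween>
        <!-- strip the footer, from the bottom section onward -->
        <stripBetween>
            <start><![CDATA[<div class="bottom">]]></start>
            <end><![CDATA[</body>]]></end>
        </stripBetween>
    </transformer>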

inpsolr commented 6 years ago

Hi @essiembre,

I am trying to not index the texts between the two tags <div class="content-area"> and <div class="bottom">.

It looks like you are doing the opposite.

I don't understand how I am doing the opposite.

StripBetweenTransformer is for removing content between tags. So I suggest you have one stripBetween matching your header and another one for your footer.

Could you please give me an example for this specific case? Shouldn't one stripBetween tag be enough?

I looked into ReplaceTransformer; I don't believe that is what I need for this particular situation, unless I replace the matched text with an empty string in <toValue>.
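For completeness, a hedged sketch of that ReplaceTransformer variant (the class name is from the Importer docs, but the regex and the placement of the regex attribute are assumptions; the (?s) flag makes the dot match across newlines in Java regex):

    <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer" >
        <!-- assumption: regex attribute goes on <replace>; an empty <toValue> deletes the match -->
        <replace regex="true">
            <fromValue><![CDATA[(?s)<div class="content-area">.*?<div class="bottom">]]></fromValue>
            <toValue></toValue>
        </replace>
    </transformer>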

essiembre commented 6 years ago

In which case it should work. Can you share a problematic URL as well as your config?

inpsolr commented 6 years ago

Hi @essiembre,

I have sent you an email with the sample HTML of the URL I am indexing, along with the config.xml for the crawler.

essiembre commented 6 years ago

I was able to reproduce. Your config is good. The issue is one of file size. Most transformers dealing with text limit how many characters they read at once, to avoid memory issues on very large files. You can easily fix this by adding a "maxReadSize" attribute with a large value. For example, this was tested to work:

<transformer maxReadSize="10000000" class="$transformerStripBetween" inclusive="true" >
...
</transformer>

The next release will likely increase the default max read size.

Please confirm.

essiembre commented 6 years ago

FYI, the latest snapshot release drastically increased the default maximum read size (default is now 10 million characters). Please re-open if you still have issues.

inpsolr commented 6 years ago

Sorry for the late reply. Yes, this change worked for me. Thank you very much. I will also test the new snapshot without explicitly specifying maxReadSize. Thank you for taking care of this so quickly.