Closed inpsolr closed 6 years ago
I see here, in the example, that there is a new way of specifying the html tags to match with regex. I will try this. https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/transformer/impl/StripBetweenTransformer.html
Are you referring to <![CDATA[ ]]>? If so, note that this is part of the XML standard, not something introduced by the HTTP Collector. It lets you specify any content without worrying about escaping. You can find more information here: https://en.wikipedia.org/wiki/CDATA
Was your last attempt at specifying HTML tags successful?
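To illustrate the CDATA point: an XML parser treats the two forms below as the same text, so either should be usable wherever the configuration expects a text value. This is only a sketch. The `<stripBetween>` wrapper is taken from the linked StripBetweenTransformer documentation, and the `<div class="header">` marker is a made-up placeholder, not something from a real page.

```xml
<!-- Form 1: CDATA section, markup characters are written literally. -->
<stripBetween>
  <start><![CDATA[<div class="header">]]></start>
  <end><![CDATA[</div>]]></end>
</stripBetween>

<!-- Form 2: the same text using XML entity escapes instead of CDATA. -->
<stripBetween>
  <start>&lt;div class="header"&gt;</start>
  <end>&lt;/div&gt;</end>
</stripBetween>
```

Writing the marker unescaped and without CDATA (a bare `<div class="header">` inside `<start>`) would not work, because the XML parser would read it as a nested element rather than as text.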
I have been trying to get this working for the past couple of days and I think I am doing something fundamentally wrong. I have the following config snippet in the crawler's config.xml:
<preParseHandlers>
  <transformer class="$transformerStripBetween" inclusive="true" >
    <restrictTo field="document.contentType">text/html</restrictTo>
    <stripBetween>
      <start><![CDATA[<div class="content-area">]]></start>
      <end><![CDATA[<div class="bottom">]]></end>
    </stripBetween>
  </transformer>
</preParseHandlers>
But the search results still include text that is only present between <div class="content-area"> and <div class="bottom"> in the page. Is there something that I am missing?
Do you want to skip the content-area? It looks like you are doing the opposite. StripBetweenTransformer is for removing content between tags. So I suggest you have one stripBetween matching your header and another one for your footer. Alternatively, you can look at ReplaceTransformer.
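As a rough sketch of the "one stripBetween for the header, another for the footer" suggestion, a single transformer could carry two `<stripBetween>` entries. The `<header>` and `<footer>` markers below are hypothetical placeholders and would need to be replaced with whatever actually delimits those regions in the crawled pages. `$transformerStripBetween` is the same variable used in the config earlier in this thread, and support for several `<stripBetween>` entries in one transformer is assumed from the Importer documentation.

```xml
<preParseHandlers>
  <transformer class="$transformerStripBetween" inclusive="true">
    <restrictTo field="document.contentType">text/html</restrictTo>
    <!-- Hypothetical header region: everything from <header> to </header>. -->
    <stripBetween>
      <start><![CDATA[<header>]]></start>
      <end><![CDATA[</header>]]></end>
    </stripBetween>
    <!-- Hypothetical footer region: everything from <footer> to </footer>. -->
    <stripBetween>
      <start><![CDATA[<footer>]]></start>
      <end><![CDATA[</footer>]]></end>
    </stripBetween>
  </transformer>
</preParseHandlers>
```

With inclusive="true", the matched start and end markers are removed along with the text between them.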
Hi @essiembre,
I am trying to not index the text between the two tags <div class="content-area"> and <div class="bottom">.
It looks like you are doing the opposite.
I don't understand how I am doing the opposite?
StripBetweenTransformer is for removing content between tags. So I suggest you have one stripBetween matching your header and another one for your footer.
Could you please give me an example for this specific case? Shouldn't one stripBetween tag be enough?
I looked into ReplaceTransformer; I don't believe that is what I need for this particular situation, unless I replace the text with an empty string in <toValue>.
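For reference, the "replace with an empty string" idea might look roughly like the sketch below. Only `<toValue>` is mentioned in this thread. The `$transformerReplace` variable is hypothetical (it stands for the ReplaceTransformer class name), and the `<replace>` and `<fromValue>` element names, as well as `<fromValue>` being treated as a regular expression, are assumptions that should be checked against the ReplaceTransformer documentation for the Importer version in use.

```xml
<preParseHandlers>
  <!-- $transformerReplace is a hypothetical variable for the ReplaceTransformer class. -->
  <transformer class="$transformerReplace">
    <restrictTo field="document.contentType">text/html</restrictTo>
    <replace>
      <!-- Assumed to be a regex; (?s) lets "." match across line breaks. -->
      <fromValue><![CDATA[(?s)<div class="content-area">.*?<div class="bottom">]]></fromValue>
      <!-- Empty replacement: the matched block is simply dropped. -->
      <toValue></toValue>
    </replace>
  </transformer>
</preParseHandlers>
```

The non-greedy `.*?` keeps the match from running past the first <div class="bottom"> that follows the start marker.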
In which case it should work. Can you share a problematic URL as well as your config?
Hi @essiembre,
I have sent you an email with the sample HTML of the URL I am indexing, along with the config.xml for the crawler.
I was able to reproduce. Your config is good. The issue is one of file size. Most transformers dealing with text limit the number of characters they read at once, to avoid memory issues on very large files. You can easily fix this by adding a "maxReadSize" attribute with a large value. For example, this was tested to work:
<transformer maxReadSize="10000000" class="$transformerStripBetween" inclusive="true" >
...
</transformer>
The next release will likely increase the default max read size.
Please confirm.
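Put together with the config posted earlier in this thread, the complete handler would presumably look like the following. Nothing here is new: it is the original snippet with the maxReadSize attribute added, and the value 10000000 is simply the one reported above as tested.

```xml
<preParseHandlers>
  <!-- maxReadSize raised so the whole page is read before stripping. -->
  <transformer maxReadSize="10000000" class="$transformerStripBetween" inclusive="true">
    <restrictTo field="document.contentType">text/html</restrictTo>
    <stripBetween>
      <start><![CDATA[<div class="content-area">]]></start>
      <end><![CDATA[<div class="bottom">]]></end>
    </stripBetween>
  </transformer>
</preParseHandlers>
```

If the start and end markers fall into different read chunks, the transformer cannot pair them, which would explain why the text between them was still indexed on large pages.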
FYI, the latest snapshot release drastically increased the default maximum read size (default is now 10 million characters). Please re-open if you still have issues.
Sorry for the late reply. Yes, this change actually worked for me. Thank you very much. I will also test the new snapshot without explicitly specifying maxReadSize. Thank you for taking care of this so quickly.
Hi there,
In the example you have given here, the < and > signs are escaped:
<start></start> and <end></end> tags? For example, would this work?
or should it be: