Closed: csaezl closed this issue 9 years ago.
The error is caused by a faulty regex. It needs to be .*/\.\./.*
Thank you!
Are you sure? It seems that the crawler doesn't reject them. As a regular expression, for example in Solr, */\.\./.* doesn't get the same result as .*/\.\./.*. The second doesn't get any result.
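The difference between the two expressions can be checked outside the crawler. A quick Python sketch (Java's Pattern class behaves the same way for these two cases; the example URL is mine):

```python
import re

# With ".*" on both ends the expression can match a whole URL:
good = re.compile(r".*/\.\./.*")
print(bool(good.fullmatch("http://example.com/a/../b.html")))  # True
print(bool(good.fullmatch("http://example.com/a/b.html")))     # False

# Without the leading ".", the "*" has nothing to repeat and the
# pattern is rejected as invalid (a syntax error, not "no results"):
try:
    re.compile(r"*/\.\./.*")
except re.error as e:
    print("invalid pattern:", e)
```

So */\.\./.* is not a weaker version of the corrected expression; it is simply not a valid regex at all, which is consistent with the error the crawler reported.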
You can't search in Solr with regex, just with wildcards: https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser
As for your links, I just told you how to fix the error; I haven't looked at the logic.
Aren't links with /../ normalized?
No normalization is performed by default, but many URL normalization options are offered using the urlNormalizer tag. For the /../ case, you can add this to your config:
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeDotSegments
  </normalizations>
</urlNormalizer>
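For illustration, removeDotSegments collapses "." and ".." path segments as described in RFC 3986, section 5.2.4. A rough Python equivalent (the function name and example URLs here are mine, not Norconex's):

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def remove_dot_segments(url):
    """Collapse "." and ".." segments in a URL path (rough RFC 3986 5.2.4)."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    normalized = posixpath.normpath(path) if path else path
    # normpath drops a trailing slash; put it back for non-root paths
    if path.endswith("/") and normalized != "/":
        normalized += "/"
    return urlunsplit((scheme, netloc, normalized, query, fragment))

print(remove_dot_segments("http://example.com/a/b/../c.html"))
# http://example.com/a/c.html
```

With this normalization in place, both the /../ variant and the collapsed variant of a URL resolve to the same reference before being queued.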
Typically though, you would want more than just this one. Have a look at the GenericURLNormalizer class for all configuration options. That class is a wrapper around a generic class found in Norconex Commons Lang called URLNormalizer. You can find a detailed description of each normalization option at that last link.
I stopped a crawler that was committing /../ web pages, added the urlNormalizer parameter, and resumed the run, but the crawler is still committing /../ pages, perhaps because it had already extracted them before the stop. Is that right? Is there a way of rejecting them without losing the crawled pages' information?
If you change your config after you have indexed some documents in Solr, it won't go into Solr to delete documents matching your new config changes.
Once your crawler finishes after you resumed it, one thing you may try is to set the orphan strategy to delete:
<orphansStrategy>DELETE</orphansStrategy>
That way, the next crawl you start will apply your normalization. When it is done, it will check if any URLs remain from the previous run, and will delete them all. Your old non-normalized URLs should be caught and deleted at that time.
I don't need the crawler to delete documents from Solr.
Since a running crawler has to be stopped and resumed to apply removeDotSegments to its config file, it is possible that some /../ URLs were extracted (without normalization) before the stop and then processed in the resumed run. This way, in the resumed run those URLs circumvent removeDotSegments and get indexed in Solr with /../.
Is there anything I can do to avoid it?
You can start fresh again instead of resuming. If you do not want to do this, and if those faulty URLs have not been committed yet (i.e. are still in the queue), I suppose you can try a few things:
Suggestion 1: Filter them out in the importer section:
<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
onMatch="exclude" field="document.reference" >
/\.\./
</filter>
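Whether the bare /\.\./ works here depends on whether the filter does a partial or a full match against the field value; the earlier fix of adding .* on both ends suggests full-match semantics elsewhere in the crawler. A small Python sketch of the two behaviors (the example URL is mine):

```python
import re

url = "http://example.com/a/../b.html"

# Partial-match semantics: the bare expression is enough.
print(bool(re.search(r"/\.\./", url)))            # True

# Full-match semantics (like Java's Pattern.matches): the
# expression must cover the whole value, hence ".*" anchors.
print(bool(re.fullmatch(r"/\.\./", url)))         # False
print(bool(re.fullmatch(r".*/\.\./.*", url)))     # True
```

If the filter turns out not to match, trying .*/\.\./.* in its place would be the first thing to test.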
Suggestion 2: Mimic this normalization exercise in the importer section. It could be tough to implement in XML. This example might work for one level only:
<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
  <replace fromField="document.reference" regex="true">
    <fromValue>(/[^./]+)(/\.\./)</fromValue>
    <toValue>/</toValue>
  </replace>
</tagger>
Not sure whether changing the document.reference value might cause other issues, though (something to keep an eye on).
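The replacement in Suggestion 2 can be traced with the same regex in Python. Note that [^./] excludes dots, so a path segment containing a dot before the /../ will not match, and each pass only collapses one level (the example URLs are mine):

```python
import re

# Same expression and replacement value as the ReplaceTagger config above.
pattern = r"(/[^./]+)(/\.\./)"

print(re.sub(pattern, "/", "http://example.com/a/../b.html"))
# http://example.com/b.html

# Two consecutive "/../" levels survive a single pass:
print(re.sub(pattern, "/", "http://example.com/a/b/../../c.html"))
# http://example.com/a/../c.html
```

This illustrates why the suggestion is flagged as working for one level only: nested dot segments would need repeated application of the tagger.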
Closing since there has been no feedback on the last answer for a while. Please create a new issue if you have additional questions.
I need the crawler to reject some URLs, those that include /../. I use the filter:
but get the error:
What is the proper syntax in this context?
I have other "include" filters: first the "exclude", followed by several "include" filters. One URL may match one "exclude" and one "include"; in this case I need the URL to be excluded. Is there any strategy to accomplish this (apart from defining the filter)?