Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem, and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

RegexReferenceFilter #109

Closed csaezl closed 9 years ago

csaezl commented 9 years ago

I need the crawler to reject some URLs, those that include /../. I use the filter:

#set($filterRegexRef  = "com.norconex.collector.core.filter.impl.RegexReferenceFilter")
...
<filter class="$filterRegexRef" onMatch="exclude">*/\.\./*</filter>      

but get the error:

com.norconex.collector.core.CollectorException: Cannot load crawler configurations.
...
Caused by: com.norconex.commons.lang.config.ConfigurationException: Could not instantiate object from configuration for node: filter key: null
...
Caused by: java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*/\.\./*
^

What is the proper syntax in this context?

I also have "include" filters: first the "exclude", followed by several "include" filters. One URL may match both an "exclude" and an "include". In that case I need the URL to be excluded. Is there any strategy to accomplish this (apart from defining the filter)?

OkkeKlein commented 9 years ago

The error is caused by a faulty regex. It needs to be .*/\.\./.*
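
Applied to the filter from your config, the corrected line would look like this (a sketch reusing the $filterRegexRef variable defined earlier):

<filter class="$filterRegexRef" onMatch="exclude">.*/\.\./.*</filter>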

csaezl commented 9 years ago

Thank you!

csaezl commented 9 years ago

Are you sure? It seems the crawler doesn't reject them. As a regular expression, for example in Solr, */\.\./.* doesn't give the same result as .*/\.\./.*. The second doesn't return any results.

OkkeKlein commented 9 years ago

You can't search in Solr with regex. Just with wildcards: https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser

As for your links, I just told you how to fix the error. I haven't looked at the logic.

Aren't links with /../ normalized?

essiembre commented 9 years ago

No normalization is performed by default, but many URL normalization options are offered using the urlNormalizer tag. For the /../, you can add this to your config:

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeDotSegments
  </normalizations>
</urlNormalizer>

Typically though, you would want more than just this one. Have a look at the GenericURLNormalizer class for all configuration options. GenericURLNormalizer is a wrapper around a generic class found in Norconex Commons Lang called URLNormalizer. You can find a detailed description of each normalization option at that last link.
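
For instance, a configuration combining several common normalizations could look like this (a sketch only; verify the option names against the URLNormalizer documentation before using):

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort, removeDotSegments
  </normalizations>
</urlNormalizer>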

csaezl commented 9 years ago

I stopped a crawler that was committing /../ web pages, added the urlNormalizer parameter, and resumed the run, but the crawler is still committing /../ pages, perhaps because it had already extracted them before the stop. Is that right? Is there a way of rejecting them without losing the crawled pages' information?

essiembre commented 9 years ago

If you change your config after you have indexed some documents in Solr, it won't go into Solr to delete documents matching your new config changes.

Once your crawler finishes after you resumed it, one thing you may try is to set the orphan strategy to delete:

<orphansStrategy>DELETE</orphansStrategy>

That way, the next crawl you start will apply your normalization. When it is done, it will check if any URLs remain from the previous run, and will delete them all. Your old non-normalized URLs should be caught and deleted at that time.
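
In case it helps, that tag goes directly under the crawler configuration, e.g. (a sketch; the crawler id is hypothetical and the rest of your config is elided):

<crawler id="my-crawler">
  ...
  <orphansStrategy>DELETE</orphansStrategy>
  ...
</crawler>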

csaezl commented 9 years ago

I don't need the crawler to delete documents from Solr.

Since a running crawler has to be stopped and resumed to apply removeDotSegments to its config file, it is possible that some /../ URLs were extracted (without normalization) before the stop and then processed in the resumed run. That way, those URLs circumvent removeDotSegments and get indexed in Solr with /../ .

Is there anything I can do to avoid it?

essiembre commented 9 years ago

You can start fresh again instead of resuming. If you do not want to do this, and if those faulty URLs have not been committed yet (i.e. are still in the queue), I suppose you can try a few things:

Suggestion 1: Filter them out in the importer section:

<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="exclude" field="document.reference" >
      /\.\./
</filter>
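
For context, this filter would typically sit under the importer's pre-parse handlers, something like this (a sketch, assuming the usual importer config layout):

<importer>
  <preParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
            onMatch="exclude" field="document.reference">
      /\.\./
    </filter>
  </preParseHandlers>
</importer>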

Suggestion 2: Mimic this normalization exercise in the importer section. Could be tough to implement in XML. This example might work for one level only:

<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
  <replace fromField="document.reference" regex="true">
      <fromValue>(/[^./]+)(/\.\./)</fromValue>
      <toValue>/</toValue>
  </replace>
</tagger>
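
To illustrate the one-level limitation on hypothetical URLs:

/a/b/../c.html       ->  /a/c.html         (the "/b/../" segment is collapsed)
/a/b/c/../../d.html  ->  /a/b/../d.html    (only one "../" is resolved per pass)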

Not sure whether changing the document.reference value might cause other issues though (something to keep an eye on).

essiembre commented 9 years ago

Closing, as there has been no feedback on the last answer for a while. Please create a new issue if you have additional questions.