Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Issues with deleting existing documents & long URLs #382

Closed umeshkalia closed 7 years ago

umeshkalia commented 7 years ago

I am trying to crawl one of my websites using Norconex collector-http and a committer to submit documents to AWS CloudSearch. I have made good progress but am facing some issues, described below:

  1. In my first attempt, I crawled a few documents of the website and submitted them to AWS CloudSearch. Later, I found that some documents whose URLs contain the text "cite-my-term" should not have been crawled. So, I added the following filter to my config.xml file:

<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude"> .*cite-my-term.* </filter>

Now, my HTTP Collector rejects any URL containing "cite-my-term" when I start a new crawl, which is very good. But the problem is: how can I delete the documents with URLs containing "cite-my-term" that were already committed to AWS CloudSearch? I can still see these documents in AWS CloudSearch.

  2. Another problem I am facing is with a long URL: https://www.somewebsite.com/pages/testterm?term=State-Dependent+Retrieval+%28State+Dependent+Learning+And+State+Dependent+Memory%29

While committing this URL, I get the following exception and the operation breaks:

com.norconex.committer.core.CommitterException: Could not upload request to CloudSearch: { [""id" must be less than 128 characters

I tried to add the following normalizations to my configuration file, but without success:

removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters, removeDotSegments, encodeNonURICharacters, addWWW
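
Roughly, the relevant section looks like the sketch below (not my exact file; I am assuming the GenericURLNormalizer class here):

  <!-- Sketch only: how the normalizations above are typically declared,
       assuming the GenericURLNormalizer class. -->
  <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
    <normalizations>
      removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
      decodeUnreservedCharacters, removeDefaultPort,
      encodeNonURICharacters, removeDotSegments, addWWW
    </normalizations>
  </urlNormalizer>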

Please suggest solutions to these issues asap.

Thanks in advance.

essiembre commented 7 years ago

1) It is not always possible to have all post-index configuration changes reflected in your index. In this case, I would test setting the following in your crawler config:

<orphansStrategy>DELETE</orphansStrategy>

If that does not work, I am afraid you will have to query CloudSearch for the ids to delete yourself. You may also start a fresh crawl on a brand new index with your new config.
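
For clarity, that element normally goes directly under your crawler definition. A minimal sketch (ids are placeholders, other settings omitted):

  <httpcollector id="my-collector">
    ...
    <crawlers>
      <crawler id="my-crawler">
        ...
        <!-- Delete committed documents that are no longer part of the crawl
             (e.g., URLs now rejected by your new filter). -->
        <orphansStrategy>DELETE</orphansStrategy>
      </crawler>
    </crawlers>
  </httpcollector>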

2) The CloudSearch max ID length is quite limited. Using the URLNormalizer, you could define custom replacements that truncate long URLs. It is recommended that you also store the URL in a different field, one that will not be truncated. Unfortunately, the truncation may create duplicates, so you may have to be careful about what you "cut" from long URLs. Example (not tested):

  <urlNormalizer>
    <normalizations>
        ...
    </normalizations>
    <replacements>
      <!-- to keep the beginning -->
      <replace>
         <match>^(.{0,128}).*</match>
         <replacement>$1</replacement>
      </replace>
      <!-- to keep the end -->
      <replace>
         <match>.*?(.{0,128})$</match>
         <replacement>$1</replacement>
      </replace>
    </replacements>
  </urlNormalizer>

You could also use a UUIDTagger from the Importer module, but unfortunately, it will be a new one each time a document is crawled, which is no good for detecting changes/deletions. You could also investigate creating your own URLNormalizer that would convert URLs to a hash of some kind (like those URL shortener services do), or assign new URLs to a sequence ID maintained in some database.
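
Should you still want to experiment with it, a UUIDTagger is typically declared in the importer section along these lines (sketch only, not tested; the field name and attributes are assumptions):

  <importer>
    <preParseHandlers>
      <!-- Assumed usage: stores a random UUID in the "document.uuid" field
           for each document processed. -->
      <tagger class="com.norconex.importer.handler.tagger.impl.UUIDTagger"
          field="document.uuid" overwrite="true" />
    </preParseHandlers>
  </importer>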

umeshkalia commented 7 years ago

Both of your solutions worked GREAT.

I used a UUID as the unique document ID instead of the URL, and stored the URL in another field in AWS CloudSearch.
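
Roughly, the committer side is wired like the sketch below (not my exact config; the sourceReferenceField element and the field names are assumptions):

  <committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
    <!-- Endpoint and credentials omitted. -->
    <!-- Assumed setup: use the UUID generated by the UUIDTagger as the
         CloudSearch document id, while the original URL remains available
         in its own field. -->
    <sourceReferenceField keep="true">document.uuid</sourceReferenceField>
  </committer>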

Thanks a lot :)

essiembre commented 7 years ago

You're welcome.