Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Issues with deleting existing documents & long URLs #382

Closed umeshkalia closed 7 years ago

umeshkalia commented 7 years ago

I am trying to crawl one of my websites using Norconex collector-http and a committer to submit documents to AWS CloudSearch. I have made good progress but am facing some issues, described below:

  1. In my first attempt, I crawled a few documents of the website and submitted them to AWS CloudSearch. Later, I found that some documents whose URLs contain the text "cite-my-term" should not have been crawled. So, I added the following filter to my config.xml file:

<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude"> .*cite-my-term.* </filter>

Now, my HTTP Collector rejects any URL containing "cite-my-term" when I start a new crawl, which is very good. But the problem is: how can I delete the documents with URLs containing "cite-my-term" that were already committed to AWS CloudSearch? I can still see these documents in AWS CloudSearch.

  2. Another problem I am facing is with a long URL: https://www.somewebsite.com/pages/testterm?term=State-Dependent+Retrieval+%28State+Dependent+Learning+And+State+Dependent+Memory%29

While committing this URL, I get the following exception and the operation breaks:

com.norconex.committer.core.CommitterException: Could not upload request to CloudSearch: { [""id" must be less than 128 characters

I tried to add the following normalizations to my configuration file, but without success:

removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters, removeDotSegments, encodeNonURICharacters, addWWW
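
Roughly, the relevant section looks like the sketch below (not my exact file; I am assuming the GenericURLNormalizer class here):

  <!-- Sketch only: how the normalizations above are typically declared,
       assuming the GenericURLNormalizer class. -->
  <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
    <normalizations>
      removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
      decodeUnreservedCharacters, removeDefaultPort,
      encodeNonURICharacters, removeDotSegments, addWWW
    </normalizations>
  </urlNormalizer>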

Please suggest solutions to these issues asap.

Thanks in advance.

essiembre commented 7 years ago

1) It is not always possible to have all post-index configuration changes reflected in your index. In this case, I would test setting the following in your crawler config:

<orphansStrategy>DELETE</orphansStrategy>

If that does not work, I am afraid you will have to query CloudSearch for the ids to delete yourself. You may also start a fresh crawl on a brand new index with your new config.
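
For clarity, that element normally goes directly under your crawler definition. A minimal sketch (ids are placeholders, other settings omitted):

  <httpcollector id="my-collector">
    ...
    <crawlers>
      <crawler id="my-crawler">
        ...
        <!-- Delete committed documents that are no longer part of the crawl
             (e.g., URLs now rejected by your new filter). -->
        <orphansStrategy>DELETE</orphansStrategy>
      </crawler>
    </crawlers>
  </httpcollector>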

2) The CloudSearch max ID length is quite limited. Using the URLNormalizer, you could define custom replacements that truncate long URLs. It is recommended that you also store the URL in a different field, one that will not be truncated. Unfortunately, the truncation may create duplicates, so you may have to be careful about what you "cut" from long URLs. Example (not tested):

  <urlNormalizer>
    <normalizations>
        ...
    </normalizations>
    <replacements>
      <!-- to keep the beginning -->
      <replace>
         <match>^(.{0,128}).*</match>
         <replacement>$1</replacement>
      </replace>
      <!-- to keep the end -->
      <replace>
         <match>.*?(.{0,128})$</match>
         <replacement>$1</replacement>
      </replace>
    </replacements>
  </urlNormalizer>

You could also use a UUIDTagger from the Importer module, but unfortunately, it will be a new one each time a document is crawled, which is no good for detecting changes/deletions. You could also investigate creating your own URLNormalizer that would convert URLs to a hash of some kind (like those URL shortener services do), or assign new URLs to a sequence ID maintained in some database.
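
Should you still want to experiment with it, a UUIDTagger is typically declared in the importer section along these lines (sketch only, not tested; the field name and attributes are assumptions):

  <importer>
    <preParseHandlers>
      <!-- Assumed usage: stores a random UUID in the "document.uuid" field
           for each document processed. -->
      <tagger class="com.norconex.importer.handler.tagger.impl.UUIDTagger"
          field="document.uuid" overwrite="true" />
    </preParseHandlers>
  </importer>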

umeshkalia commented 7 years ago

Both of your solutions worked GREAT.

I used a UUID as the unique document ID instead of the URL, and stored the URL in another field in AWS CloudSearch.
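
Roughly, the committer side is wired like the sketch below (not my exact config; the sourceReferenceField element and the field names are assumptions):

  <committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
    <!-- Endpoint and credentials omitted. -->
    <!-- Assumed setup: use the UUID generated by the UUIDTagger as the
         CloudSearch document id, while the original URL remains available
         in its own field. -->
    <sourceReferenceField keep="true">document.uuid</sourceReferenceField>
  </committer>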

Thanks a lot :)

essiembre commented 7 years ago

You're welcome.