Closed · umeshkalia closed this issue 7 years ago
1) It is not always possible to have all post-index configuration changes reflected in your index. In this case, I would test setting the following in your crawler config:
<orphansStrategy>DELETE</orphansStrategy>
If that does not work, I am afraid you will have to query CloudSearch for the ids to delete yourself. You may also start a fresh crawl on a brand new index with your new config.
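For reference, a minimal sketch of where that setting goes (element placement assumed from the Norconex HTTP Collector 2.x crawler configuration; verify against your version's reference docs):

```xml
<!-- Inside your <crawler> definition. DELETE tells the committer to
     send deletion requests for documents that were previously committed
     but are no longer encountered (e.g. now rejected by a filter). -->
<crawler id="my-crawler">
  ...
  <orphansStrategy>DELETE</orphansStrategy>
  ...
</crawler>
```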
2) The CloudSearch maximum ID length is quite limited. Using the URLNormalizer, you could define custom replacements that truncate long URLs. I recommend also storing the full URL in a separate field, one that will not be truncated. Unfortunately, truncation may create duplicates, so be careful what you "cut" from long URLs. Example (not tested):
<urlNormalizer>
  <normalizations>
    ...
  </normalizations>
  <replacements>
    <!-- to keep the beginning -->
    <replace>
      <match>^(.{0,128}).*</match>
      <replacement>$1</replacement>
    </replace>
    <!-- to keep the end -->
    <replace>
      <match>.*?(.{0,128})$</match>
      <replacement>$1</replacement>
    </replace>
  </replacements>
</urlNormalizer>
You could also use a UUIDTagger from the Importer module, but unfortunately it generates a new UUID each time a document is crawled, which is no good for detecting changes/deletions. You could also investigate creating your own URLNormalizer that converts URLs to a hash of some kind (like those URL-shortener services do), or one that maps new URLs to a sequence ID maintained in a database.
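To illustrate the UUIDTagger option above, here is a sketch of how it could be wired in as a pre-parse handler in the Importer configuration (class name from the Norconex Importer 2.x module; the `field` and `overwrite` attribute names are assumptions, so check the UUIDTagger docs for your version):

```xml
<importer>
  <preParseHandlers>
    <!-- Stores a generated UUID in the "document.uuid" field.
         Note the caveat above: a fresh UUID is produced on every
         recrawl, so it cannot serve as a stable document ID for
         change/deletion detection. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.UUIDTagger"
        field="document.uuid" overwrite="true" />
  </preParseHandlers>
</importer>
```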
Both of your solutions worked GREAT.
I used a UUID as the unique document ID instead of the URL, and stored the URL in another field in AWS CloudSearch.
Thanks a lot :)
You're welcome.
I am trying to crawl one of my websites using the Norconex HTTP Collector and the committer to submit documents to AWS CloudSearch. I have made good progress but am facing some issues, as described below:
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
    onMatch="exclude">
  .*cite-my-term.*
</filter>
Now my HTTP Collector rejects any URL containing "cite-my-term" when I start a new crawl, which is very good. But the problem is: how can I delete documents whose URL contains "cite-my-term" that were already committed to AWS CloudSearch? I can still see these documents in AWS CloudSearch.
While committing this URL, I get the following exception and the operation breaks:
I tried to add the following in my configuration file, but it did not work:
Please suggest solutions to these issues asap.
Thanks in advance.