dgomesbr closed 6 years ago
You can use the UUIDTagger from the Importer module. But I would question using this as your document ID, since a new one will be generated each time you crawl. So you may not get modifications/deletions working properly (e.g. the document could appear to CloudSearch as a new one each time).
It is for this reason the checksummer uses a checksum (MD5 by default) which is guaranteed to be the same each time if the document has not changed.
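To illustrate the difference, here is a minimal sketch using only the JDK. It is not Norconex code: the class name `StableIdDemo` and the helper `md5Hex` are made up for this example. It only shows that a checksum of unchanged content comes out the same on every run, while a random UUID does not.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical demo class, not part of Norconex.
public class StableIdDemo {

    // Returns the hex-encoded MD5 digest of the given text.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String content = "<html>same page body</html>";

        // Same content on two "crawls": the checksums match.
        System.out.println(md5Hex(content).equals(md5Hex(content)));

        // Two random UUIDs for the same document: they will not match
        // in practice, so each crawl would look like a new document.
        System.out.println(UUID.randomUUID().equals(UUID.randomUUID()));
    }
}
```

This is why a generated UUID is a poor fit for a document ID if you rely on the search engine to detect updates and deletions.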
I understand the need to reduce the ID sometimes. Nothing out of the box just for this yet. The closest maybe would be to truncate using regex with the GenericURLNormalizer. In the meantime, you could create your own IURLNormalizer.
I am marking as a feature request. The latest snapshot of Norconex Commons Lang has StringUtil#truncateWithHash, which can shorten strings, appending a numeric hash specific to the missing part (well... almost). We should probably use it for something like what you want.
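For readers wondering what "truncate and append a hash" could look like, below is a rough sketch of the general idea in plain Java. It is not the Norconex implementation and does not reproduce the behavior of StringUtil#truncateWithHash. The class `TruncateSketch`, the method `shortenWithHash`, the fixed 10-digit hash width, and the limit of 128 used in `main` are all assumptions made for this example.

```java
// Hypothetical illustration, not the Norconex implementation.
public class TruncateSketch {

    // Number of characters reserved for the zero-padded numeric hash
    // (an unsigned 32-bit value has at most 10 decimal digits).
    private static final int HASH_WIDTH = 10;

    // Shortens text to exactly maxLength characters by replacing the
    // tail with a numeric hash of the removed part. Text that already
    // fits is returned unchanged.
    static String shortenWithHash(String text, int maxLength) {
        if (text == null || text.length() <= maxLength) {
            return text;
        }
        if (maxLength < HASH_WIDTH) {
            throw new IllegalArgumentException(
                    "maxLength must be at least " + HASH_WIDTH);
        }
        int keep = maxLength - HASH_WIDTH;
        String removed = text.substring(keep);
        // Treat the hash code as unsigned to avoid a minus sign.
        long hash = Integer.toUnsignedLong(removed.hashCode());
        return text.substring(0, keep) + String.format("%010d", hash);
    }

    public static void main(String[] args) {
        StringBuilder url = new StringBuilder("https://example.com/docs/");
        for (int i = 0; i < 200; i++) {
            url.append('a');
        }
        // 128 is only an illustrative ID length limit for this demo.
        String id = shortenWithHash(url.toString(), 128);
        System.out.println(id.length());
        System.out.println(id);
    }
}
```

Because the hash is derived from the removed part, the same long ID always maps to the same short one. Two different long IDs sharing a prefix will usually get different suffixes, although collisions remain possible with a 32-bit hash.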
The latest snapshot release has a new TruncateTagger which allows you to truncate long values, appending a hash (and optionally store that in another field).
For Amazon CloudSearch, a new snapshot release of that Committer was made as well. It offers a new <fixBadIds>true</fixBadIds> flag that will perform the truncation for you.
Please confirm.
This new feature is now part of the official 2.8.0 release (as well as the latest CloudSearch Committer release). Feel free to re-open if not working as expected.
Hello all,
While crawling a huge website, I sometimes run into trouble with the ID of my document being too large (in the case of CloudSearch, for example).
I wanted to know if it's possible to apply a transformation to the document ID so that it becomes a 32-character GUID + document name, both for the ID and for the final checksummer at the end.
I started by posting on Stack Overflow for Norconex visibility and historical purposes. https://stackoverflow.com/questions/46530195/when-crawling-a-website-how-to-transform-id-with-documentname-and-a-32guid-hash