Question: When crawling a website, how to transform ID with documentname and a 32guid hash?

dgomesbr commented 7 years ago

Hello all,

While crawling a huge website, I sometimes run into trouble because the ID of my document is too large (with CloudSearch, for example).

I would like to know whether it's possible to transform the document ID into a 32-character GUID plus the document name, both for the ID itself and for the final checksummer.

I first posted this on Stack Overflow for Norconex visibility and historical purposes: https://stackoverflow.com/questions/46530195/when-crawling-a-website-how-to-transform-id-with-documentname-and-a-32guid-hash

essiembre commented 7 years ago

You can use the UUIDTagger from the Importer module. However, I would question using it as your document ID, since a new one will be generated each time you crawl. Modifications and deletions may then not work properly (e.g., the same document could appear to CloudSearch as a new document on each crawl).

It is for this reason the checksummer uses a checksum (MD5 by default) which is guaranteed to be the same each time if the document has not changed.
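To illustrate the difference, here is a minimal sketch in plain Java (JDK only, not Norconex code). The class name `StableIdDemo` and the helper `md5Hex` are made up for this example. It shows that an MD5 digest of the same text is identical on every call, while a random UUID is different each time it is generated.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical demo class, not part of any Norconex module.
final class StableIdDemo {

    // Returns the hex-encoded MD5 digest of the text (always 32 hex characters).
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present in every Java platform implementation.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String content = "unchanged document content";

        // Same input, same checksum: usable to detect unmodified documents.
        System.out.println(md5Hex(content).equals(md5Hex(content)));

        // Two random UUIDs differ, so a UUID-based ID changes on every crawl.
        System.out.println(UUID.randomUUID().equals(UUID.randomUUID()));
    }
}
```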

I understand the need to reduce the ID sometimes. There is nothing out of the box just for this yet. The closest option may be to truncate using a regex with the `GenericURLNormalizer`. In the meantime, you could create your own `IURLNormalizer`.

I am marking this as a feature request. The latest snapshot of Norconex Commons Lang has StringUtil#truncateWithHash, which can shorten strings, appending a numeric hash specific to the missing part (well.. almost). We should probably use it for something like what you want.
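The general idea of truncating with a hash can be sketched as follows. This is an illustration only, not the Norconex Commons Lang implementation: the class `IdShortener`, the method signature, the `!` separator, and the use of `String#hashCode` are all assumptions made for this example. The head of the string is kept, and the removed tail is replaced by a fixed-width hash of that tail, so the result is stable across crawls and two long IDs sharing a prefix are unlikely to collide.

```java
// Hypothetical helper, not the actual StringUtil#truncateWithHash.
final class IdShortener {

    // One separator character plus 8 hex digits for the tail's hash code.
    private static final int SUFFIX_LENGTH = 9;

    // Shortens text to exactly maxLength characters when it is longer,
    // replacing the removed tail with "!" and an 8-digit hex hash of it.
    static String truncateWithHash(String text, int maxLength) {
        if (maxLength <= SUFFIX_LENGTH) {
            throw new IllegalArgumentException(
                    "maxLength must be greater than " + SUFFIX_LENGTH);
        }
        if (text == null || text.length() <= maxLength) {
            return text; // Nothing to truncate.
        }
        int cut = maxLength - SUFFIX_LENGTH;
        String tail = text.substring(cut);
        // %08x renders the int hash as 8 zero-padded hex digits.
        String hash = String.format("%08x", tail.hashCode());
        return text.substring(0, cut) + "!" + hash;
    }

    public static void main(String[] args) {
        String longId = "https://example.com/some/very/long/path/to/a/document.html";
        String shortId = truncateWithHash(longId, 32);
        System.out.println(shortId + " (" + shortId.length() + " chars)");
    }
}
```

Because the hash depends only on the removed text, the same URL always maps to the same shortened ID, which keeps modifications and deletions traceable between crawls.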

essiembre commented 7 years ago

The latest snapshot release has a new TruncateTagger which allows you to truncate long values, appending a hash (and optionally store that in another field).

For Amazon CloudSearch, a new snapshot release of that Committer was made as well. It offers a new <fixBadIds>true</fixBadIds> flag that will perform the truncation for you.

Please confirm.

essiembre commented 6 years ago

This new feature is now part of the official 2.8.0 release (as well as the latest CloudSearch Committer release). Feel free to re-open if not working as expected.
