Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Question: When crawling a website, how to transform ID with documentname and a 32guid hash? #405

Closed dgomesbr closed 6 years ago

dgomesbr commented 7 years ago

Hello all,

While crawling a huge website, I sometimes run into trouble with my document IDs being too large (with CloudSearch, for example).

I wanted to know if it's possible to transform the document ID into a 32-character GUID plus the document name, both for the ID itself and for the final checksummer at the end.

I have started posting on Stack Overflow for Norconex visibility and historical purposes: https://stackoverflow.com/questions/46530195/when-crawling-a-website-how-to-transform-id-with-documentname-and-a-32guid-hash

essiembre commented 7 years ago

You can use the UUIDTagger from the Importer module. But I would question using this as your document ID, since a new one will be generated each time you crawl. So you may not get modifications/deletions working properly (e.g. each crawl could appear to CloudSearch as a new document).

It is for this reason the checksummer uses a checksum (MD5 by default) which is guaranteed to be the same each time if the document has not changed.
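To illustrate the difference, here is a minimal Java sketch. It is not Norconex code, and the class and method names are hypothetical. It contrasts a random UUID, which changes on every call, with a name-based UUID derived from the URL through the JDK's UUID.nameUUIDFromBytes, which is MD5-based and therefore repeatable across crawls:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper for illustration only; not part of Norconex.
public class StableId {

    // Random (type 4) UUID: a different value on every call, so the same
    // page would get a new ID on each crawl.
    public static String randomId() {
        return UUID.randomUUID().toString().replace("-", "");
    }

    // Name-based (type 3, MD5) UUID: the same URL always yields the same
    // 32 hexadecimal characters.
    public static String stableId(String url) {
        byte[] bytes = url.getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(bytes).toString().replace("-", "");
    }

    public static void main(String[] args) {
        String url = "https://example.com/some/very/long/path/page.html";
        System.out.println(stableId(url).equals(stableId(url))); // true
        System.out.println(randomId().equals(randomId()));       // false (random values differ)
    }
}
```

An ID built this way stays short and stable, but it no longer contains any readable part of the URL or document name.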

I understand the need to shorten the ID sometimes. There is nothing out of the box just for this yet. The closest option would probably be to truncate using a regex with the GenericURLNormalizer. In the meantime, you could create your own IURLNormalizer.

I am marking this as a feature request. The latest snapshot of Norconex Commons Lang has StringUtil#truncateWithHash, which can shorten strings, appending a numeric hash specific to the missing part (well... almost). We should probably use it for something like what you want.
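The general idea of truncating with a hash can be sketched in plain Java as follows. This is not the Norconex implementation of StringUtil#truncateWithHash (which appends a numeric hash); the class name, the "-" separator, and the choice of a 32-character MD5 hex hash are assumptions made for illustration:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of Norconex.
public class IdShortener {

    private static final int HASH_LENGTH = 32; // MD5 digest as hex

    // Returns the MD5 digest of the text as 32 lowercase hex characters.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keeps the start of the ID and replaces the removed tail with a hash
    // of that tail, so the result is exactly maxLength characters long.
    // IDs that already fit are returned unchanged.
    public static String shorten(String id, int maxLength) {
        if (maxLength <= HASH_LENGTH + 1) {
            throw new IllegalArgumentException(
                    "maxLength must be greater than " + (HASH_LENGTH + 1));
        }
        if (id.length() <= maxLength) {
            return id;
        }
        int keep = maxLength - HASH_LENGTH - 1; // room for "-" and the hash
        return id.substring(0, keep) + "-" + md5Hex(id.substring(keep));
    }
}
```

Because the hash depends only on the removed tail, the same long ID always shortens to the same value, while two IDs sharing the same prefix but differing in the tail still get different results (barring hash collisions).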

essiembre commented 7 years ago

The latest snapshot release has a new TruncateTagger, which lets you truncate long values, appending a hash (and optionally storing the result in another field).

For Amazon CloudSearch, a new snapshot release of that Committer was made as well. It offers a new <fixBadIds>true</fixBadIds> flag that will perform the truncation for you.

Please confirm.

essiembre commented 6 years ago

This new feature is now part of the official 2.8.0 release (as well as the latest CloudSearch Committer release). Feel free to re-open if not working as expected.