VertNet / dwc-indexer

Google App Engine project for indexing DwC text files into Search API Documents
GNU Lesser General Public License v3.0

Add hash for partitioning #19

Closed: tucotuco closed this issue 10 years ago

tucotuco commented 10 years ago

Add a hash to the index, based on the Gold Support recommendation to have a field that partitions the index into chunks of roughly 10k documents each. Given an estimate of 20M records, about 2,000 hash values are needed; hence, construct the hash as doc % 2000.
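A minimal sketch of the idea, assuming the field is named `hash` and that `doc` refers to a running record counter; the actual field name and document-building code in the indexer may differ:

```python
# Sketch only (not the repository's actual code): append a numeric 'hash'
# field to each Search API document so the index is split into ~2000
# partitions of roughly 10k documents each.
from google.appengine.api import search

HASH_BUCKETS = 2000  # ~20M records / ~10k records per partition

def build_document(doc_id, record_seq, fields):
    """Build a Search API document, adding a partitioning hash.

    doc_id     -- unique document identifier
    record_seq -- running record counter (the 'doc' in doc % 2000)
    fields     -- list of search.Field objects for the DwC record
    """
    partition = record_seq % HASH_BUCKETS
    fields = list(fields) + [search.NumberField(name='hash', value=partition)]
    return search.Document(doc_id=doc_id, fields=fields)
```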

tucotuco commented 10 years ago

From Google Gold Support:

"For hash: you can keep on using an existing property that sort of partitions your index or create a new field to store a hash id. There is no perfect count on how records should be distributed among these hash values, but it would be good to have ~10K results in each partition. The purpose is to reduce the dependency on cursors and rather use batch queries. I see you are already using batch queries for this index. If these docs have hash values, I would recommend spawning parallel tasks to delete each partition independently. In short, we think these timeouts happen when your queries try to scan through recently deleted (tombstoned) records and time out eventually. With a hash, we expect queries to read through different partitions of the underlying storage."

"...apart from resources, do you have another property using which you can partition the index? For instance, I'm looking for something which would spread these deletes independently across various storage shards internally. If not, I would suggest adding a hash value to these documents (e.g., doc % 100) and then you could run these batches independently to delete docs in the index."
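A rough sketch of the per-partition deletion the support team describes, assuming the numeric `hash` field above and a hypothetical `delete_partition` helper; the indexer itself may structure this differently (e.g., one task queue task per partition):

```python
# Sketch: delete one partition of the index independently by querying on
# the 'hash' field in batches, instead of one long cursor-driven scan.
from google.appengine.api import search

def delete_partition(index_name, partition):
    """Delete all documents whose 'hash' field equals `partition`."""
    index = search.Index(name=index_name)
    # 200 is the maximum number of documents per delete request.
    options = search.QueryOptions(limit=200, ids_only=True)
    while True:
        results = index.search(search.Query(
            query_string='hash = %d' % partition, options=options))
        ids = [doc.doc_id for doc in results]
        if not ids:
            break
        index.delete(ids)
```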

tucotuco commented 10 years ago

Done in 2dc57553e020cb63514a96af9a1f6000580a55dc. The hash values will be populated as resources get re-harvested and indexed.