abeyad closed 8 years ago
@s1monw @imotov What do you think is the best approach for solving this? None of the 2.x snapshots will have checksums for the segments_N files.
Also, this only seems to happen when the index contains few documents. I suspect it occurs when not all primary shards are populated with at least one document, though I need to dig further to confirm this.
And when this happens, we cannot subsequently take a snapshot of the index in question; the attempt fails with "primary shard not allocated" errors. The reason is evident when looking at the cluster state for the index:
"idx1" : {
"shards" : {
"2" : [
{
"state" : "UNASSIGNED",
"primary" : true,
"node" : null,
"relocating_node" : null,
"shard" : 2,
"index" : "idx1",
"restore_source" : {
"repository" : "my_repo",
"snapshot" : "snap1",
"version" : "2.3.2",
"index" : "idx1"
},
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2016-06-02T20:45:10.822Z",
"failed_attempts" : 5,
"delayed" : false,
"details" : "failed recovery, failure RecoveryFailedException[[idx1][2]: Recovery failed from null into {Aegis}{TQPQL-DTRaq_HhapThIQSg}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null]; "
}
},
{
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 2,
"index" : "idx1",
"unassigned_info" : {
"reason" : "NEW_INDEX_RESTORED",
"at" : "2016-06-02T20:45:10.629Z",
"delayed" : false,
"details" : "restore_source[my_repo/snap1]"
}
}
],
"1" : [
{
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 1,
"index" : "idx1",
"unassigned_info" : {
"reason" : "NEW_INDEX_RESTORED",
"at" : "2016-06-02T20:45:10.629Z",
"delayed" : false,
"details" : "restore_source[my_repo/snap1]"
}
},
{
"state" : "UNASSIGNED",
"primary" : true,
"node" : null,
"relocating_node" : null,
"shard" : 1,
"index" : "idx1",
"restore_source" : {
"repository" : "my_repo",
"snapshot" : "snap1",
"version" : "2.3.2",
"index" : "idx1"
},
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2016-06-02T20:45:10.819Z",
"failed_attempts" : 5,
"delayed" : false,
"details" : "failed recovery, failure RecoveryFailedException[[idx1][1]: Recovery failed from null into {Aegis}{TQPQL-DTRaq_HhapThIQSg}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null]; "
}
}
],
"4" : [
{
"state" : "STARTED",
"primary" : true,
"node" : "TQPQL-DTRaq_HhapThIQSg",
"relocating_node" : null,
"shard" : 4,
"index" : "idx1",
"restore_source" : {
"repository" : "my_repo",
"snapshot" : "snap1",
"version" : "2.3.2",
"index" : "idx1"
},
"allocation_id" : {
"id" : "6Mrum9dpRPGkklfb9lixEA"
}
},
{
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 4,
"index" : "idx1",
"unassigned_info" : {
"reason" : "NEW_INDEX_RESTORED",
"at" : "2016-06-02T20:45:10.629Z",
"delayed" : false,
"details" : "restore_source[my_repo/snap1]"
}
}
],
"3" : [
{
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 3,
"index" : "idx1",
"unassigned_info" : {
"reason" : "NEW_INDEX_RESTORED",
"at" : "2016-06-02T20:45:10.629Z",
"delayed" : false,
"details" : "restore_source[my_repo/snap1]"
}
},
{
"state" : "UNASSIGNED",
"primary" : true,
"node" : null,
"relocating_node" : null,
"shard" : 3,
"index" : "idx1",
"restore_source" : {
"repository" : "my_repo",
"snapshot" : "snap1",
"version" : "2.3.2",
"index" : "idx1"
},
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2016-06-02T20:45:10.815Z",
"failed_attempts" : 5,
"delayed" : false,
"details" : "failed recovery, failure RecoveryFailedException[[idx1][3]: Recovery failed from null into {Aegis}{TQPQL-DTRaq_HhapThIQSg}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null]; "
}
}
],
"0" : [
{
"state" : "STARTED",
"primary" : true,
"node" : "TQPQL-DTRaq_HhapThIQSg",
"relocating_node" : null,
"shard" : 0,
"index" : "idx1",
"restore_source" : {
"repository" : "my_repo",
"snapshot" : "snap1",
"version" : "2.3.2",
"index" : "idx1"
},
"allocation_id" : {
"id" : "Yli19wMdQnOu4itVUo9IPg"
}
},
{
"state" : "UNASSIGNED",
"primary" : false,
"node" : null,
"relocating_node" : null,
"shard" : 0,
"index" : "idx1",
"unassigned_info" : {
"reason" : "NEW_INDEX_RESTORED",
"at" : "2016-06-02T20:45:10.629Z",
"delayed" : false,
"details" : "restore_source[my_repo/snap1]"
}
}
]
}
}
While some primaries are activated, others remain unassigned because allocation fails when the missing checksum triggers an NPE.
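To pick out the affected shards from a routing table like the one above without reading it by eye, something along these lines may help. It is a minimal sketch: the helper name `unassigned_primaries` and the trimmed `sample` document are made up for illustration, and the only assumption about the input is that it has the same shape as the cluster state output shown in this issue.

```python
import json


def unassigned_primaries(index_routing):
    """Return (shard_id, reason, details) for every unassigned primary copy.

    `index_routing` is assumed to look like the "idx1" object above:
    a "shards" mapping from shard id to a list of shard copies.
    """
    found = []
    for shard_id, copies in index_routing["shards"].items():
        for copy in copies:
            if copy["primary"] and copy["state"] == "UNASSIGNED":
                info = copy.get("unassigned_info", {})
                found.append(
                    (int(shard_id), info.get("reason"), info.get("details", ""))
                )
    return sorted(found)


# Trimmed, illustrative excerpt of the routing table (not the full output).
sample = json.loads("""
{
  "idx1": {"shards": {
    "2": [
      {"state": "UNASSIGNED", "primary": true, "shard": 2,
       "unassigned_info": {
         "reason": "ALLOCATION_FAILED",
         "details": "nested: NullPointerException[checksum must not be null]; "}},
      {"state": "UNASSIGNED", "primary": false, "shard": 2,
       "unassigned_info": {"reason": "NEW_INDEX_RESTORED"}}
    ],
    "4": [
      {"state": "STARTED", "primary": true, "shard": 4}
    ]
  }}
}
""")

for shard, reason, details in unassigned_primaries(sample["idx1"]):
    hit_npe = "checksum must not be null" in details
    print(f"shard {shard}: {reason} (checksum NPE: {hit_npe})")
```

Run against the full output above, this should report shards 1, 2 and 3 as failed primaries, while 0 and 4 are skipped because their primaries are STARTED.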
Let's say we have a repository with a snapshot A created in v2.3.3. Now, if we start ES 5.0 (master branch) and try to restore snapshot A, we get exceptions during the restore.
These exceptions come from the StoreFileMetaData class, which throws an exception if the checksum value is null. This is related to the change found here: https://github.com/elastic/elasticsearch/commit/5008694ba1a140c430a92c05ff84885de6a7d28a
The problem is that for snapshots created in 2.x, the segments_N files do not have checksums when stored in the repository, so when we try to restore a snapshot from 2.x into ES 5.0, this exception is thrown.
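The failure mode can be modelled in a few lines. This is a Python illustration only, not the Elasticsearch Java code: the `FileMeta` class, the `load_snapshot_files` helper, and the file names, lengths and checksum value are all hypothetical. It assumes nothing more than what is described above, namely a metadata record that rejects a null checksum and a 2.x snapshot whose segments_N entry carries none.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for a file metadata record with a strict checksum."""
    name: str
    length: int
    checksum: Optional[str]

    def __post_init__(self):
        # Mirrors the strict behaviour described above: no checksum, no record.
        if self.checksum is None:
            raise ValueError("checksum must not be null")


def load_snapshot_files(entries):
    """Build metadata records from snapshot file entries (plain dicts)."""
    return [FileMeta(e["name"], e["length"], e.get("checksum")) for e in entries]


# Illustrative 2.x-style snapshot listing: the segments_N entry has no checksum.
snapshot_2x = [
    {"name": "_0.cfs", "length": 1024, "checksum": "1abc2de"},
    {"name": "segments_1", "length": 128},
]

try:
    load_snapshot_files(snapshot_2x)
    restore_error = None
except ValueError as exc:
    restore_error = str(exc)

print(restore_error)
```

Any fix has to make this path tolerate a missing checksum for segments_N files from 2.x snapshots, or supply one at restore time; which option is preferable is the open question in this issue.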
Interestingly, it does not prevent the index itself from being restored, as I am able to get and search against the index that was restored from the snapshot and retrieve documents.
Steps to reproduce:
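1. Set the repository path in the 2.x node's elasticsearch.yml and start the node:
path.repo: ["/path/to/repository/dir"]
2. Register the repository:
curl -XPUT localhost:9200/_snapshot/my_repo -d '{ "type": "fs", "settings": { "location": "/path/to/repository/dir", "compress": false } }'
3. Index a couple of documents:
curl -XPOST localhost:9200/idx1/type1 -d '{ "name": "ali", "sane": "absolutely not" }'
curl -XPOST localhost:9200/idx1/type1 -d '{ "name": "igor", "sane": "partially" }'
4. Take a snapshot of the index:
curl -XPUT "localhost:9200/_snapshot/my_repo/snap1?wait_for_completion=true" -d '{ "indices": ["idx1"] }'
5. Set the same repository path in the ES 5.0 node's elasticsearch.yml and start that node:
path.repo: ["/path/to/repository/dir"]
6. Restore the snapshot:
curl -XPOST "localhost:9200/_snapshot/my_repo/snap1/_restore"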