elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Restore of 2.x snapshot throws checksum missing exceptions on 5.0 #18707

Closed abeyad closed 8 years ago

abeyad commented 8 years ago

Let's say we have a repository with a snapshot A created in v2.3.3. If we then start ES 5.0 (master branch) and try to restore snapshot A, we get these exceptions:

Recovery failed from null into {Gomi}{S29Q6GFKQDC7m8DlfwAiwQ}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null]; ]
RecoveryFailedException[[i1][2]: Recovery failed from null into {Gomi}{S29Q6GFKQDC7m8DlfwAiwQ}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null];
    at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$133(IndexShard.java:1450)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:392)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: [i1/o_88kBP1Q8OTmY0VJ-5quA][[i1][2]] IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null];
    at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:311)
    at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:244)
    at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1149)
    at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$133(IndexShard.java:1446)
    ... 4 more
Caused by: [i1/o_88kBP1Q8OTmY0VJ-5quA][[i1][2]] IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null];
    at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:413)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$387(StoreRecovery.java:246)
    at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:269)
    ... 7 more
Caused by: [i1/o_88kBP1Q8OTmY0VJ-5quA][[i1][2]] IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null];
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:207)
    at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:408)
    ... 9 more
Caused by: java.lang.NullPointerException: checksum must not be null
    at java.util.Objects.requireNonNull(Objects.java:228)
    at org.elasticsearch.index.store.StoreFileMetaData.<init>(StoreFileMetaData.java:64)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot$FileInfo.fromXContent(BlobStoreIndexShardSnapshot.java:316)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot.fromXContent(BlobStoreIndexShardSnapshot.java:515)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshot.fromXContent(BlobStoreIndexShardSnapshot.java:45)
    at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:113)
    at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:111)
    at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:89)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$Context.loadSnapshot(BlobStoreIndexShardRepository.java:342)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:802)

These exceptions are thrown by the StoreFileMetaData constructor, which rejects a null checksum value. This behavior is related to the change found here: https://github.com/elastic/elasticsearch/commit/5008694ba1a140c430a92c05ff84885de6a7d28a

The problem is that, for snapshots created in 2.x, the segments_N files are stored in the repository without checksums, so restoring a 2.x snapshot into ES 5.0 throws this exception.
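To make the failure mode concrete, here is a minimal sketch of the kind of null guard the stack trace points at. `FileMeta`, `tryBuild`, and the file names and lengths are hypothetical stand-ins for illustration; this is not the real StoreFileMetaData code, only the same `Objects.requireNonNull` pattern with the same message.

```java
import java.util.Objects;

public class ChecksumNullSketch {

    // Hypothetical stand-in for per-file metadata; NOT the real StoreFileMetaData.
    static final class FileMeta {
        final String name;
        final long length;
        final String checksum;

        FileMeta(String name, long length, String checksum) {
            this.name = Objects.requireNonNull(name, "name must not be null");
            this.length = length;
            // Same style of guard as the one that fails in the stack trace above.
            this.checksum = Objects.requireNonNull(checksum, "checksum must not be null");
        }
    }

    // Returns the NullPointerException message if construction fails, or null on success.
    static String tryBuild(String name, long length, String checksum) {
        try {
            new FileMeta(name, length, checksum);
            return null;
        } catch (NullPointerException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A file entry that carries a checksum builds fine (values are made up).
        System.out.println(tryBuild("_0.cfs", 1024L, "1x2y3z"));
        // A segments_N entry read from a 2.x snapshot has no checksum, so the guard fires.
        System.out.println(tryBuild("segments_2", 107L, null));
    }
}
```

Under these assumptions, any metadata entry deserialized without a checksum fails at construction time, before the restore can copy a single file for that shard.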

Interestingly, it does not prevent the index itself from being restored, as I am able to get and search against the index that was restored from the snapshot and retrieve documents.

Steps to reproduce:

  1. Install ES 2.3.3
  2. In the elasticsearch.yml file, add the line: path.repo: ["/path/to/repository/dir"]
  3. Start ES 2.3.3
  4. Create a repository at the above location: curl -XPUT localhost:9200/_snapshot/my_repo -d '{ "type": "fs", "settings": { "location": "/path/to/repository/dir", "compress": false } }'
  5. Create an index and index documents:
       curl -XPOST localhost:9200/idx1/type1 -d '{ "name": "ali", "sane": "absolutely not" }'
       curl -XPOST localhost:9200/idx1/type1 -d '{ "name": "igor", "sane": "partially" }'
  6. Create a snapshot of the index: curl -XPUT "localhost:9200/_snapshot/my_repo/snap1?wait_for_completion=true" -d '{ "indices": ["idx1"] }'
  7. Stop ES 2.3.3
  8. Install ES 5.0 from master branch
  9. In the elasticsearch.yml file, add the line: path.repo: ["/path/to/repository/dir"]
  10. Start ES 5.0
  11. Repeat step 4
  12. Try to restore the snapshot created earlier: curl -XPOST "localhost:9200/_snapshot/my_repo/snap1/_restore"

abeyad commented 8 years ago

@s1monw @imotov What do you think is the best approach for solving this? None of the 2.x snapshots will have checksums for the segments_N files.

abeyad commented 8 years ago

Also, this only seems to happen when the index contains few documents. I suspect it occurs when not all primary shards hold at least one document, though I need to dig further to confirm this.

abeyad commented 8 years ago

And when this happens, we cannot subsequently take another snapshot of the index in question; the attempt fails with "primary shard not allocated" errors. The reason is evident when looking at the cluster state for the index:

"idx1" : {
        "shards" : {
          "2" : [
            {
              "state" : "UNASSIGNED",
              "primary" : true,
              "node" : null,
              "relocating_node" : null,
              "shard" : 2,
              "index" : "idx1",
              "restore_source" : {
                "repository" : "my_repo",
                "snapshot" : "snap1",
                "version" : "2.3.2",
                "index" : "idx1"
              },
              "unassigned_info" : {
                "reason" : "ALLOCATION_FAILED",
                "at" : "2016-06-02T20:45:10.822Z",
                "failed_attempts" : 5,
                "delayed" : false,
                "details" : "failed recovery, failure RecoveryFailedException[[idx1][2]: Recovery failed from null into {Aegis}{TQPQL-DTRaq_HhapThIQSg}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null]; "
              }
            },
            {
              "state" : "UNASSIGNED",
              "primary" : false,
              "node" : null,
              "relocating_node" : null,
              "shard" : 2,
              "index" : "idx1",
              "unassigned_info" : {
                "reason" : "NEW_INDEX_RESTORED",
                "at" : "2016-06-02T20:45:10.629Z",
                "delayed" : false,
                "details" : "restore_source[my_repo/snap1]"
              }
            }
          ],
          "1" : [
            {
              "state" : "UNASSIGNED",
              "primary" : false,
              "node" : null,
              "relocating_node" : null,
              "shard" : 1,
              "index" : "idx1",
              "unassigned_info" : {
                "reason" : "NEW_INDEX_RESTORED",
                "at" : "2016-06-02T20:45:10.629Z",
                "delayed" : false,
                "details" : "restore_source[my_repo/snap1]"
              }
            },
            {
              "state" : "UNASSIGNED",
              "primary" : true,
              "node" : null,
              "relocating_node" : null,
              "shard" : 1,
              "index" : "idx1",
              "restore_source" : {
                "repository" : "my_repo",
                "snapshot" : "snap1",
                "version" : "2.3.2",
                "index" : "idx1"
              },
              "unassigned_info" : {
                "reason" : "ALLOCATION_FAILED",
                "at" : "2016-06-02T20:45:10.819Z",
                "failed_attempts" : 5,
                "delayed" : false,
                "details" : "failed recovery, failure RecoveryFailedException[[idx1][1]: Recovery failed from null into {Aegis}{TQPQL-DTRaq_HhapThIQSg}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null]; "
              }
            }
          ],
          "4" : [
            {
              "state" : "STARTED",
              "primary" : true,
              "node" : "TQPQL-DTRaq_HhapThIQSg",
              "relocating_node" : null,
              "shard" : 4,
              "index" : "idx1",
              "restore_source" : {
                "repository" : "my_repo",
                "snapshot" : "snap1",
                "version" : "2.3.2",
                "index" : "idx1"
              },
              "allocation_id" : {
                "id" : "6Mrum9dpRPGkklfb9lixEA"
              }
            },
            {
              "state" : "UNASSIGNED",
              "primary" : false,
              "node" : null,
              "relocating_node" : null,
              "shard" : 4,
              "index" : "idx1",
              "unassigned_info" : {
                "reason" : "NEW_INDEX_RESTORED",
                "at" : "2016-06-02T20:45:10.629Z",
                "delayed" : false,
                "details" : "restore_source[my_repo/snap1]"
              }
            }
          ],
          "3" : [
            {
              "state" : "UNASSIGNED",
              "primary" : false,
              "node" : null,
              "relocating_node" : null,
              "shard" : 3,
              "index" : "idx1",
              "unassigned_info" : {
                "reason" : "NEW_INDEX_RESTORED",
                "at" : "2016-06-02T20:45:10.629Z",
                "delayed" : false,
                "details" : "restore_source[my_repo/snap1]"
              }
            },
            {
              "state" : "UNASSIGNED",
              "primary" : true,
              "node" : null,
              "relocating_node" : null,
              "shard" : 3,
              "index" : "idx1",
              "restore_source" : {
                "repository" : "my_repo",
                "snapshot" : "snap1",
                "version" : "2.3.2",
                "index" : "idx1"
              },
              "unassigned_info" : {
                "reason" : "ALLOCATION_FAILED",
                "at" : "2016-06-02T20:45:10.815Z",
                "failed_attempts" : 5,
                "delayed" : false,
                "details" : "failed recovery, failure RecoveryFailedException[[idx1][3]: Recovery failed from null into {Aegis}{TQPQL-DTRaq_HhapThIQSg}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snap1]]; nested: NullPointerException[checksum must not be null]; "
              }
            }
          ],
          "0" : [
            {
              "state" : "STARTED",
              "primary" : true,
              "node" : "TQPQL-DTRaq_HhapThIQSg",
              "relocating_node" : null,
              "shard" : 0,
              "index" : "idx1",
              "restore_source" : {
                "repository" : "my_repo",
                "snapshot" : "snap1",
                "version" : "2.3.2",
                "index" : "idx1"
              },
              "allocation_id" : {
                "id" : "Yli19wMdQnOu4itVUo9IPg"
              }
            },
            {
              "state" : "UNASSIGNED",
              "primary" : false,
              "node" : null,
              "relocating_node" : null,
              "shard" : 0,
              "index" : "idx1",
              "unassigned_info" : {
                "reason" : "NEW_INDEX_RESTORED",
                "at" : "2016-06-02T20:45:10.629Z",
                "delayed" : false,
                "details" : "restore_source[my_repo/snap1]"
              }
            }
          ]
        }
      }

While some primaries are started, others remain unassigned because of the allocation failure caused by the missing checksum triggering an NPE.
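The same reading of the cluster state can be sketched in code. The `ShardEntry` class and `failedPrimaries` helper below are hypothetical and hand-transcribe only the primary entries from the dump above (shard number, state, and unassigned reason); they are not part of any Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UnassignedPrimaries {

    // Hypothetical, trimmed-down view of one routing entry from the cluster state.
    static final class ShardEntry {
        final int shard;
        final boolean primary;
        final String state;
        final String reason; // unassigned_info.reason, or null when the shard is started

        ShardEntry(int shard, boolean primary, String state, String reason) {
            this.shard = shard;
            this.primary = primary;
            this.state = state;
            this.reason = reason;
        }
    }

    // Collects the shard numbers of primaries that are unassigned after a failed allocation.
    static List<Integer> failedPrimaries(List<ShardEntry> entries) {
        List<Integer> failed = new ArrayList<>();
        for (ShardEntry e : entries) {
            if (e.primary && "UNASSIGNED".equals(e.state) && "ALLOCATION_FAILED".equals(e.reason)) {
                failed.add(e.shard);
            }
        }
        Collections.sort(failed);
        return failed;
    }

    // Primary entries transcribed by hand from the cluster state shown in this comment.
    static List<ShardEntry> primariesFromIssue() {
        return Arrays.asList(
            new ShardEntry(2, true, "UNASSIGNED", "ALLOCATION_FAILED"),
            new ShardEntry(1, true, "UNASSIGNED", "ALLOCATION_FAILED"),
            new ShardEntry(4, true, "STARTED", null),
            new ShardEntry(3, true, "UNASSIGNED", "ALLOCATION_FAILED"),
            new ShardEntry(0, true, "STARTED", null));
    }

    public static void main(String[] args) {
        System.out.println(failedPrimaries(primariesFromIssue())); // prints [1, 2, 3]
    }
}
```

With primaries 1, 2, and 3 unassigned, a new snapshot of idx1 has no started primary to read those shards from, which matches the "primary shard not allocated" errors.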