Multiple uploads to glacier leave inconsistent sqlite db

marcopaga commented 12 years ago

I use git-annex as a frontend for glacier-cli. If I try to upload files multiple times to glacier, the sqlite db is left in an inconsistent state (multiple rows with the same name).

I wrote a really simple mechanism to handle the case when uploading a new archive by checking the existence of the name in the db.

But I'm not fluent in Python and not 100% sure how to handle the inventory cache correctly.
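A minimal sketch of the kind of check described above, for illustration only. It assumes a hypothetical `archive` table with `id`, `vault` and `name` columns; the real glacier-cli cache schema is not shown in this thread, and `name_exists` and `record_upload` are made-up helper names. Note that such a check only sees what the local cache knows about.

```python
import sqlite3


def name_exists(conn, vault, name):
    """Return True if the local cache already holds an archive with this name."""
    row = conn.execute(
        "SELECT 1 FROM archive WHERE vault = ? AND name = ? LIMIT 1",
        (vault, name),
    ).fetchone()
    return row is not None


def record_upload(conn, vault, name, archive_id):
    """Add a cache row unless the name is already taken.

    Returns True if a row was added, False if the name already existed.
    The caller decides whether False means "skip" or "fail".
    """
    if name_exists(conn, vault, name):
        return False
    conn.execute(
        "INSERT INTO archive (id, vault, name) VALUES (?, ?, ?)",
        (archive_id, vault, name),
    )
    return True


# Hypothetical cache schema, created in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive (id TEXT PRIMARY KEY, vault TEXT, name TEXT)")

first = record_upload(conn, "backups", "SHA256-key", "id-1")
second = record_upload(conn, "backups", "SHA256-key", "id-2")
row_count = conn.execute("SELECT COUNT(*) FROM archive").fetchone()[0]
```

In a real upload path the existence check would have to run before the transfer starts, since the archive ID is only known afterwards.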

basak commented 11 years ago

Since Amazon Glacier permits multiple archives with the same archive description, and glacier-cli maps these to "names", I think it's perfectly valid for glacier-cli to permit multiple archives with the same name.

So having multiple rows with the same name in sqlite is not an inconsistent state. It's perfectly valid. See the disambiguation section in README.md for details.

However, perhaps git-annex doesn't expect this. I wasn't aware that it might try writing the same key more than once. Could you please give me details on how I could reproduce this behaviour?

If git-annex needs it, we could request your uniqueness check with a command line option. But I'm not sure what git-annex expects glacier-cli to do in this case. Just fail, or something else? So I'd like to understand this better before making this change.

marcopaga commented 11 years ago

Thank you very much for the clarification. Having multiple files with the same name (the key, for git-annex) is not desired for git-annex. Git-annex tracks the content by a unique id (SHA checksum by default). If the key is present, everything is fine.

You can reproduce the error by uploading a single file multiple times to glacier.

git annex copy . --to glacier && git annex copy . --to glacier

This will raise an error. The problem is that multiple rows with the same name are present in the database and the access is done via .one().
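The failure mode can be illustrated without touching Glacier. The sketch below is an assumption-laden stand-in, not glacier-cli's code: the `archive` table is hypothetical, and the `one()` helper only imitates a strict single-row accessor (in SQLAlchemy, for example, `Query.one()` raises `MultipleResultsFound` when more than one row matches).

```python
import sqlite3


class NoResultFound(Exception):
    """Raised when a lookup expected one row but matched none."""


class MultipleResultsFound(Exception):
    """Raised when a lookup expected one row but matched several."""


def one(rows):
    """Return the only row in `rows`, or raise if there are zero or several."""
    if not rows:
        raise NoResultFound("no row matched")
    if len(rows) > 1:
        raise MultipleResultsFound(f"expected 1 row, got {len(rows)}")
    return rows[0]


# Hypothetical cache schema holding two uploads of the same key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive (id TEXT PRIMARY KEY, vault TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO archive (id, vault, name) VALUES (?, ?, ?)",
    [("id-1", "backups", "SHA256-key"), ("id-2", "backups", "SHA256-key")],
)

rows = conn.execute(
    "SELECT id FROM archive WHERE vault = ? AND name = ?",
    ("backups", "SHA256-key"),
).fetchall()

try:
    one(rows)
    outcome = "unique"
except MultipleResultsFound:
    outcome = "ambiguous"
```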

When I use the trust-glacier option it works as I would expect:

git annex copy . --to glacier --trust-glacier

This doesn't start a transfer if the file seems to be present. This is what I expected. It works like the S3 backend.

(Reply sent by email to basak's comment of 08.12.2012 04:39, which is quoted in full above.)

sakoht commented 11 years ago

@basak you noted that glacier supports multiple archives with the same name, but glacier-cli has problems if this occurs. You can't retrieve by ID, and retrieving by name when there are multiple matches raises an error, as does attempting to delete one of the N. I had to hand-edit the sqlite db to keep working after accidentally uploading two things with the same name.

Should we:

1. change the code to let the user specify a specific archive ID wherever they previously specified vault+name.
2. change the code to check for duplicate names in the upload.

While option 2 would suffice for most users, option 1 seems more correct.

basak commented 11 years ago

@sakoht

> You can't retrieve by ID

You should be able to, and if you can't, then it's a bug. Please could you file a separate issue detailing this situation and how to reproduce it?

> retrieving by name when there are multiple matches raises an error

This is by design, since glacier-cli doesn't know which one you mean.

> as does attempting to delete one of the N

This should not fail, either, and is a bug. Please include details of this in the issue you file above.

basak commented 11 years ago

When I said:

> This should not fail, either, and is a bug. Please include details of this in the issue you file above.

I meant if you're trying to delete by ID. If you're trying to delete by name, then this is expected, as the instruction is ambiguous.

basak commented 11 years ago

I think the git-annex case is covered well in http://git-annex.branchable.com/bugs/Glacier_remote_uploads_duplicates/, so I'll close this issue, as there's no action that can be taken in glacier-cli.

To reiterate: duplicates exist by design, since they cannot be prevented if you use more than one computer (or cache) to upload to Glacier. Thus, glacier-cli needs to be able to deal with them, and it does - by allowing the user to disambiguate by using IDs. This is detailed in the README. If it doesn't work as specified, please file a bug.
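As a rough illustration of disambiguation by ID, the sketch below resolves an archive either by ID or by name and refuses to guess when a name matches several rows. It is not glacier-cli's implementation: the `archive` table, the `resolve` function and the `AmbiguousName` exception are hypothetical names chosen for the example, and the README remains the reference for the actual command-line syntax.

```python
import sqlite3


class AmbiguousName(Exception):
    """Raised when a name matches more than one archive."""


def resolve(conn, vault, name=None, archive_id=None):
    """Return the ID of the single archive identified by ID or by name."""
    if archive_id is not None:
        # An ID is unique, so this lookup can never be ambiguous.
        rows = conn.execute(
            "SELECT id FROM archive WHERE vault = ? AND id = ?",
            (vault, archive_id),
        ).fetchall()
    else:
        rows = conn.execute(
            "SELECT id FROM archive WHERE vault = ? AND name = ? ORDER BY id",
            (vault, name),
        ).fetchall()
    if not rows:
        raise KeyError("no matching archive in the cache")
    if len(rows) > 1:
        ids = ", ".join(r[0] for r in rows)
        raise AmbiguousName(
            f"{len(rows)} archives are named {name!r}; specify one by ID: {ids}"
        )
    return rows[0][0]


# Hypothetical cache with two archives sharing one name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive (id TEXT PRIMARY KEY, vault TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO archive (id, vault, name) VALUES (?, ?, ?)",
    [("id-1", "backups", "photos.tar"), ("id-2", "backups", "photos.tar")],
)

chosen = resolve(conn, "backups", archive_id="id-2")

try:
    resolve(conn, "backups", name="photos.tar")
    message = ""
except AmbiguousName as exc:
    message = str(exc)
```

Listing the candidate IDs in the error text is one way to point the user at the disambiguation step when a name alone is not enough.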

basak commented 11 years ago

@sakoht

I just realised that when you said:

> change the code to let the user specify a specific archive ID wherever they previously specified vault+name.

You may not realise that glacier-cli can do this already. Have you read the README?

sakoht commented 11 years ago

I did not study the readme carefully enough. Perhaps the issue is just error messaging. It seems this is being discussed in #34.

basak / glacier-cli

Multiple uploads to glacier leave inconsistent sqlite db #19