encryptio / git-annex-remote-b2

git-annex special remote for Backblaze's B2
MIT License

Lots of class C transactions #9

Open Schnouki opened 7 years ago

Schnouki commented 7 years ago

This remote uses lots of class C transactions to the B2 API, which can be quite expensive. I think this is mostly due to the calls to ListFileNames() for each operation. Could it be possible to replace them with "simple" calls to GetFileInfo(), a class B operation?

Thanks a lot for your work!

encryptio commented 7 years ago

I think the original reason for doing that is that there was no other way to get all versions of the file, which was needed in some error recovery cases. That was only required on writes, so I think it could be improved during reads, if that's not already true.

Also, list_file_names used to be a class B operation iff maxFileCount was 1, which it is in this remote. That doesn't seem to be the case anymore...
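A minimal sketch of that presence-check pattern, for readers unfamiliar with it: list at most one name starting at the wanted name, then compare. The `lister` interface and `fakeBucket` type are hypothetical stand-ins for illustration, not the API of this remote or of any B2 client library.

```go
package main

import "fmt"

// lister is a hypothetical stand-in for a B2 bucket client. It returns up to
// maxCount file names that sort at or after startName.
type lister interface {
	ListFileNames(startName string, maxCount int) ([]string, error)
}

// fakeBucket is an in-memory lister used only for this illustration.
type fakeBucket struct {
	names []string // must be sorted
	calls int      // number of list calls made so far
}

func (b *fakeBucket) ListFileNames(startName string, maxCount int) ([]string, error) {
	b.calls++
	var out []string
	for _, n := range b.names {
		if n >= startName && len(out) < maxCount {
			out = append(out, n)
		}
	}
	return out, nil
}

// present reports whether name exists, using one list call with maxCount 1.
func present(l lister, name string) (bool, error) {
	names, err := l.ListFileNames(name, 1)
	if err != nil {
		return false, err
	}
	// The listing starts at name, so the first result is name itself only
	// when that file exists.
	return len(names) == 1 && names[0] == name, nil
}

func main() {
	b := &fakeBucket{names: []string{"key-a", "key-c"}}
	for _, k := range []string{"key-a", "key-b"} {
		ok, _ := present(b, k)
		fmt.Println(k, ok)
	}
	fmt.Println("list calls:", b.calls)
}
```

Each presence check costs exactly one list call in this sketch, which is why the billing class of that call matters so much.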

I'll look into this more this weekend when I have some time.

timsomers commented 7 years ago

Hi, I just started using this yesterday and already burned through the class C operations included in the free tier after only 2 megabytes of data. Have you found any time to look at this? Unfortunately I don't know golang at all.

encryptio commented 7 years ago

It doesn't look like GetFileInfo is usable as-is, since it takes a B2 fileID, which B2 generates on the fly at upload time. It might be possible to store extra data in the git-annex branch with SETSTATE at upload time, but it seems like that'll run into interesting issues wrt conflicting uploads on the same key (with different chunking values, for example). Solvable, but very non-trivial. I'll keep this issue open for that improvement.

That said, there are workarounds:

I think there are so many class C calls because git-annex calls checkpresentkey a LOT, even for operations that don't obviously involve that remote (for example, a local git annex drop). It does this for every remote whose trust level is semitrusted, which is the default and means that the remote is expected to lose data sometimes. Changing the level to trusted with git annex trust b2remotename should get rid of almost all checkpresentkey calls (the notable exception being git annex fsck --fast --from b2). It's also useful to set the annex-cost of this remote relatively high if you have other, non-paid remotes connected, so that git-annex will prefer using them when possible.

Since these are helpful, non-default, and non-obvious, I'll update the readme to mention them.

encryptio commented 7 years ago

@timsomers Could you try the git annex trust command and see if that helps for you? It should, but I might have missed some other way those transactions occur.

timsomers commented 7 years ago

Hi @encryptio, my full command is already "git annex copy --to b2 --not --in b2 --trust b2", so additionally trusting that repo should not make a difference. I've changed it anyway to make sure, but we'll have to wait until tomorrow to see the result.

Can't you just cache the output of list_file_names in a temp file and refresh it, e.g., once a minute?

timsomers commented 7 years ago

Trusting the repo changed nothing. Checking the reports page it seems 2 list calls are made for each upload:

b2_list_file_names: 18,191
b2_upload_file: 9,095

Is this necessary? Can't we at least reduce this to a single call?

encryptio commented 7 years ago

@timsomers Added a cache to the ListFileNames call. Could you try that out?

I had to think pretty hard about making sure it's actually safe to do so, and ended up concluding that it's not significantly worse than the existing race condition (see 2bf053ca0a3f650856f489ce3d76528fc0ed35e6 for details on the race.)

timsomers commented 7 years ago

I've built this and it does indeed improve things. Now I manage to upload about 2.7k files with 2.5k transactions. I wanted to try with a longer cache time (I don't believe the race condition you mentioned applies to me, as I only push from a single repo) but I didn't immediately find how to rebuild my custom code, just the go get github.com/encryptio/git-annex-remote-b2 command which pulls in your repo.

encryptio commented 7 years ago

If you'd like to adjust and rebuild, edit the source in $GOPATH/src/github.com/encryptio/git-annex-remote-b2/main.go then run go install from that directory and it'll place a built binary in $GOPATH/bin like go get did.

A longer cache time wouldn't improve things, since the thing it's caching is for a single upload. git-annex calls CHECKPRESENT immediately before calling STORE, and this one-item in-memory cache reuses the result of CHECKPRESENT for the first half of the STORE operation.
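Roughly, the idea looks like the sketch below. This is an illustration under assumptions, not the code in the commit: the type and method names are made up, the listing is faked with a map, and whether the real cache clears its entry after use is a guess.

```go
package main

import "fmt"

// oneItemCache remembers only the most recent presence lookup.
type oneItemCache struct {
	valid   bool
	name    string
	present bool
}

func (c *oneItemCache) put(name string, present bool) {
	c.valid, c.name, c.present = true, name, present
}

// take returns the cached result for name, if any, and clears the cache so
// the result is reused at most once (an assumption of this sketch).
func (c *oneItemCache) take(name string) (present, ok bool) {
	if !c.valid || c.name != name {
		return false, false
	}
	c.valid = false
	return c.present, true
}

// remote is a hypothetical stand-in for the special remote.
type remote struct {
	cache     oneItemCache
	stored    map[string]bool // fake bucket contents
	listCalls int             // how many (billed) list calls were made
}

// list stands in for the ListFileNames call.
func (r *remote) list(name string) bool {
	r.listCalls++
	return r.stored[name]
}

// checkPresent handles CHECKPRESENT and remembers the answer.
func (r *remote) checkPresent(name string) bool {
	p := r.list(name)
	r.cache.put(name, p)
	return p
}

// store handles STORE, reusing the CHECKPRESENT answer when it is for the
// same name instead of listing again.
func (r *remote) store(name string) {
	p, ok := r.cache.take(name)
	if !ok {
		p = r.list(name)
	}
	if !p {
		r.stored[name] = true // stands in for the actual upload
	}
}

func main() {
	r := &remote{stored: map[string]bool{}}
	r.checkPresent("k1")
	r.store("k1")
	fmt.Println("list calls:", r.listCalls)
}
```

Because the entry only ever covers the CHECKPRESENT that immediately precedes a STORE, making it live longer would not save any further calls.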

Getting better than one ListFileNames call per upload is very difficult. It would roughly double the amount of data and changes in the git-annex branch (notable for people who have large git-annex repos, like me), and it would not be a backwards-compatible change. There would have to be a new versioning system put in place for the config operations, which comes with its own difficulties: testing complications, and the very unobvious "remove a remote completely and add it again but then my data is gone" problem, which happens because the re-added remote uses a different config version that makes different assumptions about the data format.

meristo commented 7 years ago

I've been using this and noticed the high number of class C transactions, so I modified it to be able to cache the full bucket contents to memory for an entire invocation, or for a duration which can be set by the user: https://github.com/meristo/git-annex-remote-b2/commit/7ef35c6a62721eb041347957c665accb424c89df

greggrossmeier commented 6 years ago

Has anyone used @meristo's patch? I'll try giving it a go in the next week and report back.