Schnouki opened 7 years ago
I think the original reason for doing that is that there was no other way to get all versions of the file, which was needed in some error recovery cases. That was only required on writes, so I think it could be improved during reads, if that's not already true.
Also, `list_file_names` used to be a class B operation iff `maxFileCount` was 1, which it is in this remote. That doesn't seem to be the case anymore...
I'll look into this more this weekend when I have some time.
Hi, I just started using this yesterday and already burned through the class C operations included in the free tier after only 2 megabytes of data. Have you found any time to look at this? Unfortunately I don't know golang at all.
It doesn't look like `GetFileInfo` is usable as-is, since it takes a B2 `fileID`, which B2 generates on the fly at upload time. It might be possible to store extra data in the `git-annex` branch with `SETSTATE` at upload time, but it seems like that'll run into interesting issues wrt conflicting uploads of the same key (with different chunking values, for example). Solvable, but very non-trivial. I'll keep this issue open for that improvement.
That said, there are workarounds:
I think there are so many class C calls because `git-annex` is calling `checkpresentkey` a LOT, even on operations that aren't obviously using that remote (for example, a local `git annex drop`), which it does for all remotes whose trust level is `semitrusted` (the default, which means the remote is expected to lose data sometimes). Changing it to `trusted` with `git annex trust b2remotename` should get rid of almost all `checkpresentkey` calls (the notable exception being `git annex fsck --fast --from b2`). It's also useful to set the `annex-cost` of this remote relatively high if you have other, non-paid remotes connected, so that `git-annex` will prefer using them when possible.
Since these are helpful, non-default, and non-obvious, I'll update the readme to mention them.
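For reference, both workarounds boil down to two commands (assuming the remote is named `b2`; the cost value is arbitrary, it just needs to be higher than your other remotes'):

```shell
# Mark the remote as trusted so git-annex stops routine checkpresentkey probes.
# Caveat: "trusted" tells git-annex the remote is assumed to never lose data.
git annex trust b2

# Raise the remote's cost so cheaper remotes are preferred when possible.
git config remote.b2.annex-cost 500
```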
@timsomers Could you try the `git annex trust` command and see if that helps for you? It should, but I might have missed some other way those transactions occur.
Hi @encryptio, my full command is already `git annex copy --to b2 --not --in b2 --trust b2`, so additionally trusting that repo should not make a difference. I've changed it anyway to make sure, but we'll have to wait until tomorrow to see the result.
Can't you just cache the output of `list_file_names` in a temp file and refresh it e.g. once a minute?
Trusting the repo changed nothing. Checking the reports page, it seems 2 list calls are made for each upload:

| API call | Count |
| --- | --- |
| `b2_list_file_names` | 18,191 |
| `b2_upload_file` | 9,095 |

Is this necessary? Can't we at least reduce this to a single call?
@timsomers Added a cache to the `ListFileNames` call. Could you try that out?

I had to think pretty hard about making sure it's actually safe to do so, and ended up concluding that it's not significantly worse than the existing race condition (see 2bf053ca0a3f650856f489ce3d76528fc0ed35e6 for details on the race.)
I've built this and it does indeed improve things. Now I manage to upload about 2.7k files with 2.5k transactions. I wanted to try a longer cache time (I don't believe the race condition you mentioned applies to me, as I only push from a single repo), but I didn't immediately find how to rebuild my custom code, just the `go get github.com/encryptio/git-annex-remote-b2` command, which pulls in your repo.
If you'd like to adjust and rebuild, edit the source in `$GOPATH/src/github.com/encryptio/git-annex-remote-b2/main.go`, then run `go install` from that directory and it'll place a built binary in `$GOPATH/bin` like `go get` did.
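Spelled out, the edit-and-rebuild loop looks like this (GOPATH-style layout, as `go get` used at the time):

```shell
# Edit the source, then reinstall; `go install` drops the binary
# in $GOPATH/bin, just like `go get` did.
cd "$GOPATH/src/github.com/encryptio/git-annex-remote-b2"
$EDITOR main.go
go install
```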
A longer cache time wouldn't improve things, since the thing it's caching is for a single upload. `git-annex` calls `CHECKPRESENT` immediately before calling `STORE`, and this one-item in-memory cache reuses the result of `CHECKPRESENT` for the first half of the `STORE` operation.
Getting better than one `ListFileNames` call per upload is very difficult. It would roughly double the amount of data and changes in the `git-annex` branch (notable for people who have large `git-annex` repos, like me), and it would not be a backwards-compatible change, so there'd have to be a new versioning system put in place for the config operations. That comes with its own difficulties, like testing complications and the very unobvious "remove a remote completely and add it again, but then my data is gone" problem (because the re-added remote would use a different config version that makes different assumptions about the data format).
I've been using this and noticed the high number of class C transactions, so I modified it to be able to cache the full bucket contents to memory for an entire invocation, or for a duration which can be set by the user: https://github.com/meristo/git-annex-remote-b2/commit/7ef35c6a62721eb041347957c665accb424c89df
Has anyone used @meristo 's patch? I'll try giving it a go in the next week and report back.
This remote uses lots of class C transactions to the B2 API, which can be quite expensive. I think this is mostly due to the calls to `ListFileNames()` for each operation. Could it be possible to replace them with "simple" calls to `GetFileInfo()`, a class B operation?

Thanks a lot for your work!