echoing https://git-annex.branchable.com/todo/registerurl__58___do_changes_in_journal___34__in_place__34____63__/#comment-85131966a882ff24431566d146b5634b -- can we reorder the registerurl calls so that we do all additions of urls to a key before proceeding to the next key? I thought we could, since we are probably fetching all the information about the keys first, so we can reorder any way we want, right @jwodder ?
@yarikoptic
I thought so since we are probably first fetching all the information first about the keys
If by "key" you mean "git-annex key" (not to be confused with "S3 key"), we are not. We fetch data about the entries in S3 ordered by path in paginated batches. Fetching everything before doing anything would not be efficient.
yes -- annex key. Well, I now know that we have some very inefficient behavior in git-annex which, if not resolved, might require us to sacrifice some efficiency in walking S3, I guess -- e.g. by buffering and grouping url additions per key, as sketched below.
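Purely as an illustration (none of this is from the thread; the function name and the shape of the S3-walking output are made up), a minimal sketch of such per-key grouping before feeding `git annex registerurl --batch`:

```python
import subprocess
from collections import defaultdict

def register_urls_grouped(repo_path, key_url_pairs):
    # Hypothetical sketch: buffer (annex key, url) pairs produced while walking
    # S3 (not shown here) and register all urls belonging to one key before
    # moving on to the next key.
    by_key = defaultdict(list)
    for key, url in key_url_pairs:
        by_key[key].append(url)

    # registerurl --batch reads "key url" lines on stdin; emit them grouped per key.
    batch_input = "".join(
        f"{key} {url}\n" for key, urls in by_key.items() for url in urls
    )
    subprocess.run(
        ["git", "annex", "registerurl", "--batch", "--json", "--json-error-messages"],
        cwd=repo_path,
        input=batch_input,
        capture_output=True,
        text=True,
        check=True,
    )
```

The obvious downside, as noted above, is that all the paginated S3 listing (or at least all entries for a given key) has to be buffered before anything gets written.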
To fix this, get git-annex version 10.20220724 (will be released on Monday) or current master.
Then set annex.alwayscompact=false when running this registerurl or anything else that needs to write bulk data to the git-annex branch.
Note that it's not entirely safe to set that if an older version of git-annex can still be used in the repository. How to avoid that, I have to leave up to you.
Speedup should be massive BTW.
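For concreteness, a hedged sketch of applying the setting only for the bulk registerurl run (assuming per-invocation config passed via `git -c` is picked up by git-annex; this only limits where the setting lives, it does not by itself address the older-git-annex concern above):

```python
import subprocess

def registerurl_bulk(repo_path, batch_input):
    # Sketch: enable annex.alwayscompact=false just for this invocation via
    # `git -c`, so the setting is not left behind in the repository config.
    # `batch_input` is the "key url" lines to feed to --batch, one per url.
    subprocess.run(
        ["git", "-c", "annex.alwayscompact=false",
         "annex", "registerurl", "--batch", "--json", "--json-error-messages"],
        cwd=repo_path,
        input=batch_input,
        text=True,
        check=True,
    )
```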
I think I can confirm the speedup (will rerun once again) on the original sample case:
/mnt/backup/dandi/dandizarrs/3b3684de-ea8e-4246-bbc6-7b84380f1bd9
ok, I will consider this one resolved -- we had one big commit with all keys modified in it, so no commit per key.
Looking at what processes are busy on smaug I see that
git-annex registerurl --batch --json --json-error-messages
is one of the busiest in terms of IO. I thought that it might be due to the "flat" journal, so inquired with @joeyh https://git-annex.branchable.com/todo/more___34__filesystem_efficient__34___journalling__63___/ but he thinks everything is fine on that end. I strace'd one of such processes for a few seconds. Full log here. Some observations -- may be some of them could be avoided?
One key, MD5E-s560--82188063b1988362cc3050918f493320, in the case of the 0176fece-87bd-4a63-acb6-c9f57e3c53e6 zarr, leads to:
1. thousands of urls to be added to the same key
2. likely "copying" of that growing journal file for modifications to a tmp location and then moving the new version over, becoming slower and slower, and thus the heavy IO I observe (rough arithmetic below).
That file grew to over 1MB in size at the end (under othertmp/).
Could git-annex avoid that copying and do the changes "in place"? (after all, the location is locked with journal.lck?) I filed an issue with git-annex on that. Any other ideas @jwodder ?
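To put rough numbers on the "slower and slower" effect (a back-of-the-envelope sketch with assumed values, not figures taken from the strace log): if the journal file for the key is copied in full every time another url is appended, the total IO grows quadratically with the number of urls.

```python
# Back-of-the-envelope sketch (assumed numbers, not measurements):
# copying the whole journal file on every url addition makes cumulative IO
# grow quadratically with the number of urls registered for the key.
bytes_per_url = 100          # assumed size of one url entry in the journal file
n_urls = 10_000              # assumed number of urls added to the same key

total_rewritten = sum(i * bytes_per_url for i in range(1, n_urls + 1))
final_size = n_urls * bytes_per_url

print(f"final journal file size: ~{final_size / 1e6:.1f} MB")   # ~1.0 MB
print(f"total bytes rewritten:   ~{total_rewritten / 1e9:.1f} GB")  # ~5.0 GB
```

So a file that ends up only ~1MB in size can still account for gigabytes of writes when it is rewritten thousands of times, which matches the heavy IO observed.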