fork-dev / Tracker

Bug and issue tracker for Fork for Mac

Improve performance of ls-files usage in 2.35 for sparse indexes. #1552

Open scratchee opened 2 years ago

scratchee commented 2 years ago

Git has a new feature, the sparse index, which allows significant performance improvements for sparse checkouts of large repositories. It has been rolled out gradually over the last three releases and, as of git 2.35, covers most common commands.

Sadly, it doesn't play very well with Fork at the moment, and the main culprit appears to be Fork's use of git ls-files.

git ls-files, unlike most git commands that operate on a worktree, needs to list every file in the repo to match its historical behaviour, even files excluded from the sparse checkout, so calling it forces git to expand the sparse index into a full index. The git devs have now added a --sparse argument to git ls-files that lists only the entries actually in the (sparse) index.
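For anyone who wants to reproduce the setup being described, here's a minimal sketch in a throwaway repo (assuming git >= 2.35; the directory names and files are made up for illustration):

```shell
# Build a tiny demo repo with two top-level directories
tmp=$(mktemp -d) && cd "$tmp"
git init -q
mkdir -p app vendor
echo 'int main(void){return 0;}' > app/main.c
echo 'third-party code' > vendor/lib.c
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm 'initial'

# Restrict the checkout to app/ (cone mode) and opt in to the sparse index
git sparse-checkout set --cone app
git config index.sparse true

# vendor/ is now gone from the working tree, but plain `git ls-files`
# still expands the whole index to list it; --sparse avoids that expansion
git ls-files --sparse
```

On a repo this small the difference is invisible, but on a large repo with a sparse index the --sparse form skips the ensure_full_index expansion entirely.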

$ time git ls-files --sparse > /dev/null
git ls-files --sparse > /dev/null  0.03s user 0.01s system 92% cpu 0.045 total
$ time git ls-files > /dev/null
git ls-files > /dev/null  3.31s user 0.11s system 66% cpu 5.172 total

As can be seen, for my sparse worktree this knocks the cost of git ls-files from over 5 seconds down to basically nothing (yes, my repo is very large ☹️).

This is an Instruments time-profile stack trace of the ls-files command run by Fork, showing that it's the ensure_full_index function (which expands the sparse index) that causes the stall.

(Screenshot 2022-02-01 at 13:20:01: Instruments time profile of ls-files)

Since Fork appears to call ls-files a lot (every time it refreshes the file list, so basically whenever you look at it funny), this causes quite a lot of waiting for spinners. ls-files was always slow for me, but it wasn't quite so bad before the sparse index (it used to take about 2s, IIRC), because the index was already expanded on disk, so there was no need to spend time expanding it in memory on every call.

Note that ls-files --sparse does list the directories that mark the sparse boundaries, and those would need to be filtered from the output (I'm not sure why the git devs only provided this mode and not one that lists just the sparse-included files). That said, you presumably already filter out non-sparse files somewhere between the internal command and the UI, so perhaps the directories would get filtered out anyway.

From git help ls-files

   --sparse
       If the index is sparse, show the sparse directories without expanding to the contained files. Sparse directories will be shown with a trailing slash, such as "x/" for a sparse directory "x".

So effectively the command would be something like git ls-files --sparse | grep -E -v -e '/$' (i.e. exclude any listed "files" that end in a /, since they are just the sparse directories from the sparse-index format).
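To illustrate the filter, here it is applied to some hypothetical --sparse output (the file names are made up; in a sparse index only the trailing-slash entries are directories):

```shell
# Simulated `git ls-files --sparse` output: `thirdparty/` is a sparse
# directory, so the grep drops it and keeps only real file entries
printf 'README.md\nsrc/app.c\nthirdparty/\n' | grep -E -v '/$'
```

This prints just README.md and src/app.c, which is what the file list in the UI would want.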

This is new in 2.35, and the sparse index is not the standard setup (sparse checkouts are never going to be the standard scenario), so I understand if you don't want to prioritise this, but it would make Fork perform much better for me, so I'd appreciate it if you could get to it at some point.

Whilst there are probably some other commands that also trigger sparse-index expansion (at least git apply when staging changes from a diff), I think ls-files is the only one used heavily enough that fixing it would give a significant speed-up (plus it seems pretty easy to fix, hopefully!).

Thanks

scratchee commented 2 years ago

If you want a "business case" for this fix, then consider that currently, if someone's repo is too big, Fork becomes painful to use as a GUI, and there's no workaround other than hacking up the repository or switching to a different GUI.

But with this fix, a much simpler solution becomes viable, which could become the recommended solution and allow for more customers. Think of all those big companies with monolithic repos!

Obviously, I am biased in favour of that logic, so take it with a pinch of salt :).

DanPristupov commented 2 years ago

Thank you for reporting this problem and for the deep investigation.

I'm very interested in making Fork work well with sparse index.

Sadly, it doesn't play very well with fork at the moment, which appears to be primarily because of usage of git ls-files within fork.

Well, Fork calls git ls-files --unmerged, but I doubt it can be slow with the --unmerged flag.

We don't really need that request, to be honest 🙂. I've been planning to remove it but never had the time. IIRC, I only left it in because it barely takes 30ms (even in the 30GB Chromium repo).

Or do you mean, it is slow in your case because it needs to rebuild the full index (ensure_full_index)?

Can you show the result of time git ls-files --unmerged > /dev/null? And just for my curiosity: time git ls-files --sparse --unmerged > /dev/null

scratchee commented 2 years ago

Hey, thank you, great to hear you're interested!

So with sparse index enabled:

$ time git ls-files --unmerged > /dev/null        
git ls-files --unmerged > /dev/null  3.06s user 0.10s system 98% cpu 3.201 total
$ time git ls-files --unmerged --sparse > /dev/null
git ls-files --unmerged --sparse > /dev/null  0.02s user 0.01s system 86% cpu 0.029 total

with sparse index disabled (no effect at all, as I'd expect):

$ time git ls-files --unmerged > /dev/null         
git ls-files --unmerged > /dev/null  0.14s user 0.07s system 94% cpu 0.224 total
$ time git ls-files --unmerged --sparse > /dev/null
git ls-files --unmerged --sparse > /dev/null  0.14s user 0.07s system 95% cpu 0.216 total

So it looks like ls-files only suffers because of ensure_full_index, and is otherwise fast even on my large git repo.

Sorry, looks like I unintentionally lied earlier,

ls-files was always slow for me, but it wasn't quite so bad before the sparse index (it used to take about 2s, IIRC), because the index was already expanded, so there was no need to spend time expanding it in memory on every call.

I should have said:

the "Refresh" option in fork was always slow for me, but it wasn't quite so bad before sparse-index (used to take about 2s iirc)

I conflated things without checking, sorry for confusing the issue.

I did just check to make sure, comparing with the sparse index turned on and off (still using a sparse checkout either way; a full 200GB checkout would definitely not be fun!).

With a sparse index: Refresh takes about 7s, with at least 3.5s spent in ls-files and at least 190ms in status.

With a full index: Refresh takes about 2s, with at least 170ms spent in ls-files and at least 800ms in status.

(The time spent in ls-files/status is "at least" since I'm just judging from an Instruments time profile, so it only counts CPU time.)

So all in all, I'd expect using ls-files --sparse to bring my refresh times down to sub-1s 🤞

For reference: the repo is 80GB, a full working copy is 200GB (2 million files), a sparse working copy is 3GB (75,000 files), and the index file is 160MB (5MB for the sparse index).

DanPristupov commented 2 years ago

I'll make a build so you could try that.

DanPristupov commented 2 years ago

Here's the build: https://cdn.fork.dev/prerelease/Fork-2.15.11.dmg

I just added --sparse to ls-files and haven't tried it.

scratchee commented 2 years ago

That works: refresh takes ~1s, the ls-files call barely shows up in Instruments during a refresh, and after I caused a conflict, Fork seems to behave the same when showing unmerged files.

So this works great as far as I can tell, thank you!

DanPristupov commented 2 years ago

You can keep using that build for now. Hopefully I will remove ls-files completely in the next update.

Could you tell me later whether using the sparse index improves the overall experience for you? If it really does, we can make it friendlier to use in Fork.

In particular I'm interested in seeing real numbers for time git status --porcelain --untracked-files=all > /dev/null for both enabled and disabled sparse index.

scratchee commented 2 years ago

Thanks, will do.

Yes, I'm planning to stick with sparse index for a while and see how it goes.

sparse index:

$ time git status --porcelain --untracked-files=all > /dev/null
git status --porcelain --untracked-files=all > /dev/null  0.08s user 0.50s system 258% cpu 0.228 total
$ time git status --porcelain --untracked-files=all > /dev/null
git status --porcelain --untracked-files=all > /dev/null  0.62s user 0.37s system 121% cpu 0.818 total

To be honest, I'd say it's still a little too bleeding-edge at the moment to deserve much focus; there have been big improvements, but I've found one or two commands that still go through the ensure_full_index path unnecessarily. Faster status/commit/add etc. makes it a win for me, but it's not yet as clear-cut a win as sparse checkouts and cone mode have been.

scratchee commented 2 years ago

Sorry for reviving an old issue, but I thought it worth mentioning that with the latest version of git the sparse index feature feels a lot more complete, so my warning in the previous comment that it might be too bleeding-edge no longer applies.

The latest big improvement was to stash performance, which went from around 4s without the sparse index (and 8s with it) down to 0.5s with the latest git version's sparse-index support, a massive improvement.

So if you did want to experiment with deeper integration with sparse checkouts and the sparse index, I'd say it's about ready for that.