Open bradlarsen opened 1 year ago
Note, this feels similar to the problem I reported in #950.
I have seen this sort of tree parsing error arise with gix
several hundred times when examining the content of ~7500 Git repositories using Nosey Parker. They probably are all failing to decode in the same way.
Thanks a lot for this fantastic report! I see how much time you must have spent not only to compile it, but also to debug the issue at hand down to this point.
I also see how there is two issues here, one regarding verbose-object-parsing-errors
not working in gix-object
, and the other about trees not being able to be parsed.
For the first of the two, a test was always missing, presumably, so it was missed during the winnow
conversion, and I am already curious what surprise could lead to trees not being parsed correctly. After all, they truly have the simplest format.
I will definitely keep you posted about the progress on this matter.
Could you also elaborate where find_tree
is called, or where in NoseyParker it runs into this issue?
I am asking because doing so allocates a vector, and maybe the Iter
version of this will suffice as well to avoid the allocation - it could be 2.5x faster then.
The fix is now available in the linked PR, and it should land in main
soon as well.
It would be great if you could validate it's working for all repositories now, and if not I am happy for more samples that won't parse.
Thank you.
Could you also elaborate where
find_tree
is called, or where in NoseyParker it runs into this issue?I am asking because doing so allocates a vector, and maybe the
Iter
version of this will suffice as well to avoid the allocation - it could be 2.5x faster then.
The specific call is in here. The code iterates over the object IDs in the ODB, reads each header, and switches on object type. In the tree-handling case, odb.find_tree
is called using a scratch buffer that is allocated outside the loop.
What's this Iter
version you mention?
Some background:
Nosey Parker detects secrets in Git repos by scanning each blob.
Since its v0.14.0 release in August, it now additionally investigates commit and tree objects within Git repositories in order to calculate detailed metadata for each blob it scans. This metadata for each blob includes the commit that first introduced it and the pathname it appeared with. It uses a novel graph algorithm (?) to efficiently determine this metadata for all blobs, many orders of magnitude faster than you can with regular git
commands.
The big idea in the algorithm is to build an in-memory representation of the inverted Git commit graph (allowing quick lookup of child commits), then traverse that in topological order, determining for each commit the set of blobs that were first introduced there, with respect to all other commits in the repository.
The implementation uses an adaptation of Kahn’s algorithm for topological traversal, combined with a priority queue ordered by commit node out-degree. The priority queue approach heuristically minimizes memory use by keeping the work list of nodes that still need to be visited small. Additionally, the implementation uses data representations that provide good CPU cache behavior and constant factors, such as compressed bitmaps to keep track of which blobs have been seen at each commit.
@Byron thank you for the very fast response and patch!
What was the root of the problem? Was that tree object badly formed somehow?
I've tested Nosey Parker with the main
branch of gix, and that reduces the number of warnings about trees that failed to parse, but I still see a few.
From https://github.com/xilense/Home-AssistantConfig:
04b414cb43033d8a97ef92100f4af19bb419f88f
06d9a45dcfcf0d938240f43fefc9990a25efee80
0839676ca0bc4c5bbe6d8ce772cece8e316b9e8f
0a5965092a8cf976fc984a04b608daa7a1fad05d
0a5fe8ec7728b76fc69c408d2de657bc01f8ec9f
0bcdafad3d373b39f3cfa3e1278ae2d05e353ca8
0ceecead3b46d9f8c2d753cb4bee4957955f3abf
160b4512a93583b8bcc3bb1064d2b1eb51759c7a
1847f1b0b4a6424cb15296f5505acd2720d8eeba
1e3e1343e99fd8c4448357d5fb1a82427e374790
1f2ce61f8db31c6284572509fb8ea11932aa4779
27b6a959ab9c3ca50fe9474bb6280ab339a7ea6f
280d1cd88aab93e3d80098340e2b9dc653a9087a
34018453297e23d63b0ea6ee5faa8a2ff117cea6
35cb2329ffbd1a02f18423026ef1c2f6aeb9607e
3ebb555922bffdcabb8b5e3695ab4dd379d1786f
4dc8ad60994fdaf720ad11b6e6fc9edb9b8f2e6f
4f71b80898f86e87f909aa3910560430c1f87734
4fc02d36b9ada1a540ae7c7a391fd0d7b6f1aff1
523751ac99f5c0569665a7023c9a10ee920126b3
56cd766a03d309033d3e8f7461413ca94f3cfa9d
604d88d7dcbd09830b76ea28408d950cdd139eec
6319b401421a883e741aa0a560dd19f1f71201ab
6aaa446307c99419b58ed044f7a2e10291bcd3f2
6c398b39f707fac3abdedcc8035238cf9d87215f
6e34f8682997ec0374698b9c3e5b6bac45d1c9b0
6fa897cc55b09d2d6bb7db8a97b1359d1aee4339
7416819d785bd3a49a9ad0ad7b3739c4e3a28210
775a527b4797a47b417a90ac4e34d90082f6a296
8d8d38da677ac2f5ab89d9a95a1431d69a40fa93
90a1ba6add24929060e98b218f925f045bf3b349
9add112007d208f22dffaeef93a9e613f0706063
a63b18121c403b16d4a1e1657f564ef2566004b7
a890291c9b89ef432b3819271c0dab028f12128c
aad6c296c49388f83214315017118d54fe461e45
ad10501c1759f0f2be86dc3dae4d8f260b32af35
b080b4150d566f32995446e93b186cb1fc1a6791
b18418a478e08aa1dafbcd8bf377adb539575c8f
b25522729c3edf7a672f4112457b3736e221671f
bfbf77661b6ed14cbdd931e078565fb1b5881aa1
c65a59a3174e365170bd8ac6978d07db0adf7412
cd85c0cd9b01a99ad95368d5e0ebaca500fe1363
cf6b28bb633b430ee7271eaaff503df4c9e221d9
da1eebe232c6712fe6c05c5e5c1785951699121e
da80d3e09a73967abb833f9992361dcfe9177081
da87ef555b3493e7e6c423816ed5384d751cdf68
de2f41917e7909561c44d1570fb31da1b9dcdb2b
e068f5ec97fc2d5b0fce7148bab83129ac3324e2
e41c01d52df382aea7ad1991965bb6391932f6db
e8fa1098f8df5575999eb33a4433d7cfc4572fe1
ee35bba3ae50d1f5ac89138d45969a1d070a4264
f196ff373d8c07be2529d81c038f7115ebd58c2f
f8b8fe751b6d038edc381571f00fc65eb902a0b0
f938fa761e46b85a8a3cd847b05ee6400e43eb7c
feb85c0e834ef430434d0331ee0fd9448bd6c4c9
From https://github.com/akram/openclassifieds2:
4fc35139d3bb25c2108e45442ba8064dbefbe22a
656aafda80e497a3dd833d4b9d7854b03829aa97
bb7e32fd8be7e0ed0c9e5440789c050622c715fa
f62f6c1510e608a99a34cfb4323039522fe38df4
The specific call is in here. The code iterates over the object IDs in the ODB, reads each header, and switches on object type. In the tree-handling case,
odb.find_tree
is called using a scratch buffer that is allocated outside the loop.What's this
Iter
version you mention?
In this line it shows that ultimately the code wants to iterate entries, so there is no need to allocate a Vec
of these.
Instead, TreeRefIter
can be used to parse each entry on the fly, which also has the advantage that you'd get at least some of the entries until the first parse failure.
Thanks so much for sharing the algorithmic approach that's implemented in NoseyParker, I am loving it! Even though it's way above my head, it gives me hope that some of these algorithms can also be used to speed up gitoxide
in some places where they are applicable.
What was the root of the problem? Was that tree object badly formed somehow?
I don't know actually. There were two parser implementations, one with winnow
that is based on the nom
version before it, and a fast-path which is just a manual implementation. The latter worked out of the box, so I just removed the former entirely. Trees are very simple objects and probably best parsed 'by hand'.
Thus I am curious what these other failing trees look like - I will let you know once I know more.
I did the fix properly this time and fixed another shortcoming that would have bitten some other time, I am sure. With that debt removed, I'd hope these kinds of failures are nothing you will see ever again. Let's see what you find out though :D.
I have updated a local development copy of Nosey Parker to use the latest gix
from main
, with your additional fix in 8d05cae. I had to make a small change there to get the code compiling again for the small API changes (EntryKind
vs EntryMode
).
Anyway, I ran over my 7500 input repos again, and this time saw no tree parsing errors. Thank you @Byron! I'll get those changes on the next gix
release.
I am glad to hear that!
I had to make a small change there to get the code compiling again for the small API changes (
EntryKind
vsEntryMode
).
Yes, those really had to happen to remove the hopefully last 'wart' from one of the earliest crates in the family. These sometimes have terrible shortcuts in them as back then, my own quality standards for what was a pet-project were much lower.
I've also switched to using odb.find_tree_ref_iter
instead of odb.find_tree
. Every performance-conscious detail like this matters!
Absolutely! If you ever run into a profile that indicates something in gix
can be improved, please don't hesitate to let me know or contribute the improvement right away - wasting resources is something gix
definitely doesn't fancy.
Current behavior 😯
There are certain Git repositories that contain tree objects that
gix
fails to decode, but whichgit
itself has no trouble with.For example, from a test program included in this issue:
Additionally, it appears that the
verbose-object-parsing-errors
feature ofgix-object
is broken when usinggix
newer than 0.51.0 andgix-object
newer than 0.33. Rather than getting a detailed error message when an object fails to parse, the error message isNone
, or rendered as an empty string. This seems to coincide with switching away fromnom
for object parsing.Expected behavior 🤔
gix
should correctly parse the tree objects.Git behavior
git
seems to have no trouble!Steps to reproduce 🕹
For example, see https://github.com/jsbin/jsbin, which has a few problematic trees:
5e22e05094a0626934cb2f7319e92da2d343e726
7dde39c37eea1cc6f81a04ba50e2ef55333c6966
ad20e1886a324465093f656a0910b9fe00429977
e0d917771d3c3ffb7782158d9f33b2b7e9c6c524
Locally, I cloned the repo to
/disk1/clones/jsbin/jsbin
usinggit clone --mirror
.We can inspect the second problematic tree (
7dde39c37eea1cc6f81a04ba50e2ef55333c6966
) with Git. It is the root tree of this commit: https://github.com/jsbin/jsbin/commit/884c72e9e45cff3da50da80afcc12e6f5e523361Git says its payload is 659 bytes long:
I wrote a small
gix
test program that demonstrates the problem.Cargo.toml:
src/main.rs:
Running this for the second bad tree object noted above, we get this: