Config Parse/Read/Write Discussion

GitoxideLabs / gitoxide

An idiomatic, lean, fast & safe pure Rust implementation of Git

Apache License 2.0

8.99k stars 310 forks source link

Config Parse/Read/Write Discussion #31

Closed crutchcorn closed 3 years ago

crutchcorn commented 3 years ago

Hi! I'm a developer that's been tasked to implement GitOxide into a WASM stack in an effort to clone a project as fast as possible in the browser.

As part of that work, I'm looking to introduce a lot of new functionality into the project to get clone working. As an initial task, I'm hoping to get the git-config package able to read and write config files. I've got a (very) WIP port of config parsing logic from isomorphic-git over on my fork.

I won't bother asking for a code review right now, I'm brand new to Rust and have a colleague taking a look at the code right now to give me some pointers on how I can improve my work

That said, I am unsure of what direction we want to go in with regards to data organization. The current output is a pretty unmanageable tangled vector of Sections and Entries alike in a single struct

I've noticed that the current git-config file has both Section and Entry types that look to contain all of the relevant properties, so I'll be refactoring my current code to use that data structure.

Outside of that, I was hoping you could give me some kind of direction when it comes to the different mods in the git-config/lib.rs file:

https://github.com/Byron/gitoxide/blob/main/git-config/src/lib.rs#L57-L72

https://github.com/Byron/gitoxide/blob/main/git-config/src/lib.rs#L131-L217

I'm not entirely sure what borrowed or spawned is referring to (which I admit as a point-of-noobiness), and I'm not sure I track what Span is doing here.

I also would love some insight as to what a Token is referring to here:

https://github.com/Byron/gitoxide/blob/main/git-config/src/file.rs#L4-L23

As I'm not sure an instance that would have a Section and only a single Entry

Looking forward to learning more and being able to meaningfully contribute upstream!

Byron commented 3 years ago

Hi Corbin,

it’s great to meet you and I will do my best to support your work. As I am currently traveling at 300km through China I won’t get to answering the questions right now, but will do in a few hours.

Also I am thinking it might be best to setup a zoom call (or equivalent) to get you started more quickly. As you saw I already laid out the structure, wanting to add the implementation next. Maybe it might be easiest to do that together even so you can chime in and potentially take it from there. That could be a great entry point for some Rust mentoring too, in case you are interested.

Talk more in a bit 😊

Sent from my iPhone

On Nov 30, 2020, at 1:01 PM, Corbin Crutchley notifications@github.com wrote:

Hi! I'm a developer that's been tasked to implement GitOxide into a WASM stack in an effort to clone a project as fast as possible in the browser.

As part of that work, I'm looking to introduce a lot of new functionality into the project to get clone working. As an initial task, I'm hoping to get the git-config package able to read and write config files. I've got a (very) WIP port of config parsing logic from isomorphic-git over on my fork.

I won't bother asking for a code review right now, I'm brand new to Rust and have a colleague taking a look at the code right now to give me some pointers on how I can improve my work

That said, I am unsure of what direction we want to go in with regards to data organization. The current output is a pretty unmanageable tangled vector of Sections and Entries alike in a single struct

I've noticed that the current git-config file has both Section and Entry types that look to contain all of the relevant properties, so I'll be refactoring my current code to use that data structure.

Outside of that, I was hoping you could give me some kind of direction when it comes to the different mods in the git-config/lib.rs file:

https://github.com/Byron/gitoxide/blob/main/git-config/src/lib.rs#L57-L72

https://github.com/Byron/gitoxide/blob/main/git-config/src/lib.rs#L131-L217

I'm not entirely sure what borrowed or spawned is referring to (which I admit as a point-of-noobiness), and I'm not sure I track what Span is doing here.

I also would love some insight as to what a Token is referring to here:

https://github.com/Byron/gitoxide/blob/main/git-config/src/file.rs#L4-L23

As I'm not sure an instance that would have a Section and only a single Entry

Looking forward to learning more and being able to meaningfully contribute upstream!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

crutchcorn commented 3 years ago

Entirely understandable. Honestly, I'm surprised you responded as quickly as you have at all!

I was thinking the same thing about setting up a zoom (or similar) call! I would love to work together to add the implementation - I want to make sure the code standards are up-to-snuff with the rest of the package and would absolutely love some mentoring on my Rust journey!

If we'd like to schedule a meeting privately, I can be reached at {myGitHubUsername} at gmail dot com. Otherwise, if you'd rather, I'm available via DMs on my Twitter

Just as a heads up, our schedule is way different, as I'm located in California, USA, but I'd be happy to modify my schedule and do a (for me) very early morning/late night call in order to make this work.

Thanks so much for being so open and welcoming! Looking forward to working together!

Byron commented 3 years ago

Alright, I thought it might be easiest to add comments in code, a commit should show up in this thread for you to look at. It's definitely not perfect but might already help a little.

Truth to be told, I think I have pretty much sketched out something that I believe I could implement when doing zero-copy parsing using rust-dangerous. The crates author is super experienced with parsers and very nice and responsive, bringing the crate to a point where it would be perfect for gitoxide because of it's ease-of-use and its ability to create nice parsing errors out of the box.

My vision for the implementation is definitely not the easiest, but it's designed to be en-par with the original git implementation which has a lot of bells and whistles, out of necessity I suppose. For instance, it's non-destructive for the most part even in the presence of comments.

Using regex for parsing is a none-starter, which unfortunately rules out a port of isomorphic-git, the reason being that the regex crate bloats any binary and takes a long time to compile which again, would not be competitive enough with the existing git-config implementation in C.

Of course I hope that this doesn't scare you, because if I can do it, so can you. For a lack of time I can't propose some times for a Zoom call right now, but would hope you drop me some suggestions (to the email address I use in commits) of times that would work for you (I am UTC+8), any remaining day this week.

Thanks a bunch!

crutchcorn commented 3 years ago

Love the extra comments, they help a lot.

I think the zero-copy parsing is certainly the way to go. Non-destructive and more performant are huge reasons for going that route.

I am under a bit of a deadline for an initial draft of the cloning work (although I have time afterward to perf optimize), so I'll likely be continuing the development of my regex implementation separately, and working to get the zero-copy parsing implementation mainlined during the same time.

I'll be sending out an email right now to try to figure out a time that would work best for a sync. :D

edward-shen commented 3 years ago

Hey folks, I actually started writing a git-config crate myself, and was going to try a find a crate name until I stumbled that gitoxide had already started one. From the 0.0.0 version tag, I assume this is just a crate name reservation.

For what it's worth, my parser is currently zero-copy and guarantees emitting enough events to identically reconstruct the source git config file, and currently depends only on nom (although I'm sure we can parse out the dependencies if needed).

I'm currently writing a higher level abstraction on top of the parse that allows for semi-efficient insertion and deletion to modify git config files in place.

Would the gitoxide team be interested in merging the work?

Byron commented 3 years ago

Hi @edward-shen, thanks for sharing your work! I am quite baffled by it's completeness and amazed that it integrates with serde. The last time I deep dived into serde my brain caught fire.

Since the implementation here is merely an API sketch along with a 'sample' of a parser using dangerous there is definitely interest in merging your work.

There reason for gitoxide choosing dangerous over nom even though the latter is already used for parsing git objects is the vastly superior error messages it can produce with little effort (by now it even integrates with nom errors to the extend possible).

It's used to parse bytes directly due to git itself never specifying the use of valid unicode and seems to permit pretty much 'anything' as values at least in theory. Thus I never dared to use str but chose BStr instead. That way there won't be any surprises.

The idea was to provide better error messages than git itself in case of malformed files while making it easy (maybe even with a compile flag) to allow other encodings in values or section names - right now these apparently have to be ascii/latin character sets (my knowledge about this is just based on a few manual tests though).

I would be very interested in your thoughts on where the current implelementation/ideas of gitoxide differ from yours. Thanks a lot!

edward-shen commented 3 years ago

Hey @Byron! Glad to hear that the team is interested. I should first clarify that while the crate is named serde-git-config, I haven't hooked up the serde portions yet. Serde explicitly doesn't handle parsing, so I've been currently working on a functional parser first before working working on serde integration. That being said, I am experienced with serde, so that should be coming soon once the higher level abstractions are done.

There reason for gitoxide choosing dangerous over nom even though the latter is already used for parsing git objects is the vastly superior error messages it can produce with little effort (by now it even integrates with nom errors to the extend possible).

I chose nom because the combinatorial parsing strategy was intuitive for me to conceptually understand, and made testing very easy for me. It was very easy for me to quickly debug and unit test, which helpful in getting a parser quickly. That being said, I'm sure we can do some similar strategy for dangerous as well—I just didn't know about it at the time.

Currently, error messages in my work are very scuffed—at best you'll just get an error message of where the parsing fails, but that's an improvement for when more functionality is done. That being said, since we do know where the parsing fails, giving an meaningful error message would just be as simple as bubbling what sub-parser failed at what input, which we can derive display from easily.

It's used to parse bytes directly due to git itself never specifying the use of valid unicode and seems to permit pretty much 'anything' as values at least in theory. Thus I never dared to use str but chose BStr instead. That way there won't be any surprises.

This is a good point that I haven't considered, but I think it's fine to be more restrictive than git in this case. Considering the use case of git-config, I'd doubt people intentionally would want to put something like binary data in a git-config file, and if there is something like that, my opinion is that it's more likely a mistake than something that the user intended. This might be the largest difference in the goals between the two works (semi-opinionated versus universal).

I would be very interested in your thoughts on where the current implelementation/ideas of gitoxide differ from yours

For a little more context, my currently architecture is a two layer implementation, with a lower-level parser powering a higher level wrapper. The lower-level parser only handles emitting text events, while the wrapper itself handles reading and modifying that stream of events. This lets us expose both layers in case users want to use the lower level parser instead of the abstraction. This seems the primary difference in the architecture, since the File implementation in gitoxide seems to do both.

The current "roadmap" for my implementation would be:

implement stable set_value for GitConfig
implement serialization for GitConfig
adding sections anywhere in the gitconfig
serde support
error handling
Even higher level abstractions (e.g. struct for global, user-local, and directory-local gitconfigs given some path)

And probably after all that is done would I feel confident releasing a 1.0 version.

On a side note, is there a reason why you have deny(rust_2018_idioms) in your implementation? Are you trying to hit a specific minimum supported rust version?

Byron commented 3 years ago

Apologies in advance if my reply seems a bit short or concise, it's really that I am short in time but try not to wait too long with my reply here. It would be great to find a way to benefit from your implementation which looked like it adheres to high quality standards.

Currently, error messages in my work are very scuffed—at best you'll just get an error message of where the parsing fails, but that's an improvement for when more functionality is done. That being said, since we do know where the parsing fails, giving an meaningful error message would just be as simple as bubbling what sub-parser failed at what input, which we can derive display from easily.

From my experience with the nom error handling I would expect this to not be super trivial, but from what I have seen so far I would put my money on you being able to pull it off anyway! dangerous was a god sent when I discovered it on Reddit and from that point on I managed to bounce some ideas back and forth with the author who has been super responsive and helpful making it this encounter range among the best open source experiences I have had thus far.

Having helpful error messages is a must for the gitoxide implementation, git isn't so bad at this either.

This is a good point that I haven't considered, but I think it's fine to be more restrictive than git in this case.

Here I would be very conservative and assume that all kinds of potentially uncommon encodings are produced by people editing their config files somewhere in the world, so being more restrictive than git makes it likely that some of them wouldn't be able to use gitoxde on their projects. This is something I want to avoid at if feasible, and here I think it is.

This lets us expose both layers in case users want to use the lower level parser instead of the abstraction. This seems the primary difference in the architecture, since the File implementation in gitoxide seems to do both.

That's true, the parser here is considered an implementation detail. Exposing it is nothing I would be too afraid of though as I am OK incrementing major version numbers liberally on low level crates.

implement serialization for GitConfig

That's something I very much look forward to! It's nothing that gitoxide needs, but it's just cool to be able to serialize to the gitconfig format nonetheless. Besides, it's probably quite convenient to declare configuration naturally with structs, too.

Even higher level abstractions (e.g. struct for global, user-local, and directory-local gitconfigs given some path)

Even though it's not 'sketched' yet I also want a combined Config kind of type which allows access to the git config hierarchy. So that would be a perfect match, too.

And probably after all that is done would I feel confident releasing a 1.0 version.

Thus far there is only one real 'blocker' which is the usage of str instead of parsing bytes. With BStr the (testing) experience is equivalent to using str in case you would consider changing this. However, I do understand it's a big ask especially since having error handling could still block merging if it doesn't end up a little better than the one git provides already with nom possibly being a risk in that regard.

On a side note, is there a reason why you have deny(rust_2018_idioms) in your implementation? Are you trying to hit a specific minimum supported rust version?

I have seen this being used in a crate I love, smol, and adopted it without hesitation. In practice this meant that lifetimes cannot be omitted anymore in structs that borrow when used in function arguments, making code more explicit and readable.

Reading my ramblings, here is the summary:

I would love to merge your work, make you a member of the gitoxide team and see you as the maintainer of the git-config crate.
Not being able to making error handling competitive with git is a risk
Forcing parsing to UTF-8 compatible seems like the only serious blocker. Probably I should consider removing this constraint in favour of having a great and probably fine implementation much sooner than would otherwise be possible, but I am not quite ready to do that yet.

Anything I said should not be understood as discouragement or lack of confidence, as I can only restate that I am excited about the implementation and would love to make it a part of gitoxide one way or another.

That ended up being longer than anticipated, I hope you can find it useful in a way that helps making this work :).

edward-shen commented 3 years ago

No need to rush—I just didn't want us to have duplicated work. Considering the breadth of what you do, it's understandable that you're very short on time, so I definitely appreciate the thought out response.

Having helpful error messages is a must for the gitoxide implementation, git isn't so bad at this either.

I definitely agree, and it's definitely a shortfall in my current implementation that can be worked on. I don't mind using a different crate for parsing, but considering how the parser is already completed in nom, it would be a lot easier to migrate if I can incrementally move things away nom. As I'm not familiar with dangerous, how easy is it to migrate from a nom parser? Alternatively, we could first try to get nice error messages with just nom if migration is difficult?

Thus far there is only one real 'blocker' which is the usage of str instead of parsing bytes.

I'm not too strongly for or against using BStr, so I don't mind using it. If anything, we can always provide a UTF-8 restrictive entrypoint that just accepts a &str instead but uses a bstr as the underlying implementation. I don't think my parser is strictly dependent on any utf-8 feature (if anything, I think it makes it more difficult), so I don't mind moving towards this.

I think I talked myself into supporting Bstr, haha.

Anything I said should not be understood as discouragement or lack of confidence, as I can only restate that I am excited about the implementation and would love to make it a part of gitoxide one way or another.

Don't worry, I haven't gotten that impression at all. Just sounds like a maintainer doing due diligence and getting motivations and goals cleared up ;)

If my responses look good to you, joining sounds good! How would you like to move forward then? I don't mind working on this in my own repo until I get some minimal viable product ready to merge, or work in a branch in the main repo. Whatever works for you!

Edit: Parser now uses BStr, and the Config now accepts any input that can be converted into a BStr :)

Byron commented 3 years ago

Considering the breadth of what you do, it's understandable that you're very short on time, so I definitely appreciate the thought out response.

It's quite incredible how scarce focus and uninterrupted keyboard time have become thees days and writing complex code like…before… is something I can only dream about.

I definitely agree, and it's definitely a shortfall in my current implementation that can be worked on. I don't mind using a different crate for parsing, but considering how the parser is already completed in nom, it would be a lot easier to migrate if I can incrementally move things away nom. As I'm not familiar with dangerous, how easy is it to migrate from a nom parser? Alternatively, we could first try to get nice error messages with just nom if migration is difficult?

This is the best time to see if @avitex , the author of dangerous, has any experience they could share. I have been in many conversations with them which has cemented the knowledge that it's the right crate to use because it's just so good and fast moving to getting even better (and towards 1.0).

Personally I think it would end up being a rewrite, especially since without the use of the zc crate the dangerous::Span type would be required for tokens instead of BStr to avoid self-referential structs. To my mind doing that doesn't add much complexity as the API sketch seems to indicate.

As using dangerous over nom effectively hinges on better out-of-the-box error messages (and by now, my unapologetic fandom :D) I would say it's worth trying to get errors to look nice and be helpful with nom first.

When done right, dangerous errors provide a nice human readable parser stack like 'when reading section foo, the fifth entry's value could not be decoded' (not quite there yet, but it's achievable with the information that's present), and getting this with nom is probably hard to the point where the advantages of nom diminish. But…I also don't really know and I am biased. What I am trying to say is that getting there with nom might be hard and before you run away I'd rather cut back on the quality of parsing errors.

@avitex has done tremendous work reviewing my first steps towards implementing a parser (see https://github.com/Byron/gitoxide/pull/40 and https://github.com/Byron/gitoxide/pull/44), and I feel a bit guilty to think about adopting nom if its good enough even though that's clearly a pragmatic choice.

I think I talked myself into supporting Bstr, haha.

Neat! Converting from BStr to str and dealing with potential failures is very well supported by its API already, so there is no need for an additional API for that.

Parser now uses BStr, and the Config now accepts any input that can be converted into a BStr :)

🙌

If my responses look good to you, joining sounds good! How would you like to move forward then? I don't mind working on this in my own repo until I get some minimal viable product ready to merge, or work in a branch in the main repo. Whatever works for you!

Since you do all the work really I want to do my very best to not add any hurdles and actually slow your progress or enjoyment developing something that essential to gitoxide and the world :).

Here is some suggestions and leading questions:

take a look at the current git-config API sketch and see how it resonates with you or your vision. Maybe it's something to partially adopt or to improve upon.
How would self-referential structs work when &BStr is used in tokens? Is there a way to avoid that entirely?
Is an MVP without Serde support possible? gitoxide wouldn't need it, and for it to be mergeable it would have to go behind a feature toggle as serde wouldn't be a dependency otherwise.
If a draft PR replacing the current git-config implementation is submitted, feedback can be received early by me and possibly @avitex.
Once there is a PR which demonstrates error messages at least en-par with the ones of git, it could be merged early as MVP. Then you would become a collaborator to this repository, sole author and co-owner of the git-config crate on crates.io. (Moving gitoxide into an organization is definitely an option eventually, too). From there you could develop on main or submit PRs for feedback, as usual.

No matter what happens, I firmly believe that this conversation alone is of great value - in the best case gitoxide gains a state-of-the-art implementation of the git config API surface, let alone having you join the roster of maintainers to help build something quite unique and, one may hope, of historical relevance 😅.

edward-shen commented 3 years ago

I took a deeper look at the API sketch and there's an incredible number of similarities between what we have. I think the primary divergent idea would be how the gitconfig editing occurs—the sketch seems to use Edit which applies to a file once all edits are done, while my implementation applies the edits as you make them (Not to the file, but to it's internal representation). It's certainly an approach that I haven't thought of, but it definitely has benefits over my implementation, since you would be able to debug the list of changes that were going to be applied to a file.

The primary concern with that approach though would be handling multivars, which were a pain point for me. I'm not sure how the user would be able to control which instance of the key/value pair would be set through that appraoch.

How would self-referential structs work when &BStr is used in tokens? Is there a way to avoid that entirely?

This was actually a problem I ran into, since the parser actually only returned references. The solution for me to use Cow<'a, BStr> for all event outputs, even if the parser emits borrowed events only. As a result, higher level abstractions (such as GitConfig, or File in your sketch) and insert borrowed or owned values without any problems, thus avoiding the need for self referential structs.

This somewhat leaks implementation detail of higher level abstractions into lower ones, but I think it's better than to have an owned versus copied event enums.

Is an MVP without serde possible?

Yes, of course! I've already made serde an optional feature.

Once there is a PR which demonstrates error messages at least en-par with the ones of git

Now that I think about it, I haven't really seen any meaningful errors from git-config. Do you mind posting a few examples? The only one I've really gotten while working on this were something akin to fatal: bad config line 21 in file <>, but I assume you meant something else?

As a side note, I've been working on getting stuff well tested. After 16 million fuzzer iterations without error, I think I'm semi-confident that my implementation is panic-free, but I'm going to leave it running overnight to make sure. ;)

Byron commented 3 years ago

Thanks for all your effort - it's more appreciated than has been expressed.

The primary concern with that approach though would be handling multivars, which were a pain point for me. I'm not sure how the user would be able to control which instance of the key/value pair would be set through that appraoch.

Even though the API might be lacking, unintuitive or incorrect when actually trying to implement it, the idea is that every owned entry is either new or was created from an existing borrowed entry. This allows these updates to know where they belong. Ultimately borrowed sections or entries know their index in the list of parsed tokens, which is immutable.

Doing edits by pre-recording them and applying them during serialization certainly is complex and maybe could be simplified by letting Edits keep a mutable copy of all tokens. Definitely something to consider in an MVP.

In short: should work, but who knows 😅.

Please note that Edit types are required because nothing is ever owned in this implementation. Values are transformed/decoded only on demand, which is where Cows come into play, whereas new values need their own type to be representable.

No lookup tables or acceleration structures are used to avoid allocations where possible and I take a wait-and-see stance to add them only when there is actual need.

This was actually a problem I ran into, since the parser actually only returned references. The solution for me to use Cow<'a, BStr> for all event outputs, even if the parser emits borrowed events only. As a result, higher level abstractions (such as GitConfig, or File in your sketch) and insert borrowed or owned values without any problems, thus avoiding the need for self referential structs.

I truly believe this delays the inevitable problem: Either each access to git configuration creates a temporary GitConfig to keep the values on the stack, avoiding self-referential structures, or something has to keep both, the buffers read from disk as well as the parse result.

In the sketch that's the File abstraction, whereas your version avoids this, pushing the issue further down the line. The rental crate has nice documentation for this as well as a way to workaround it.

Being able to parse once and keep the buffers is the reason the sketched API looks the way it looks, a lot of effort it creates for itself to avoid allocations and keep things minimal in memory. git itself doesn't even do that and seems to allocate everything, like String everywhere :). Maybe it's a bit overdone, but I think doing that gives gitoxide an edge over git further adding to its relevancy. With some excitement I can say that all git object parsing is done using a similar technique, separating borrowed and owned versions, making parsing nearly free and so fast that it leaves git in the dust. When looking at their implementation you will see a lot of allocations happening to parse objects, and special methods to parse only parts of it to accelerate hot code. In Rust that's not needed, while being safe to use (after all, handing out references/pointers in C is probably difficult to manage).

Anyway, ranting :D, but in short a good test would be to try keeping the buffer within GitConfig.

Now that I think about it, I haven't really seen any meaningful errors from git-config. Do you mind posting a few examples? The only one I've really gotten while working on this were something akin to fatal: bad config line 21 in file <>, but I assume you meant something else?

Yes, 'only' these would be the minimum, and with nom something like it shouldn't be a problem either I suppose. Maybe also with nom one can be more specific to point at a specific sub-token for instance, but to me that's entirely optional.

As a side note, I've been working on getting stuff well tested. After 16 million fuzzer iterations without error, I think I'm semi-confident that my implementation is panic-free, but I'm going to leave it running overnight to make sure. ;)

Pretty neat! I myself never got around using an actual fuzzer, but resorted to 'stress' tests that verify big repositories. A few issues were found that way, but of course, tells nothing about malformed input.

I hope this helps.