PoiScript / orgize

A Rust library for parsing org-mode files.
https://poiscript.github.io/orgize/
MIT License
278 stars 34 forks source link

discussion: drop indextree #18

Closed PoiScript closed 7 months ago

calmofthestorm commented 4 years ago

Continuing per #13 , I don't have a strong opinion on the use of indextree as a user of this library. Despite having some exposure in APIs, I have been able to ignore it as an implementation detail.

The reason I want to own some of the elements is simply to use them in my own logic, which means being able to easily pass them, put them in structs, etc. That is annoying when lifetimes are involved, and these are often small structs (such as Timestamps) where the clone is cheap.

The indextree has not been a problem at all; what has been frustrating is the semantics where Orgize doesn't clone any of the strings, and thus every element, down to the smallest, needs a lifetime tying it to the Org and to the source string (though #8 solved most of this). This is an important feature for performance, and exhibits the typical design I've seen in Rust where APIs need to leak a bit more complexity (e.g., passing the Org into most methods), but in exchange, the user gets significantly better performance with safety. I think that parse_string and #13 address most of my concerns here, however.

With #13 , I now have the ability to simply clone any element I want into a value with a static lifetime, which solves my problem. I get the benefits of Orgize itself making minimal copies, while being able to be lazy and clone if I want to make that tradeoff. This seems fine.

I have not yet tried modifying the tree structure much with Orgize, mostly just tweaking timestamps and the like, so I don't have much experience with the mutating API. But Headline has no lifetime, and has wrappers that allow mutation, allowing me to ignore the somewhat confusing indextree API almost completely.

How would it work if you got rid of indextree? Would you use Rc/Arc/weak? Your own structure that provides some analogue to NodeId? Nodes owning their children directly?

IMO, owning their children directly is a nightmare for any mutating API, unless it is written as a side-effect free API (i.e., every tree edit command returns a new tree). I don't like that for Org because it is convenient to be able to, e.g., hold handles into nodes in the tree and change them.

My opinion is that ultimately, Org files are small text files, so performance is not a huge concern. That said, Org libraries are often used to script complex changes which may not get great test coverage by people who are doing this in a spare moment here and there, rather than building dedicated libraries. My reason for moving off Python was not poor performance; it was blazingly fast compared to elisp. I missed a bug in the parser for the longest time that caused it to parse the files n^2 times (each file each file) rather than just once each, because it was still* so fast compared to the problem at hand. My reason was that I got sick of trying to write reliable software in Python. So if I were writing an Org parser, I'd personally prioritize ease of use over performance. I don't think indextree especially hurts ease of use, but I do think that something like nodes owning their children directly would.

So basically I think it's largely up to you as the implementer -- internally, has indextree been convenient, or a pain? I don't think it's adding much complexity to the API, and if it can prevent the introduction of more lifetimes into the API, that itself is a big win IMO.

PoiScript commented 4 years ago

Sorry for the delay, and thank you for the long and detailed feedback.

I can see the biggest concern you have is about lifetime. And I'm afraid it's not much related to indextree: as long as the Element struct contains a lifetime parameter, so does the arena.

So basically I think it's largely up to you as the implementer -- internally, has indextree been convenient, or a pain?

I wouldn't say it's plain to use indextree - it's definitely a good crate but just not suitable in some situations.

The biggest concern I have is lack of ability to type check the whole Org struct. For example, you may append a Paragraph node to a Text node by mistake and it'll eventually break anything like rendering. To prevent that, we have a validate function to ensure everything in its right place. But there're still some check I haven't get a choice to add and that can still lead to some potential bugs.

IMO, owning their children directly is a nightmare for any mutating API, unless it is written as a side-effect free API (i.e., every tree edit command returns a new tree).

I'm a little bit confused about this line. If you're really need to return a new tree, you can simply clone a new one before applying any changes.

How would it work if you got rid of indextree? Would you use Rc/Arc/weak? Your own structure that provides some analogue to NodeId? Nodes owning their children directly?

One idea I had is nested struct, it is verbose, but simple and straightforward. For example,

struct Headline {
    level: usize,
    title: Vec<TitleContent>,
    section: Vec<SectionContent>,
    contents: Vec<Headline>,
    // ....
}

enum TitleContent {
    Text(Text),
    Bold(Bold),
    // ....
}

enum SectionContent {
    Paragraph(Paragraph),
    Table(Table),
    // ....
}

I'm not sure if this is what you called "nodes owning their children directly". If so, why would you say it "hurts ease of use"?

After all, I haven't made a decision yet. To be honest, I don't have much free time recently, so I don't want to work on this too soon until we find a clear solution.

calmofthestorm commented 4 years ago

The changes you've made (parse_string and merging my Clone PR) have solved my lifetime issues. I can let Org manage the string, and clone to give any specific parts I care about a static lifetime. It's also great to be able to do the parse and process with Org borrowing my string and not needing to copy any of it.

I do want to be clear, since you note you don't have a ton of free time and want a clear solution, that I'm not sure this is really a problem. I don't think you need to redesign the API to provide strong typing for this to be a useful library. I think that a validate function to check that each element is still that element (or, stronger, that writing it and then reparsing it gives the same/similar value) is good enough. I can give my thoughts because it's interesting to discuss, but I think the API is fine as is.

Strongly typing an AST is a neat idea, and I can think of a few ways (the nested structs you mention; another option is a builder pattern where building does the type check, and then the built struct is immutable to the user, and thus guaranteed to always be valid without needing to encode all mutations as strongly typed).

But I also think it's not a good philosophy for Org mode, where the spec is simple, ambiguous in places, and org-mode isn't even consistent with itself in many cases, using different code paths for different commands (e.g., keywords are case insensitive for a sparse tree, but case sensitive for org-agenda). The philosophy of org-mode is that it will just accept whatever you throw at it and do its best to make it useful.

I'll grant that if you treat org-element as the spec it may be possible to be completely unambiguous, but there will still be inconsistencies with org-mode, and not just weird edge cases. For example * Hello[#A]World will be treated by org-mode as having valid priority cookies. org-priority-up will change them, org-agenda will sort by them, etc.

So as a user, I have the concern that even a completely unambiguous, strongly typed AST, may not do a good job of capturing my files, and it would also be a lot of work, and could lead to a more complex API, since now every mutation has to go from valid tree -> valid tree. Even with just the headline, that can be frustrating -- if I create a new empty headline, I must set the title before I set its tags, since per https://github.com/PoiScript/orgize/issues/17 tags with an empty title is invalid.

I think that strong typing, or at least rigorous validation, is a good idea for the context-free parts of the grammar. Prevent users from turning a headline into two by accident, or introducing invalid level structure, removing the *** so a headline becomes not a headline, etc. Then be a bit more flexible in how you handle the context parts in sections, etc. Accept that the parser (and user code) will never be bug free for anything that complicated, and work with that by ensuring that a problem in one headline can't corrupt another, rather than trying to write down a complete spec for all of org-mode.

My comment about "owning their children directly" is for tree manipulation. Consider what happens if you have a tree or linked list in Rust where each node owns its children directly -- you can't change tree structure while holding any references to elements of the tree, unless you use a solution like Arc+Mutex, Rc+RefCell, or a layer of indirection like indextree. This makes it hard to write code that operates on trees, since you have to refer to elements by storing a path from the root or whatever.

I was referring mostly to Headlines owning lower-level Headlines beneath them, but I can also see this issue for your example -- I can't, say, hold a reference to the 3rd section while changing the 1st, or to the Title while changing a section. If the syntax isn't too deep, then it's not so bad to keep track of indices, but once you get into recursive structures (Headlines can have child Headlines), it makes it harder to structure your code.

I like the idea of decoupling lifetime of recursive structures from the elements themselves, as indextree does. Then if I remove a child headline from its parent, both are still valid. I can keep changing the now-orphaned child, or attach it to another headline, etc. This is how Orgize handles headlines today, and I think it works well.

calmofthestorm commented 4 years ago

If you really do want to go down the strongly typed path, by the way, I'd consider builder pattern + immutable Elements. Then there is a builder for every Element, and you can enforce value dependent invariants when you Build the element without trying to make the AST strongly typed at the type level.

The "builder" doesn't even need to have all the Rustic setters; it could just have public fields the user sets directly.

calmofthestorm commented 4 years ago

I realize the above may seem a bit at odds with all the bug reports I'm filing for edge cases. I care a great deal about parsing the tree structure and headlines correctly, but Sections and anything under them scare me less than being one * away from turning hundreds of headlines into one or whatever.

PoiScript commented 4 years ago

I do want to be clear, since you note you don't have a ton of free time and want a clear solution, that I'm not sure this is really a problem.

It's hard to estimate at this moment. But I'll try to make it smoothly.

So as a user, I have the concern that even a completely unambiguous, strongly typed AST, may not do a good job of capturing my files, and it would also be a lot of work, and could lead to a more complex API, since now every mutation has to go from valid tree -> valid tree.

I see. I just realize that validation actually cannot completely rely upon type checking. For example, we still need to manually make sure that each headline's level is greater than its parent's level. So, we may still need a validate function or something similar.

I'll grant that if you treat org-element as the spec it may be possible to be completely unambiguous, but there will still be inconsistencies with org-mode, and not just weird edge cases.

I'm agree. Basically, it's impossible to be compatible with all of them. So I think we can just choice the simpler and less awkward way if the specifications is not precise in some cases.

My comment about "owning their children directly" is for tree manipulation. Consider what happens if you have a tree or linked list in Rust where each node owns its children directly -- you can't change tree structure while holding any references to elements of the tree, unless you use a solution like Arc+Mutex, Rc+RefCell, or a layer of indirection like indextree.

Ok, I got it. It's actually yet another inconvenience I encountered when using indextree. Sometimes you may want to iterate and modify some elements at the same. But it's impossible since iteration requires a reference to the arena and modification requires a mutable reference to it. One workaround is to collect all NodeIds to a Vec and then iterate through it. It's ok, but a bit weird.

Then if I remove a child headline from its parent, both are still valid. I can keep changing the now-orphaned child, or attach it to another headline, etc. This is how Orgize handles headlines today, and I think it works well.

I see. Attaching and detaching node is certainly the key feature missing in nested struct.

Finally, thank you for your other suggestions. Sorry for not able to reply them one by one.

After some thinking about it, I'm afraid we still need to sticky with indextree for now. Even though we have get rid of indextree, there're still need to do some hack. indextree is not perfect, but it comes with an acceptable trade-off.

As for this issue, I'll just keep it open. I'll still trying to find a good replacement for indextree.

calmofthestorm commented 4 years ago

I just realize that validation actually cannot completely rely upon type checking. For example, we still need to manually make sure that each headline's level is greater than its parent's level. So, we may still need a validate function or something similar.

Strictly speaking, this is possible using something like https://docs.rs/typenum/1.12.0/typenum/ . I have no idea what it would be like in practice, however, to use something like that. I'm guessing some combination of slow compile time and very complex types being passed on to the user making it hard to write generic code. Almost certainly would be painful to actually implement, but a neat thought experiment.

Agree on your other points, thanks for the response.

PoiScript commented 4 years ago

Speaking of typenum, I actually had a bad experience with it. That was when I tried to implement some crypto functions with block_modes and aes-soft and was struggling to set a key and iv properly.

PoiScript commented 7 months ago

Starting from v0.10, indextree will be replaced with rowan. See https://github.com/PoiScript/orgize/issues/70