PoiScript / orgize

A Rust library for parsing org-mode files.
https://poiscript.github.io/orgize/
MIT License

Construct Org tree #54

Open samyak-jain opened 2 years ago

samyak-jain commented 2 years ago

Is it possible to somehow manually construct the Org AST and then use the write_org method to convert it back to an org file?

The use-case that I'm trying to accomplish is the following:

  1. Use orgize to parse an org file into the Org struct
  2. Serialize and store this information in a DB
  3. Make modifications to the DB as needed
  4. Deserialize this back to an Org struct
  5. Convert this back to an org file using the write_org method.

Currently, I am serializing and deserializing to JSON using serde. I had to fork and make some changes to get that to work: https://github.com/samyak-jain/orgize/tree/deser. I then store this JSON directly in SQLite. However, this setup is really cumbersome to work with. The indextree data structure, when serialized to JSON, doesn't lend itself well to structural changes. The same element is duplicated many times across the structure. While it may be a good way to store the representation, it is really hard to modify.

I would be interested to know thoughts on:

  1. Is there an easier-to-work-with representation that we can convert back and forth from? I know there is an iterator over events, which seems to be slightly better, but I don't see a way to get the Org structure back from that.
  2. Having APIs to more easily modify the Org structure directly? This isn't as ideal, since I cannot independently modify the DB and will have to rely on deserializing to the Org struct and then making modifications every time.

Is there something missing? Would love to know if there are better ways to tackle this.

calmofthestorm commented 2 years ago

Out of curiosity, why not skip the JSON and just use org's text representation as your serialization format? Are you modifying the db in another language?

Regardless: Orgize's data structures aren't really a good fit for this use case, being intended for high performance processing. I'd recommend instead creating your own data structures and copying the data into them, then serializing that, to get a decent JSON representation. You may want to consider a list of headlines, where each headline has a unique ID, and points to its parent.
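A minimal sketch of the "list of headlines, each pointing to its parent" idea, using only the standard library. The `HeadlineRow` struct, its fields, and both helper functions are hypothetical names chosen for illustration; none of them come from Orgize.

```rust
use std::collections::HashMap;

/// Hypothetical flat record for one headline (not an Orgize type).
#[derive(Debug, Clone, PartialEq)]
struct HeadlineRow {
    id: String,
    /// `None` for top-level headlines.
    parent_id: Option<String>,
    title: String,
}

/// Group child IDs under their parent ID, preserving input order.
fn children_by_parent(rows: &[HeadlineRow]) -> HashMap<Option<&str>, Vec<&str>> {
    let mut map: HashMap<Option<&str>, Vec<&str>> = HashMap::new();
    for row in rows {
        map.entry(row.parent_id.as_deref())
            .or_default()
            .push(row.id.as_str());
    }
    map
}

/// Depth of a headline: 1 for top-level, parent depth + 1 otherwise.
/// Returns `None` if the ID or one of its ancestors is missing, or if the
/// parent links form a cycle.
fn level(rows: &[HeadlineRow], id: &str) -> Option<usize> {
    let by_id: HashMap<&str, &HeadlineRow> =
        rows.iter().map(|r| (r.id.as_str(), r)).collect();
    let mut current: &HeadlineRow = by_id.get(id).copied()?;
    let mut depth = 1;
    while let Some(parent) = current.parent_id.as_deref() {
        current = by_id.get(parent).copied()?;
        depth += 1;
        if depth > rows.len() {
            return None; // more hops than rows means a cycle
        }
    }
    Some(depth)
}
```

Because each row carries only its own ID and its parent's ID, a row can be stored, updated, or serialized on its own, and the tree (including each headline's star count) can be recomputed when needed.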

In either case, it would be helpful to better understand what you are trying to do, and what tools you are using other than Rust, Orgize, and SQLite (if applicable).

samyak-jain commented 2 years ago

So to give some context around what I'm doing, I'm basically writing a program to do bi-directional sync between Org and CalDav. Every time I make a change to an org file, it will convert it into ICS and then that will automatically trigger vdirsyncer to sync to my caldav server. Every time vdirsyncer pulls any changes, it will trigger the script to read the changed ics file, convert it into org and then overwrite the org file. I am handling conflict resolution by last-updated timestamp for each individual task. The caldav server will have that property for each task already. I am using a database because every time there's a change in the org file, my program will diff between the changed org file and the org representation in the db and find which task needs to have the last-updated timestamp changed. This is just a high-level overview of what I'm trying to accomplish. Of course, this isn't the perfect conflict resolution strategy but I think it is good enough for a start.
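The last-updated rule described above could be sketched roughly as follows. This is an illustration under assumptions, not the actual sync program: `TaskVersion`, its fields, and `merge_last_writer_wins` are hypothetical names, timestamps are plain integers, and ties are resolved in favour of the local copy.

```rust
use std::collections::BTreeMap;

/// Hypothetical snapshot of one task on one side of the sync.
#[derive(Debug, Clone, PartialEq)]
struct TaskVersion {
    uid: String,
    summary: String,
    /// Seconds since some epoch; larger means more recent.
    last_updated: u64,
}

/// Merge two sides, keeping the most recently updated version per UID.
/// On a tie the local version is kept (an arbitrary choice for this sketch).
fn merge_last_writer_wins(local: &[TaskVersion], remote: &[TaskVersion]) -> Vec<TaskVersion> {
    let mut merged: BTreeMap<&str, &TaskVersion> = BTreeMap::new();
    // Local entries are visited first, so `>=` below keeps them on ties.
    for task in local.iter().chain(remote) {
        let keep_existing = merged
            .get(task.uid.as_str())
            .map_or(false, |existing| existing.last_updated >= task.last_updated);
        if !keep_existing {
            merged.insert(task.uid.as_str(), task);
        }
    }
    // BTreeMap iterates in key order, so the result is sorted by UID.
    merged.into_values().cloned().collect()
}
```

Tasks that exist on only one side pass through unchanged; handling deletions would need extra bookkeeping that this sketch leaves out.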

To answer your questions: regarding why I can't use org text as the serialization format, I could. But every time I want to make any modifications, I will have to read the file, convert it to an Org struct, make changes to that (and I'm not sure what the best way to do that is anyway), and then serialise it back again. Using something like JSON is generally 1) easier to manipulate, since it is a very simple structured format, and 2) more performant, since we can create indexes on expressions (https://sqlite.org/expridx.html). Of course, the current problem with this approach is that, since indextree is hard to manipulate, converting the same structure into JSON doesn't really help all that much.

I agree with you that Orgize's data structure isn't suited for this. Creating my own data structure should be straightforward but I'm not sure how I will convert from my structure back to Orgize's indextree. I need this since I want to use Orgize's write_org method to generate the org file that will be overwriting the existing org file that needs to be replaced.

The main part I'm stuck on is how to convert any data structure I create back to the Org struct so that I can eventually convert it back to the org file format.

calmofthestorm commented 2 years ago

Are you representing the entire org file as one row in SQLite? Or is there one row per headline? I'm not sure why one would do the former. Headline per row makes sense, and is something I was mulling over awhile back for better diffs. I'd recommend creating a schema with columns for the properties you want to index on -- database indices won't help you for the JSON blob. You'll probably want a "full text" column, but I still don't follow what the advantage would be to use JSON there instead of just reparsing org. It might be easier to manipulate JSON, but you'll then need logic to transform the JSON back to Org, which you'll need to debug. For example, an invalid date, etc now needs to be handled.

There is one good reason though, which is that since any string is a valid Org file, using the JSON representation could help guard against unintended changes (e.g., a bug where you intend to change a timestamp but end up inserting a new headline into it or something), since you can validate that it's a date rather than just reparsing the whole thing.
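As an illustration of validating a structured value before it is written back, here is a small calendar-date check. The function names are hypothetical, and the year/month/day split is only an assumption about how the date would be stored.

```rust
/// Number of days in the given month, or `None` if the month is not 1..=12.
fn days_in_month(year: u16, month: u8) -> Option<u8> {
    let leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => Some(31),
        4 | 6 | 9 | 11 => Some(30),
        2 => Some(if leap { 29 } else { 28 }),
        _ => None,
    }
}

/// True if year/month/day names a real calendar date.
fn is_valid_date(year: u16, month: u8, day: u8) -> bool {
    days_in_month(year, month).map_or(false, |max| (1..=max).contains(&day))
}
```

A check like this can run on every field before emitting Org text, so a bad edit is rejected instead of silently changing the document's structure.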

It's unusual IME for performance to matter with Org files when using Rust (or even Python) to do the processing. Elisp is very, very slow.

Regardless, it sounds like that's not what you're having trouble with, so to answer your question specifically:

Orgize has APIs that allow modifying it, but they are hard to work with IMO. You can create a new empty Org, then use methods such as append on the Document/Headlines to modify it. A few relevant parts: arena_mut, document append, etc.

I think the current API works well for reading but is hard to use for generating. You might also prefer a hybrid approach where you treat each headline as its own Orgize document, and then handle the newline and * yourself. I have some code along these lines I could look at sharing if it would be helpful, though the editing functionality was one of the last things I wanted to finish before publishing it, and is thus not yet in ideal shape.

If worst comes to worst, Org is a simple format to emit compared to parsing it. Writing your own code to emit your own format and using Orgize only to parse may be a practical, if not ideal, solution, to your problem:/
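A rough sketch of what hand-written emitting might look like for a single task. `Task`, its fields, and `emit_task` are hypothetical; the timestamp is assumed to be pre-formatted, and nothing is escaped, so a body line starting with `*` would be read back as a headline.

```rust
use std::fmt::Write;

/// Hypothetical task record (not an Orgize type).
struct Task {
    level: usize,
    keyword: Option<String>,
    title: String,
    /// Already formatted Org timestamp, e.g. "<2109-11-11 Mon>".
    scheduled: Option<String>,
    properties: Vec<(String, String)>,
    body: String,
}

/// Append one headline, its planning line, property drawer, and body to `out`.
fn emit_task(task: &Task, out: &mut String) {
    // Headline line: stars, optional keyword, title.
    out.push_str(&"*".repeat(task.level.max(1)));
    out.push(' ');
    if let Some(keyword) = &task.keyword {
        out.push_str(keyword);
        out.push(' ');
    }
    out.push_str(&task.title);
    out.push('\n');

    // Planning line comes directly after the headline.
    if let Some(scheduled) = &task.scheduled {
        // Writing to a String cannot fail, so the result is ignored.
        let _ = writeln!(out, "SCHEDULED: {}", scheduled);
    }

    // Property drawer follows the planning line.
    if !task.properties.is_empty() {
        out.push_str(":PROPERTIES:\n");
        for (key, value) in &task.properties {
            let _ = writeln!(out, ":{}: {}", key, value);
        }
        out.push_str(":END:\n");
    }

    if !task.body.is_empty() {
        out.push_str(&task.body);
        if !task.body.ends_with('\n') {
            out.push('\n');
        }
    }
}
```

For a level-1 `TODO` task titled "Other stuff" with one property `FOO: BAR` and a scheduled timestamp, this writes the headline line, then `SCHEDULED: ...`, then the `:PROPERTIES:` drawer closed by `:END:`.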

samyak-jain commented 2 years ago

So, looking at the append method, it seems to take just a headline object, which can be constructed from a title. What about adding other sections? How do I add arbitrary elements (https://docs.rs/orgize/latest/orgize/elements/enum.Element.html)?

Regarding the question about reparsing Org: I am not sure how I would make diffs. I'm not sure if something like that is possible in the indextree representation. If I convert to a different representation, then I'm back to the problem of not being able to convert it back to indextree, so I won't be able to make changes to the org file.

Regarding your question on how I'm representing this in SQL: I tried both ways, but putting everything in one row is what works, because if I don't do that, I wouldn't be able to put it back together, since separating over multiple rows would force me to use a different representation which I can't easily convert back from. I agree that this is really ugly, which is why I wanted to figure out a way to losslessly convert this into a representation I can work with and convert back and forth from. I agree on having certain properties as part of the schema, but using something structured like JSON would be much better because you certainly can index over expressions. It is possible, as you suggested, that performance might not be a huge concern. I don't have any benches or tests. It is pretty early for me to worry too much about it, so I agree that getting this to work should be my priority.

I am not sure I quite understand the hybrid approach you mention. It would be great if you could share the code. It should help me understand where you are coming from.

Yeah, I did consider generating it on my own but I definitely want to consider all possible options before I do that.

calmofthestorm commented 2 years ago

I think you have to manipulate the arena directly via arena_mut and then use the arena's parent/child/etc functions to add arbitrary elements. Not straightforward at all. I haven't done this, FWIW.

I'll try to dig up what I was working on. My intent was in some ways similar to what you're trying to do. I remember that trying to compute diffs was what made me give up on the overall project. In particular, figuring out which headline in a new file corresponds to the same headline in the previous version is hard (though one option would be to give every Org headline an ID property, though this adds a properties drawer for every item, bloating it).

I don't have any answers for the diff thing, that's hard. But I may have a slightly friendlier API for parsing/emitting trees.

calmofthestorm commented 2 years ago

Here's an example:

        let mut org = orgize::Org::default();
        {
            let mut title = orgize::elements::Title::default();
            title.raw = "Other stuff".into();
            title.keyword = Some("TODO".into());
            title.properties.pairs.push(("FOO".into(), "BAR".into()));
            let start = orgize::elements::Datetime {
                year: 2109,
                month: 11,
                day: 11,
                dayname: "Mon".into(),
                hour: None,
                minute: None,
            };
            let scheduled = orgize::elements::Timestamp::Active {
                start,
                repeater: None,
                delay: None,
            };
            title.planning = Some(Box::new(orgize::elements::Planning {
                deadline: None,
                scheduled: Some(scheduled),
                closed: None,
            }));
            let mut headline = orgize::Headline::new(title, &mut org);
            headline.set_level(1, &mut org).unwrap();
            let org_doc = org.document();
            org_doc.append(headline, &mut org).unwrap();
        };

        let mut text = Vec::default();
        org.write_org(&mut text).unwrap();

calmofthestorm commented 2 years ago

I uploaded the code I mentioned: https://github.com/calmofthestorm/starsector

samyak-jain commented 2 years ago

This is really helpful. I'll take a look into this. Thanks! Couple of questions:

  1. I am currently converting the Org indextree into an intermediate representation (something like a Vec<Task>, where Task is a struct that I have defined). This should make it easy to store in a DB like SQLite so that I can later perform a diff. I am planning on attaching a UID to every Task in the DB, as well as adding it to the property drawer in the Org representation (I will need to think of a good way to hide this later on). This should help with the diff as well as the sync with CalDav, since ics requires a UID as well. The trouble for me is embedding relationships into my representation. I believe you previously suggested having each task refer to its parent, so I'm thinking I can just have each task store the parent UID. Right now, to convert from the Org indextree, I'm iterating over all the headlines using the headlines() method. How can I get a reference to the parent node from a particular headline node? More specifically, I want the Title element of the parent headline (since that's where I'm storing the UID, which should be in the PropertiesMap).
  2. I was reading https://github.com/calmofthestorm/starsector/blob/main/examples/edit.rs. So, is arena.parse_str equivalent to Org::parse from the Orgize crate? Does starsector internally use Orgize for parsing? Let's say for each task in the DB, I also store the NodeId from the Arena. Can I just use the same node id from Orgize in the starsector arena as well? Or can I somehow pass the arena from Orgize to starsector?

calmofthestorm commented 2 years ago

If you are willing to give every headline an ID, I think the diff problem gets much easier, since you can simply diff two nodes iff they have the same ID rather than trying to match the tree structure. I really wish Org let you hide that or made this easier somehow.
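A minimal sketch of such an ID-keyed diff, assuming each task has already been copied into a plain struct. `Task`, `Change`, and `diff_by_id` are hypothetical names, not part of Orgize or starsector.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical flattened task, keyed by the ID stored in its property drawer.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    id: String,
    title: String,
    body: String,
}

/// What happened to a task between two snapshots, identified by task ID.
#[derive(Debug, PartialEq)]
enum Change {
    Added(String),
    Removed(String),
    Modified(String),
}

/// Compare two snapshots: tasks are matched only when their IDs are equal,
/// so moving a task within the tree does not affect the matching.
fn diff_by_id(old: &[Task], new: &[Task]) -> Vec<Change> {
    let old_by_id: HashMap<&str, &Task> = old.iter().map(|t| (t.id.as_str(), t)).collect();
    let new_ids: HashSet<&str> = new.iter().map(|t| t.id.as_str()).collect();

    let mut changes = Vec::new();
    for task in new {
        match old_by_id.get(task.id.as_str()) {
            None => changes.push(Change::Added(task.id.clone())),
            Some(prev) if *prev != task => changes.push(Change::Modified(task.id.clone())),
            Some(_) => {} // unchanged
        }
    }
    for task in old {
        if !new_ids.contains(task.id.as_str()) {
            changes.push(Change::Removed(task.id.clone()));
        }
    }
    changes
}
```

The `Modified` entries are the tasks whose last-updated timestamp would need to be bumped.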

If every task has an ID, then yes, having each store the ID of its parent will work well to represent the tree structure. You can get the parent of a node by calling ancestors().skip(1).next() (see indextree docs). That will give you a NodeId, which you can then use with (*org.arena())[parent_id] to get the parent's Element. Unfortunately the only way I know to get a specific headline from Orgize is to iterate over all headlines. There's probably a better way. Here's an example:

fn get_parent_via_ids() {
    let org = orgize::Org::parse("* hello\n** world");
    let mut h = org.headlines();
    let ha = h.next().unwrap();
    let hb = h.next().unwrap();
    assert!(h.next().is_none());

    println!(
        "World node id is {:?} {:?} {:?}",
        hb.headline_node(),
        hb.section_node(),
        hb.title_node()
    );
    let parent = hb
        .headline_node()
        .ancestors(&org.arena())
        .skip(1)
        .next()
        .unwrap();
    println!("parent of {:?} is {:?}", hb.headline_node(), parent);
    println!("parent of {:?} is {:?}", hb.title_node(), parent);
    println!(
        "Children of {:?} are {:?}",
        parent,
        parent.children(org.arena()).collect::<Vec<_>>()
    );

    // Not efficient, but I'm not sure how else to get the headline from its ID.
    for headline in org.headlines() {
        if headline.headline_node() == parent {
            println!("Parent node title id is {:?}", headline.title_node());
            println!("Parent title is {:?}", &headline.title(&org));
            break;
        }
    }

    tree_traversal();
}

You may also consider doing a depth-first traversal of the tree starting at the document. That would let you efficiently get the parent of every headline:

use std::collections::HashMap;

fn tree_traversal() {
    let org = orgize::Org::parse(
        "* Hello\n** World\n*** Text!\n*** More text\n* Other top level node\n** Text beneath it",
    );

    // Map from each headline indextree node id to its parent headline id. Top
    // level headlines are not included.
    let mut parent_map = HashMap::new();

    // Map from each headline indextree node id to the headline.
    let mut headline_map = HashMap::new();

    let mut stack: Vec<orgize::Headline> = org.document().children(&org).collect();
    while let Some(headline) = stack.pop() {
        headline_map.insert(headline.headline_node(), headline);
        for child_headline in headline.children(&org) {
            parent_map.insert(child_headline.headline_node(), headline.headline_node());
            stack.push(child_headline);
        }
    }

    println!("\n\n---\n\n");

    for (id, parent_id) in parent_map.iter() {
        println!("Parent of {} is {}", id, parent_id);
    }

    for (id, headline) in headline_map.iter() {
        println!("id {}: {}", id, &headline.title(&org).raw);
    }
}

Note that in the above, "id" is referring to the NodeId from indextree, and has nothing to do with the Org mode ID in the PROPERTIES drawer.
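If only each headline's level and its position in document order are available (for example, after copying headlines into your own structs), the parent links can also be recovered with a stack, because a headline's parent is the nearest preceding headline with a smaller level. This is a standard-library sketch with a hypothetical function name, and it does not call Orgize:

```rust
/// Given (level, uid) pairs in document order, return (uid, parent_uid) pairs.
/// Top-level headlines get `None` as their parent.
fn parents_from_levels<'a>(headlines: &[(usize, &'a str)]) -> Vec<(&'a str, Option<&'a str>)> {
    // Stack of open ancestors: levels strictly increase from bottom to top.
    let mut stack: Vec<(usize, &'a str)> = Vec::new();
    let mut out = Vec::with_capacity(headlines.len());
    for &(level, uid) in headlines {
        // Drop every headline that is not shallower than the current one.
        while let Some(&(top_level, _)) = stack.last() {
            if top_level >= level {
                stack.pop();
            } else {
                break;
            }
        }
        out.push((uid, stack.last().map(|&(_, parent)| parent)));
        stack.push((level, uid));
    }
    out
}
```

Each headline is pushed and popped at most once, so the pass is linear in the number of headlines.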

Regarding your second question: starsector has its own parser for the structure of the document and the headline line with the stars itself. It does not require Orgize, and uses its own data structures. However, Orgize is an optional dependency. If enabled, it can be used to parse the PROPERTIES drawer and planning line (SCHEDULED, DEADLINE, etc).

starsector and orgize both use indextree, but the IDs cannot be used as you describe. In fact, when using orgize, starsector only parses one headline per orgize::Org.

samyak-jain commented 2 years ago

Thanks! This makes sense. Just to confirm, I'll presumably have to do some filtering, right? I'm guessing that the children of the document can be things other than a headline, and that the headline itself can have children that are not headlines, such as sections. If that's correct, it should be fairly simple to pattern match and get that working. One thing I'm unsure about: is the parent of a headline always a headline (not including the top-level ones, of course)? Is it possible for Orgize to give a tree where the parent of a headline is some arbitrary element? If that's the case, is such a headline returned from the headlines() method? And to be able to fetch those headlines, will I have to do another pass through the tree and assign them an empty parent like I do for the top-level headlines? If the headlines() method does not return such headlines, parsing them is going to be tricky, I think.

calmofthestorm commented 2 years ago

Document::children and Headline::children return an iterator of Headline, so I don't think you'll need to filter if you use them.

indextree::NodeId::children and indextree::NodeId::ancestors return all nodes, so I think if you used them, you would find that each headline can have a title child, a section child, and zero or more headline children of its own. Even here, I think that each non-top-level headline's parent is a headline (but headlines have non-headline children).

Accessing the arena directly is sometimes necessary, but it's kind of an implementation detail of Orgize. If you stick to the children methods on Document and Headline you should be fine.

samyak-jain commented 2 years ago

Gotcha. I also have a question around the content of each section_node of a headline. Is it possible to get the content as the original raw string instead of manually going down the tree? Also, is there any difference between the OrgHandler and the HTMLHandler traits? I'm planning on writing a markdown export so was wondering which one to leverage.

calmofthestorm commented 2 years ago

I don't believe you can get the original raw string with Orgize. I'm not sure how the traits work -- never used anything other than writing it out to plain text.