\r\n prevents further parsing; potential for substantial data loss

calmofthestorm commented 4 years ago

Found another weird case. Strange line endings prevent Orgize from parsing anything after them. If this occurs early in the org file, it can lead to substantial data loss.

fn main() {
    // According to the spec, this is two headlines.
    // org-element parses it as two headlines.
    // Per discussion on https://github.com/PoiScript/orgize/issues/19 I could understand parsing it as three headlines instead.
    // Orgize currently parses it as 1 headline, which is clearly wrong.
    let org = orgize::Org::parse("* \n*\r\n* \n");
    let headline_count = org.headlines().count();
    println!("Orgize parsed {} headlines.", headline_count);
    assert!(headline_count == 2 || headline_count == 3);

    // Orgize seems unable to parse any further headlines after that one.
    let org = orgize::Org::parse("* \n*\r\n* A\n* B\n* C\n* D\n* E\n* F\n* G\n* H\n* I\n* J\n* K\n");
    let headline_count = org.headlines().count();
    println!("Orgize parsed {} headlines.", headline_count);
    assert!(headline_count == 12 || headline_count == 13);

    // If written out, this causes data loss.
    let mut fd = iobuffer::IoBuffer::new();
    let mut s = String::default();
    org.write_org(&mut fd).unwrap();
    fd.read_to_string(&mut s).unwrap();

    // Output: After parse -> emit: "* \n"
    println!("After parse -> emit: {:?}", &s);
}

calmofthestorm commented 4 years ago

I found the problem, but I'm not sure how to fix it. parse_headline takes one headline, and then continues to take lines until it encounters a headline not below its level. parse_headline_level will return that the "*\n" is a headline of level 1, so it won't be absorbed into the previous headline's body. But when we then go to parse that headline, there are remaining characters on the line, and so it is not recognized as a headline.

Thus, it is interpreted as not part of the body when building the previous, yet not a new headline when building the next.

Do you remember why this is operating line by line like this instead of using nom to handle headline parsing? Is it a performance issue? This seems like it could have many edge cases that would be tricky to get right. If it's hard to parse recursively with nom, it would be possible to parse the document into flat headlines in one pass and then build the recursive tree structure after the fact.

I did come up with some more interesting cases that should be considered for a solution here. I'm not confident with the assertions I chose, as many of these could be interpreted a few ways.

    assert_eq!(0, Org::parse("").headlines().count());

    for input in &["*", "* a", "* a\n", "*\r* Hello", "*\r\n", "* a\r", "* a\r\n"] {
        println!("{:?}", input);
        assert_eq!(1, Org::parse(input).headlines().count());
    }

    for input in &[
        "* a\n*",
        "* a\r\n*",
        "* a\r\n* b",
        "* a\n* ",
        "* a\n* \n",
        "* \n*\r\n* \n",
    ] {
        assert_eq!(2, Org::parse(input).headlines().count());
    }

PoiScript commented 7 months ago

I could understand parsing it as three headlines instead.

Org syntax now specifies that, in headline, asterisks character can only be followed by space character. So parsing "* \n*\r\n* \n" will returns two headlines:

DOCUMENT@0..9
  HEADLINE@0..6
    HEADLINE_STARS@0..1 "*"
    WHITESPACE@1..2 " "
    NEW_LINE@2..3 "\n"
    SECTION@3..6
      PARAGRAPH@3..6
        TEXT@3..6 "*\r\n"
  HEADLINE@6..9
    HEADLINE_STARS@6..7 "*"
    WHITESPACE@7..8 " "
    NEW_LINE@8..9 "\n"

If written out, this causes data loss.

In v0.10, the parser are 100% lossless. we have tons of debug_assert! to ensure that.

Do you remember why this is operating line by line like this instead of using nom to handle headline parsing? Is it a performance issue?

Yes, it's mostly for performance issue. When we know exactly where the headline will end, we can avoid throwing error from parser, which causes nom to reparse everything.

Closing this issue since all problems should be resolved in v0.10.

PoiScript / orgize

\r\n prevents further parsing; potential for substantial data loss #26