Byron / pulldown-cmark-to-cmark

Convert pulldown-cmark Events back to the string they were parsed from
https://docs.rs/crate/pulldown-cmark-to-cmark
Apache License 2.0

Surprising result from tables #44

Closed max-sixty closed 2 years ago

max-sixty commented 2 years ago

Thanks for the excellent crate!

I'm seeing that tables are getting a leading \. I may well be doing something wrong!


fn no_op(text: &str) -> Result<String> {
    dbg!(&text.chars());
    let mut parser = Parser::new(text);
    let mut cmark_acc = vec![];

    while let Some(event) = parser.next() {
        dbg!(&event);
        cmark_acc.push(event.to_owned());
    }
    let mut buf = String::new();
    cmark(cmark_acc.iter(), &mut buf)?;

    Ok(buf)
}

#[test]
fn test_table() -> Result<()> {
    let table = r###"
# Syntax

| a |
|---|
| c |

"###;

    assert_display_snapshot!(no_op(table)?, @r###"
    # Syntax

    \| a |
    \|---|
    \| c |

    "###);

    Ok(())
}

Byron commented 2 years ago

That's interesting. Maybe you could try adding a test to the existing test suite to see if it reproduces? It doesn't seem to make sense to add a leading backslash to the output.

max-sixty commented 2 years ago

I would like to come back to this. For the moment, I've fixed it with https://github.com/prql/prql/pull/515/files#diff-2d03f5521b73d9624021e9a38cbc73926f522fbb446d3e226be9206c53d1aa48R129, but I realize that doesn't benefit upstream.

Byron commented 2 years ago

It turns out that there is a hidden whitespace character that prevents the whole table from being parsed as a table. Hence the serializer escapes the special characters to be sure they will be treated as text.

Here is what I can copy-paste:

| a |
 |---|
 | c |

Note the misalignment. Once the whitespace issue is fixed, it is parsed as a table and you get the expected results.

max-sixty commented 2 years ago

OK, mea culpa if that's the case. But how do you see that? If I run dbg!(text.chars()) (edited above), I don't see any hidden character:

&text.chars() = Chars([
    '\n',
    '#',
    ' ',
    'S',
    'y',
    'n',
    't',
    'a',
    'x',
    '\n',
    '\n',
    '|',
    ' ',
    'a',
    ' ',
    '|',
    '\n',
    '|',
    '-',
    '-',
    '-',
    '|',
    '\n',
    '|',
    ' ',
    'c',
    ' ',
    '|',
    '\n',
    '\n',
])

Byron commented 2 years ago

It's absolutely possible that copying from the browser adds some strange whitespace that wasn't there before, so certainly mea culpa too :D. I think, however, that this is what's happening here: a misalignment that prevents it from being parsed as a table.

Cleaning up the above looks good, though:

# Syntax

| a |
|---|
| c |

If I copy the above and run it through the serializer, it works as expected.

➜  pulldown-cmark-to-cmark git:(main) ✗ cargo run --example stupicat -- <(pbpaste)
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/examples/stupicat /dev/fd/11`
# Syntax

|a|
|-|
|c|%

Note how it trims the whitespace. This means there must be something happening to the input on your end that misaligns something.

Try this input

# Syntax

| a |
 |---|
| c |

And one sees something familiar:

➜  pulldown-cmark-to-cmark git:(main) ✗ cargo run --example stupicat -- <(pbpaste)
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/examples/stupicat /dev/fd/11`
# Syntax

\| a |
\|---|
\| c |%

max-sixty commented 2 years ago

I added the test — it does seem to fail — is there a hidden character there?

Byron commented 2 years ago

It turns out that the parser starts out without any options, which means table parsing is not enabled. That's an interesting choice, and I also wasn't aware of it anymore.

So instantiating it like so will parse tables: Parser::new_ext(input, Options::all()).

max-sixty commented 2 years ago

That is a very surprising default. Thanks for finding the cause @Byron !