Byron / pulldown-cmark-to-cmark

Convert pulldown-cmark Events back to the string they were parsed from
https://docs.rs/crate/pulldown-cmark-to-cmark
Apache License 2.0
43 stars 37 forks source link

Is it possible to write back code block backtick count to original? #20

Open yumetodo opened 4 years ago

yumetodo commented 4 years ago

After #17, #18, code block backtick count is configurable. However, there is another problem.

Now let's consider such senario like below:

ex.) https://github.com/yumetodo/markdown_img_url_editor_rust/blob/master/src/lib.rs#L74

  1. Parse markdown file with pulldown-cmark
  2. Edit somewhare.
  3. Write back to markdown using pulldown-cmark-to-cmark

In this case, there is nothing to emulate block backtick count.

As a result, currently, when just bypass pulldown-cmark to pulldown-cmark-to-cmark, input and output is not equal.

Byron commented 4 years ago

I think this is closely related to #16 - maybe you can join the people who start talking to pulldown-cmark to make not degenerating information a priority. Otherwise there is no way to tell how many backticks were parsed.

yumetodo commented 4 years ago

@Byron Do you mean that this is pulldown-cmark's matter? What information is lacked after passing the pulldown-cmark event system to resolve this issue? Without the information, I cannot create an issue at the pulldown-cmark.

Byron commented 4 years ago

This means you can try for yourself how pulldown-cmark maintains information about backticks. If it does indeed not maintain any, then there is nothing we can do here.

yumetodo commented 4 years ago

I use a5f644a . I understand that the Fenced event doesn't provide backtick count.

(BTW, why the result of cargo run --example stupicat -- sample.md | pulldown-cmark is broken? Where is the last ````?

$git clone https://github.com/Byron/pulldown-cmark-to-cmark.git
Cloning into 'pulldown-cmark-to-cmark'...

remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 347 (delta 2), reused 7 (delta 2), pack-reused 336
Receiving objects: 100% (347/347), 86.72 KiB | 275.00 KiB/s, done.
Resolving deltas: 100% (200/200), done.

$cd markdown_img_url_editor_rust/

$cargo install pulldown-cmark
    Updating crates.io index
  Downloaded pulldown-cmark v0.7.2
  Downloaded 1 crate (102.7 KB) in 1.21s
  Installing pulldown-cmark v0.7.2
  Downloaded unicase v2.6.0
  Downloaded memchr v2.3.3
  Downloaded unicode-width v0.1.8
  Downloaded bitflags v1.2.1
  Downloaded version_check v0.9.2
  Downloaded getopts v0.2.21
  Downloaded 6 crates (110.0 KB) in 0.97s
   Compiling version_check v0.9.2
   Compiling bitflags v1.2.1
   Compiling memchr v2.3.3
   Compiling pulldown-cmark v0.7.2
   Compiling unicode-width v0.1.8

   Compiling getopts v0.2.21
   Compiling unicase v2.6.0
    Finished release [optimized] target(s) in 33.27s
  Installing /home/yumetodo/.cargo/bin/pulldown-cmark
   Installed package `pulldown-cmark v0.7.2` (executable `pulldown-cmark`)

$cat sample.md
`````markdown
````markdown
```typescript
console.log("arikitari na sekai");

$cat sample.md | pulldown-cmark -e 0..90: Start(CodeBlock(Fenced(Borrowed("markdown")))) 14..85: Text(Borrowed("markdown\n```typescript\nconsole.log(\"arikitari na sekai\");\n```\n\n")) 0..90: End(CodeBlock(Fenced(Borrowed("markdown")))) EOF

$cargo run --example stupicat -- sample.md | pulldown-cmark Finished dev [unoptimized + debuginfo] target(s) in 0.01s Running target/debug/examples/stupicat sample.md

````markdown
```typescript
console.log("arikitari na sekai");
```
Byron commented 4 years ago

(BTW, why the result of cargo run --example stupicat -- sample.md | pulldown-cmark is broken? Where is the last ````?

I don't know either, but I don't entirely know what the examples are supposed to show. That is, I don't understand what's expected, and what you think should be done.

As this issue is about backtick counts and it's clear that pulldown-cmark doesn't keep them, I would conclude there is nothing we can do here.

If the example above indicates another bug in pulldown-cmark-to-cmark, I think this should move to a new issue if there is interest in getting it fixed.

yumetodo commented 4 years ago

OK. I created raphlinus/pulldown-cmark#461 so that I will wait that responce.

yumetodo commented 4 years ago

pulldown-cmark's collaborator says that count it by yourself using Event::Start's range information.

if let (Event::Start(Tag::CodeBlock(CodeBlockKind::Fenced(..))), range) = event {
    let num_backticks = source_text[range].chars().take_while(|c| c == '`').count();
}
Byron commented 4 years ago

Fantastic. With this it is certainly possible to make the requested change. I think it should be quite straightforward.

yumetodo commented 4 years ago

Of course, they are not always equal between begin count and end count of code block backtick like below. And also, tilde(~) is allowed as a code block fence char.

```text
aaa
lambda-fairy commented 3 years ago

Won't this require changing the API? The cmark function currently has no access to the original source text.

Byron commented 3 years ago

Good point! I think it could be done by optionally passing the input buffer as part of the Options. There using the range information should be a breeze.

yumetodo commented 3 years ago

In that case, it's better to get the diff between the event stream created from the original input text and the passed event stream...

Byron commented 3 years ago

That sounds like a post-process entirely unrelated to this crate, or am I missing something? Maybe knowing more about this would help to understand why you would discard a proposal which to my mind would allow using ranges as suggested by a pulldown-cmark maintainer.

yumetodo commented 3 years ago

I'm concern that the index information in the range is valid when the passed event stream is modified from the original. pulldown-cmark maintainer's suggestion code requests source text that corresponds one-to-one. That is what @lambda-fairy pointed out, as far as I understand.

lambda-fairy commented 3 years ago

Have we considered dynamically setting the number of backticks based on the contents of the code block? That will require buffering those contents, but will avoid the need to pass in source text.

The two main use cases I am thinking of are (a) mdbook preprocessors, and (b) Markdown code formatters. (I'm interested in the latter.) In either case there is no need to preserve the original input exactly – only to ensure that the result parses in the same way.

My concern with the proposed solution is that the type system doesn't enforce that the given text and event stream correspond to each other. It's too easy to either mess up the ranges in the event stream, or pass in an unrelated string, both of which can lead to subtle bugs.

Byron commented 3 years ago

@yumetodo

I'm concern that the index information in the range is valid when the passed event stream is modified from the original. pulldown-cmark maintainer's suggestion code requests source text that corresponds one-to-one.

@lambda-fairy

My concern with the proposed solution is that the type system doesn't enforce that the given text and event stream correspond to each other. It's too easy to either mess up the ranges in the event stream, or pass in an unrelated string, both of which can lead to subtle bugs.

Absolutely, the provided input buffer must match what's actually parsed, and it's something the programmer has to assure. That said, I would entrust then to be able to do that or notice very quickly that something is wrong. In my understanding this crate should do as much if not all of the 'heavy lifting', but I can only acknowledge that it might be a different story when actually implementing it.

Have we considered dynamically setting the number of backticks based on the contents of the code block? That will require buffering those contents, but will avoid the need to pass in source text.

This sounds like a heuristic to me that is based on the observed code block? Maybe this could be elaborated so we are on the same page.

Just to sum up what I got so far as options to solve it this issue:

  1. pass in a source buffer and use the provided ranges to read the portions of the source we are interested in
  2. diff source and output, or more generally, apply a post process (probably to be implemented in a different crate, can be part of this repo though)
  3. Set the number of backticks based on the contents of the code block (needs elaboration)

Something I would like to highlight in parting is a comment of @lambda-fairy as it helps me to see the value behind this issue:

In either case there is no need to preserve the original input exactly – only to ensure that the result parses in the same way.

This is something that ideally is easy because the input events are made so that they don't degenerate information, but it turns out that's not a goal of the input parser. This crate could be the remedy and I would love to find a general purpose solution to this issue. To me, option 1 is able to do it for sure and provides all information needed to apply more fixes in future, too.

I would hope we could get to a point where any of the options really is implemented as proof of concept so we can eliminate those who might not work in practice and make progress towards resolving this issue.

yumetodo commented 3 years ago

When I first created this issue, I was thinking of a scenario where pulldown-cmark would parse it, edit some of it, and pulldown-cmark-to-cmark would convert it back to markdown(https://github.com/yumetodo/markdown_img_url_editor/issues/40 (Written in Japanese)).

However, now I got it that is too difficult to implement. For such a use case, we should use Parser::into_offset_iter() and not by this issue.

Now I agree to reduce the scope of this issue to the two main use cases @lambda-fairy says.

Byron commented 3 years ago

I have created a sketch, see #22, to show how this could work using the 'Range and buffer' method. This is the only way I would know how to do what's needed for this issue, and in future.

Maybe those who have other ideas can sketch them and submit PRs so we all know what we are talking about.

I am particularly interested in learning what @yumetodo thinks about the PR (please feel free to comment there) as this supports exactly what they have done.

willcrichton commented 2 years ago

As another instance of this issue, I am writing an mdBook preprocessor using pulldown-cmark-to-cmark for rust-lang/book. The Rust book uses double-backticks for inline code to escape single-backticks, such as:

here we embed ``an error `message` <- with a backtick``

Which correctly renders as:

here we embed an error `message` <- with a backtick

However, this package will replace the double-backtick with a single backtick, producing:

here we embed `an error `message` <- with a backtick`

Which incorrectly renders as:

here we embed an errormessage<- with a backtick

willcrichton commented 2 years ago

Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.

Byron commented 2 years ago

Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.

That's fantastic, thanks a lot!

I pulled the changes and saw a test was added for validation, and besides that, the spec-test is up 2 tests as well to 425/649 passed! I'd think that's a very measurable improvement worth a new release in case it helps.

mgeisler commented 9 months ago

Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.

Hi @willcrichton, I believe the case of inline code with backticks has been fixed by #56.

As a result, currently, when just bypass pulldown-cmark to pulldown-cmark-to-cmark, input and output is not equal.

@yumetodo, I'm just another user of the library — so this is just my opinion :smile:

I would not expect the output to be equal after parsing and serializing back to Markdown since the Markdown AST might not preserve all information. However, I would expect it to be equal if you parse it again and serialize to Markdown.

I did some experiments on this earlier this year in #55. That lead me to find a few corner cases in this library and also in pulldown-cmark.

dalance commented 8 months ago
  1. Set the number of backticks based on the contents of the code block (needs elaboration)

I'm interesting in this approach. In https://github.com/google/mdbook-i18n-helpers/pull/129, I wrote the following code to count backticks in code block:

```rust fn check_code_block_token_count(group: &[(usize, Event)]) -> usize { let events = group.iter().map(|(_, event)| event); let mut in_codeblock = false; let mut max_token_count = 0; // token_count should be taken over Text events // because a continuous text may be splitted to some Text events. let mut token_count = 0; for event in events { match event { Event::Start(Tag::CodeBlock(_)) => { in_codeblock = true; token_count = 0; } Event::End(Tag::CodeBlock(_)) => { in_codeblock = false; token_count = 0; } Event::Text(x) if in_codeblock => { for c in x.chars() { if c == '`' { token_count += 1; } else { max_token_count = std::cmp::max(max_token_count, token_count); token_count = 0; } } max_token_count = std::cmp::max(max_token_count, token_count); } _ => token_count = 0, } } if max_token_count < 3 { // default code block token is "```" which is 3 3 } else { // If there is "```" in codeblock, codeblock token should be extended. max_token_count + 1 } } ```

I think a code like the above may be useful if it is placed in this crate. For example, the following designs can be considered.

let code_block_token_count = pulldown_cmark_to_cmark::code_block_token_count(events);
let options = Options {
    code_block_token_count,
    ..Options::default()
};
pulldown_cmark_to_cmark::cmark_resume_with_options(events, &mut markdown, state, options);
let options = Options {
    code_block_token_count: None, // "None" means the appropriate count is calcurated
                                  // in cmark_resume_with_options
    ..Options::default()
};
pulldown_cmark_to_cmark::cmark_resume_with_options(events, &mut markdown, state, options);

How about these ideas?

Byron commented 8 months ago

Thanks for the writeup and for sharing!

The way I understand this is that the count of backticks depends on the nesting level of code blocks? If so, the proposed code also wouldn't reproduce the exact count of backticks as there can be different nesting levels in the provided events?

In any case, if you think having more support for this would help here, even if its scope is limited, then adding it to the crate should be useful. However, I think it should be done in a backward compatible way that also doesn't affect performance. Thus is sounds like option 1 (freestanding function) would be the way to go.

dalance commented 8 months ago

The way I understand this is that the count of backticks depends on the nesting level of code blocks? If so, the proposed code also wouldn't reproduce the exact count of backticks as there can be different nesting levels in the provided events?

In my understanding, fenced code block can't be nested. (Refer: https://spec.commonmark.org/0.30/#fenced-code-blocks)

For example, I think the following code block is not a nested code block but a single code block which have "````markdown\n```python\n```\n````" as a plane text. Therefore, an appropriate token count can be judged by counting continuous backticks in the plane text.

`````markdown
````markdown
```python

My goal is not "reproducing the original backtick count" but "generating a valid markdown".
My code counts the longest backticks in the events, so some of the generated backticks may be too long, but it is a valid markdown.

* Original markdown

``````markdown

`````markdown
````markdown
```python
```rust

```c

* Generated markdown by `cmark` and `code_block_token_count` function (In this case, `code_block_token_count` will be 5)

``````markdown

`````markdown
````markdown
```python
```rust

`````c
Byron commented 8 months ago

Thanks for the clarification and the examples, I think we are on the same page now.

Then it's a definitive "Yes," please submit a PR with a new free function that helps to set the correct backtick count based on a pre-scan of all known events.

The code_block_token_count (which is confusingly named given that it is about backticks in code-blocks) documentation could also be adjusted to refer to the new function, whose own documentation could make clear what it does (maybe even by reusing some of the explanation used in this thread).

dalance commented 8 months ago

Thank you for your suggestion. I've opened #65.

dalance commented 8 months ago

@Byron Sorry. I missed the problem of API design. The API of #65 doesn't have the way to specify default backticks count. In the current impl, it refers the hard-coded DEFAULT_CODE_BLOCK_TOKEN_COUNT.

So even If we want to set backtick count to 3, count_code_block_tokens surely returns 4 or more than 4. In other word, when the function returns 4, we can't identify the following two cases.

I think two solutions, but both are breaking change.

let code_block_token_count = count_code_block_tokens(events, 3);
let code_block_token_count = count_code_block_tokens(events).unwrap_or(3);
dalance commented 8 months ago

Alternatively the default value can be changed to 3. 3 is the minimum value allowed in spec.

lf user want to use 4, the follow code is available.

let code_block_token_count = count_code_block_tokens(events);
let code_block_token_count = std::cmp::max(code_block_token_count, 4);

This is not breaking change.

Byron commented 8 months ago

Thanks for bringing this up, and sorry for not catching it. I already yanked the previous release so the fix isn't breaking.

My preference is to go with Option<usize> as return value, it seems most idiomatic and flexible.

Thanks again for your help.

dalance commented 8 months ago

Thank you. I've opened #66.