Open yumetodo opened 4 years ago
I think this is closely related to #16 - maybe you can join the people who start talking to pulldown-cmark
to make not degenerating information a priority.
Otherwise there is no way to tell how many backticks were parsed.
@Byron Do you mean that this is pulldown-cmark's matter? What information is lacked after passing the pulldown-cmark event system to resolve this issue? Without the information, I cannot create an issue at the pulldown-cmark.
This means you can try for yourself how pulldown-cmark maintains information about backticks. If it does indeed not maintain any, then there is nothing we can do here.
I use a5f644a . I understand that the Fenced
event doesn't provide backtick count.
(BTW, why the result of cargo run --example stupicat -- sample.md | pulldown-cmark
is broken? Where is the last ````
?
$git clone https://github.com/Byron/pulldown-cmark-to-cmark.git
Cloning into 'pulldown-cmark-to-cmark'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 347 (delta 2), reused 7 (delta 2), pack-reused 336
Receiving objects: 100% (347/347), 86.72 KiB | 275.00 KiB/s, done.
Resolving deltas: 100% (200/200), done.
$cd markdown_img_url_editor_rust/
$cargo install pulldown-cmark
Updating crates.io index
Downloaded pulldown-cmark v0.7.2
Downloaded 1 crate (102.7 KB) in 1.21s
Installing pulldown-cmark v0.7.2
Downloaded unicase v2.6.0
Downloaded memchr v2.3.3
Downloaded unicode-width v0.1.8
Downloaded bitflags v1.2.1
Downloaded version_check v0.9.2
Downloaded getopts v0.2.21
Downloaded 6 crates (110.0 KB) in 0.97s
Compiling version_check v0.9.2
Compiling bitflags v1.2.1
Compiling memchr v2.3.3
Compiling pulldown-cmark v0.7.2
Compiling unicode-width v0.1.8
Compiling getopts v0.2.21
Compiling unicase v2.6.0
Finished release [optimized] target(s) in 33.27s
Installing /home/yumetodo/.cargo/bin/pulldown-cmark
Installed package `pulldown-cmark v0.7.2` (executable `pulldown-cmark`)
$cat sample.md
`````markdown
````markdown
```typescript
console.log("arikitari na sekai");
$cat sample.md | pulldown-cmark -e
0..90: Start(CodeBlock(Fenced(Borrowed("markdown"))))
14..85: Text(Borrowed("markdown\n```typescript\nconsole.log(\"arikitari na sekai\");\n```\n
\n"))
0..90: End(CodeBlock(Fenced(Borrowed("markdown"))))
EOF
$cargo run --example stupicat -- sample.md | pulldown-cmark
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running target/debug/examples/stupicat sample.md
````markdown
```typescript
console.log("arikitari na sekai");
```
(BTW, why the result of cargo run --example stupicat -- sample.md | pulldown-cmark is broken? Where is the last ````?
I don't know either, but I don't entirely know what the examples are supposed to show. That is, I don't understand what's expected, and what you think should be done.
As this issue is about backtick counts and it's clear that pulldown-cmark
doesn't keep them, I would conclude there is nothing we can do here.
If the example above indicates another bug in pulldown-cmark-to-cmark
, I think this should move to a new issue if there is interest in getting it fixed.
OK. I created raphlinus/pulldown-cmark#461 so that I will wait that responce.
pulldown-cmark's collaborator says that count it by yourself using Event::Start
's range information.
if let (Event::Start(Tag::CodeBlock(CodeBlockKind::Fenced(..))), range) = event { let num_backticks = source_text[range].chars().take_while(|c| c == '`').count(); }
Fantastic. With this it is certainly possible to make the requested change. I think it should be quite straightforward.
Of course, they are not always equal between begin count and end count of code block backtick like below. And also, tilde(~
) is allowed as a code block fence char.
```text
aaa
Won't this require changing the API? The cmark
function currently has no access to the original source text.
Good point! I think it could be done by optionally passing the input buffer as part of the Options. There using the range information should be a breeze.
In that case, it's better to get the diff between the event stream created from the original input text and the passed event stream...
That sounds like a post-process entirely unrelated to this crate, or am I missing something?
Maybe knowing more about this would help to understand why you would discard a proposal which to my mind would allow using ranges as suggested by a pulldown-cmark
maintainer.
I'm concern that the index information in the range is valid when the passed event stream is modified from the original. pulldown-cmark maintainer's suggestion code requests source text that corresponds one-to-one. That is what @lambda-fairy pointed out, as far as I understand.
Have we considered dynamically setting the number of backticks based on the contents of the code block? That will require buffering those contents, but will avoid the need to pass in source text.
The two main use cases I am thinking of are (a) mdbook preprocessors, and (b) Markdown code formatters. (I'm interested in the latter.) In either case there is no need to preserve the original input exactly – only to ensure that the result parses in the same way.
My concern with the proposed solution is that the type system doesn't enforce that the given text and event stream correspond to each other. It's too easy to either mess up the ranges in the event stream, or pass in an unrelated string, both of which can lead to subtle bugs.
@yumetodo
I'm concern that the index information in the range is valid when the passed event stream is modified from the original. pulldown-cmark maintainer's suggestion code requests source text that corresponds one-to-one.
@lambda-fairy
My concern with the proposed solution is that the type system doesn't enforce that the given text and event stream correspond to each other. It's too easy to either mess up the ranges in the event stream, or pass in an unrelated string, both of which can lead to subtle bugs.
Absolutely, the provided input buffer must match what's actually parsed, and it's something the programmer has to assure. That said, I would entrust then to be able to do that or notice very quickly that something is wrong. In my understanding this crate should do as much if not all of the 'heavy lifting', but I can only acknowledge that it might be a different story when actually implementing it.
Have we considered dynamically setting the number of backticks based on the contents of the code block? That will require buffering those contents, but will avoid the need to pass in source text.
This sounds like a heuristic to me that is based on the observed code block? Maybe this could be elaborated so we are on the same page.
Just to sum up what I got so far as options to solve it this issue:
Something I would like to highlight in parting is a comment of @lambda-fairy as it helps me to see the value behind this issue:
In either case there is no need to preserve the original input exactly – only to ensure that the result parses in the same way.
This is something that ideally is easy because the input events are made so that they don't degenerate information, but it turns out that's not a goal of the input parser. This crate could be the remedy and I would love to find a general purpose solution to this issue. To me, option 1 is able to do it for sure and provides all information needed to apply more fixes in future, too.
I would hope we could get to a point where any of the options really is implemented as proof of concept so we can eliminate those who might not work in practice and make progress towards resolving this issue.
When I first created this issue, I was thinking of a scenario where pulldown-cmark would parse it, edit some of it, and pulldown-cmark-to-cmark would convert it back to markdown(https://github.com/yumetodo/markdown_img_url_editor/issues/40 (Written in Japanese)).
However, now I got it that is too difficult to implement. For such a use case, we should use Parser::into_offset_iter()
and not by this issue.
Now I agree to reduce the scope of this issue to the two main use cases @lambda-fairy says.
I have created a sketch, see #22, to show how this could work using the 'Range and buffer' method. This is the only way I would know how to do what's needed for this issue, and in future.
Maybe those who have other ideas can sketch them and submit PRs so we all know what we are talking about.
I am particularly interested in learning what @yumetodo thinks about the PR (please feel free to comment there) as this supports exactly what they have done.
As another instance of this issue, I am writing an mdBook preprocessor using pulldown-cmark-to-cmark
for rust-lang/book. The Rust book uses double-backticks for inline code to escape single-backticks, such as:
here we embed ``an error `message` <- with a backtick``
Which correctly renders as:
here we embed
an error `message` <- with a backtick
However, this package will replace the double-backtick with a single backtick, producing:
here we embed `an error `message` <- with a backtick`
Which incorrectly renders as:
here we embed
an error
message<- with a backtick
Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.
Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.
That's fantastic, thanks a lot!
I pulled the changes and saw a test was added for validation, and besides that, the spec-test is up 2 tests as well to 425/649 passed! I'd think that's a very measurable improvement worth a new release in case it helps.
Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.
Hi @willcrichton, I believe the case of inline code with backticks has been fixed by #56.
As a result, currently, when just bypass pulldown-cmark to pulldown-cmark-to-cmark, input and output is not equal.
@yumetodo, I'm just another user of the library — so this is just my opinion :smile:
I would not expect the output to be equal after parsing and serializing back to Markdown since the Markdown AST might not preserve all information. However, I would expect it to be equal if you parse it again and serialize to Markdown.
I did some experiments on this earlier this year in #55. That lead me to find a few corner cases in this library and also in pulldown-cmark.
- Set the number of backticks based on the contents of the code block (needs elaboration)
I'm interesting in this approach. In https://github.com/google/mdbook-i18n-helpers/pull/129, I wrote the following code to count backticks in code block:
I think a code like the above may be useful if it is placed in this crate. For example, the following designs can be considered.
code_block_token_count
as a free standing function which can be used like belowlet code_block_token_count = pulldown_cmark_to_cmark::code_block_token_count(events);
let options = Options {
code_block_token_count,
..Options::default()
};
pulldown_cmark_to_cmark::cmark_resume_with_options(events, &mut markdown, state, options);
cmark*
functionlet options = Options {
code_block_token_count: None, // "None" means the appropriate count is calcurated
// in cmark_resume_with_options
..Options::default()
};
pulldown_cmark_to_cmark::cmark_resume_with_options(events, &mut markdown, state, options);
How about these ideas?
Thanks for the writeup and for sharing!
The way I understand this is that the count of backticks depends on the nesting level of code blocks? If so, the proposed code also wouldn't reproduce the exact count of backticks as there can be different nesting levels in the provided events?
In any case, if you think having more support for this would help here, even if its scope is limited, then adding it to the crate should be useful. However, I think it should be done in a backward compatible way that also doesn't affect performance. Thus is sounds like option 1 (freestanding function) would be the way to go.
The way I understand this is that the count of backticks depends on the nesting level of code blocks? If so, the proposed code also wouldn't reproduce the exact count of backticks as there can be different nesting levels in the provided events?
In my understanding, fenced code block can't be nested. (Refer: https://spec.commonmark.org/0.30/#fenced-code-blocks)
For example, I think the following code block is not a nested code block but a single code block which have "````markdown\n```python\n```\n````"
as a plane text.
Therefore, an appropriate token count can be judged by counting continuous backticks in the plane text.
`````markdown
````markdown
```python
My goal is not "reproducing the original backtick count" but "generating a valid markdown".
My code counts the longest backticks in the events, so some of the generated backticks may be too long, but it is a valid markdown.
* Original markdown
``````markdown
`````markdown
````markdown
```python
```rust
```c
* Generated markdown by `cmark` and `code_block_token_count` function (In this case, `code_block_token_count` will be 5)
``````markdown
`````markdown
````markdown
```python
```rust
`````c
Thanks for the clarification and the examples, I think we are on the same page now.
Then it's a definitive "Yes," please submit a PR with a new free function that helps to set the correct backtick count based on a pre-scan of all known events.
The code_block_token_count
(which is confusingly named given that it is about backticks in code-blocks) documentation could also be adjusted to refer to the new function, whose own documentation could make clear what it does (maybe even by reusing some of the explanation used in this thread).
Thank you for your suggestion. I've opened #65.
@Byron
Sorry. I missed the problem of API design.
The API of #65 doesn't have the way to specify default backticks count.
In the current impl, it refers the hard-coded DEFAULT_CODE_BLOCK_TOKEN_COUNT
.
So even If we want to set backtick count to 3, count_code_block_tokens
surely returns 4 or more than 4.
In other word, when the function returns 4, we can't identify the following two cases.
codeblock_token_count
can be set to 3.codeblock_token_count
can't be set to 3, it must be 4 or more than.I think two solutions, but both are breaking change.
count_code_block_tokens
let code_block_token_count = count_code_block_tokens(events, 3);
Option<usize>
from count_code_block_tokens
let code_block_token_count = count_code_block_tokens(events).unwrap_or(3);
Alternatively the default value can be changed to 3. 3 is the minimum value allowed in spec.
lf user want to use 4, the follow code is available.
let code_block_token_count = count_code_block_tokens(events);
let code_block_token_count = std::cmp::max(code_block_token_count, 4);
This is not breaking change.
Thanks for bringing this up, and sorry for not catching it. I already yanked the previous release so the fix isn't breaking.
My preference is to go with Option<usize>
as return value, it seems most idiomatic and flexible.
Thanks again for your help.
Thank you. I've opened #66.
After #17, #18, code block backtick count is configurable. However, there is another problem.
Now let's consider such senario like below:
ex.) https://github.com/yumetodo/markdown_img_url_editor_rust/blob/master/src/lib.rs#L74
In this case, there is nothing to emulate block backtick count.
As a result, currently, when just bypass pulldown-cmark to pulldown-cmark-to-cmark, input and output is not equal.