CommonMark?

stevengj commented 10 years ago

See this article. It would be interesting to see how the Markdown.jl parser etc. compares (in both performance and behavior) to the C99 CommonMark reference implementation.

hayd commented 9 years ago

https://github.com/jgm/CommonMark#running-tests-against-the-spec

To run the tests using an executable $PROG:
python3 test/spec_tests.py --program $PROG

hayd commented 9 years ago

I skimmed through the first 25% or so of failing tests with:

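# Note: spec_tests.py needs python3
# (readall is the Julia 0.3 spelling; on Julia 1.x use read(cmd, String))
using JSON, Markdown
tests = JSON.parse(readall(`python3 test/spec_tests.py --dump-tests`))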

function correct(test)
    try
        return Markdown.html(Markdown.parse(test["markdown"])) == test["html"]
    catch
        println("error in $(test["example"]): $(test["section"])")
        return false
    end
end

failing = filter(x -> !correct(x), tests)

# and to quickly look at results from a failing test
function check(n)
    t = failing[n]
    println(repr(t["markdown"]))                                 # input
    println(repr(t["html"]))                                     # expected html
    println(repr(Markdown.html(Markdown.parse(t["markdown"]))))  # actual html
end
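To see which parts of the spec account for most of the failures, the `failing` list can be tallied by its "section" key. This is only a rough sketch: `failures_by_section` is a made-up helper name, and it assumes each test is a dictionary with a "section" entry, as in the dump used above.

```julia
# Hypothetical helper (not part of Markdown.jl): count failing tests per spec section.
function failures_by_section(failing)
    counts = Dict{String,Int}()
    for t in failing
        s = t["section"]
        counts[s] = get(counts, s, 0) + 1
    end
    # Vector of "section => count" pairs, most failures first
    return sort(collect(counts); by = last, rev = true)
end
```

Calling `failures_by_section(failing)` then gives a quick idea of where to start fixing.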

So far I've found the following: cc @one-more-minute

[ ] tab expansion ("1\t22\t333\t4444\t5" becomes "1   22  333 4444    5")
[ ] new lines shouldn't be dropped e.g. "--\n**\n__\n" becomes "--\n**\n__\n"
[x] "--" shouldn't become emdash
[x] precedence of hr > list
[x] *** should parse to *** not *
[x] header underline syntax... but hr if there are spaces
[x] disallow h7+, just let the #s through
[x] ATX headers ("#5 bolt\n" becomes "#5 bolt\n")
[ ] indent before headers (Edit: not sure what I meant by this!)
[x] hashes shouldn't disappear ("# foo#\n") (not quite true, but have followed spec)
[x] empty headers are ok
[x] html escaping (may be escaping too much??)
[x] blank-line-separated code blocks should be a single code block
[x] ~ should also work as fenced code blocks
[ ] preserve indentation in fenced code blocks (not sure what I meant by that, it is preserved?)
[ ] code should have a new line if longer than one line
[x] single line fenced code blocks (I can do this easily once my other PR is merged)
[ ] ignoring html (this seems out of scope tbh)
[ ] named links (not sure good way to implement this, but is kinda needed as used everywhere)
[ ] escaped characters e.g. \!
[x] block quote can start with up to 3 spaces
[ ] block quote can run over several lines (aka "Laziness", not sure on a good strategy for this)
[ ] lists with more than one block inside (e.g. p, code, block)
[ ] lists preserve spacing
[ ] ordered list within block (can we not do nested atm??)

About 100 tests pass, over 400 fail (hopefully many are simple fixes, certainly most are related).
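As an illustration of the header items in the list above (no h7+, "#5 bolt" is not a header, empty headers are ok, up to 3 spaces of indent), here is a small sketch of the ATX rule as I read it from the spec. `atx_level` is a hypothetical name and this is not how Markdown.jl implements it.

```julia
# Hypothetical helper: return the ATX header level (1-6) of a line, or nothing.
# Rule assumed here: 0-3 leading spaces, 1-6 '#' characters, then a space/tab or end of line.
function atx_level(line::AbstractString)
    m = match(r"^ {0,3}(#{1,6})(?:[ \t]|$)", line)
    return m === nothing ? nothing : length(m.captures[1])
end
```

With this, `atx_level("#5 bolt")` and `atx_level("####### foo")` both give `nothing`, so those lines fall through as ordinary paragraph text with their #s intact.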

MikeInnes commented 9 years ago

Thanks for taking a look at that, that's a good list to have. What's quite nice is that (perhaps surprisingly) there are very few particularly major things missing.

The main exception is named links – I do have a way to implement them, but just haven't gotten round to it yet. I need to do a tiny bit of refactoring as well, I think.

It would be cool to have some kind of benchmark for performance as well, but I'm much less worried about Markdown.jl being crazy fast as long as it's not too slow to get the job done.

hayd commented 9 years ago

@one-more-minute a word of warning: I only got 25% of the way through the list (so this could double/triple)! Will append anything else major that pops out. I agree it shouldn't be too bad - it's great to have a thorough test/perf. There were a couple of things that raise errors; IIRC they came from named links.

tab expansion may also be tricky, not sure how to do that. The example I gave above didn't render as I expected (FIXED)! You seemingly need to count chars as you render (or as you parse??).... the game is "render tabs as spaces as if tab stops were length 4".
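A minimal sketch of that "count chars as you go" idea, assuming tab stops of width 4 and that it runs on one line at a time. `expand_tabs` is a hypothetical helper, not something in Markdown.jl, and it counts every character as one column (so wide Unicode characters are not handled).

```julia
# Hypothetical helper: replace each tab with spaces up to the next tab stop.
function expand_tabs(line::AbstractString; tabstop::Int = 4)
    out = IOBuffer()
    col = 0                              # current column, 0-based
    for c in line
        if c == '\t'
            n = tabstop - col % tabstop  # spaces needed to reach the next stop
            print(out, " "^n)
            col += n
        else
            print(out, c)
            col += 1
        end
    end
    return String(take!(out))
end
```

For the example above, `expand_tabs("1\t22\t333\t4444\t5")` pads each field out to the next multiple of 4 columns.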

I don't yet get the subtleties of escaping html characters but allowing some html...

I have fixed a couple of minor things in html rendering, will PR when I go through the entire list.

hayd commented 9 years ago

Ah, maybe the tab expansion should happen prior to main parsing, then it's a bit easier...

I have checked off some of these (the easy ones), which are fixed in a local Julia branch.

hayd commented 9 years ago

I was sure there was an issue about CRLF ~~but I can't find it~~ Edit: here. I was wondering if the text should go through:

replace(..., r"\r(\n)?", '\n')

not sure how this would be done with the stream model.
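One possible way to do it on a stream, sketched under the assumption that the parser can be handed any `IO` object: read the input once, rewrite "\r\n" and lone "\r" to "\n", and give the parser the cleaned-up buffer. `normalize_newlines` is a made-up name, and this uses Julia 1.x calls rather than what was available at the time.

```julia
# Hypothetical pre-pass: copy `io` into a new buffer with CRLF and lone CR turned into LF.
function normalize_newlines(io::IO)
    out = IOBuffer()
    prev_cr = false                      # was the previous character a '\r'?
    while !eof(io)
        c = read(io, Char)
        if c == '\r'
            write(out, '\n')
            prev_cr = true
        elseif c == '\n'
            prev_cr || write(out, '\n')  # skip the '\n' of a "\r\n" pair
            prev_cr = false
        else
            write(out, c)
            prev_cr = false
        end
    end
    return seekstart(out)                # rewind so the caller can read from the start
end
```

The cost is one extra copy of the input, but the parser itself never has to think about '\r'.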

JuliaAttic / Markdown.jl

CommonMark? #7