cmark_markdown_to_html len parameter should be size_t

craigbarnes commented 9 years ago

The cmark_markdown_to_html function currently takes an int for the len parameter. Wouldn't it make more sense to use a size_t? I'm currently having to do something like:

#include <assert.h>
#include <limits.h>
#include <cmark.h>

char *example(const char *input, size_t length) {
    assert(length < INT_MAX); // avoid int overflow
    return cmark_markdown_to_html(input, (int)length, CMARK_OPT_DEFAULT);
}

...which seems a little contorted.

jgm commented 9 years ago

Yes, this seems right. @nwellnhof what is your opinion? There was some related discussion on PR #2 (which was never merged).

nwellnhof commented 9 years ago

The problem is that libcmark internally uses ints for string lengths. So strings larger than 2GB aren't supported, even on most 64-bit platforms where int is 32-bit. If we switched to size_t in our public API, we'd have to make the range check at some point inside libcmark. If the check fails, all we could do is to abort with an error message because we don't propagate such error conditions in our API calls.

This would make things easiers for some users and I don't think it's a problem to throw an irrecoverable error in this case. Hopefully, we'll never see multi-gigabyte Markdown documents, but that's what people probably said about XML documents in the 90s.

Switching to size_t breaks the API, of course, but I'd say better sooner than later. This also opens up the possibility to support 64-bit string buffers later without an API change.

I can offer to work on this and also fix all 64-bit to 32-bit conversions while I'm at it.

In addition to PR #2, we also have the problem that the strbuf code doesn't check for integer overflow of the buffer size.

jgm commented 9 years ago

Thanks, @nwellnhof, I agree that this seems a good approach.

jgm commented 9 years ago

Actually I think we may already be able to handle Markdown documents > 2 GB, even on 32 bits. We feed the parser in chunks of 4096 bytes, so for input we only need a buffer that big. Of course, if someone has a really huge code block or something, it could go over the limit. But I think we should be fine with ordinary documents.

[EDIT:] On second thought, the HTML renderer will overflow the buffer in such cases. Note, however, that one could in principle use a different renderer, which outputs stuff directly to a file instead of building up a buffer.

nwellnhof commented 9 years ago

Yes, the parser should only be limited by the size of blocks. On 32-bit, it will run of memory after a couple of GBs, though.

commonmark / cmark

cmark_markdown_to_html len parameter should be size_t #53