Closed craigbarnes closed 9 years ago
Yes, this seems right. @nwellnhof what is your opinion? There was some related discussion on PR #2 (which was never merged).
The problem is that libcmark internally uses int
s for string lengths. So strings larger than 2GB aren't supported, even on most 64-bit platforms where int
is 32-bit. If we switched to size_t
in our public API, we'd have to make the range check at some point inside libcmark. If the check fails, all we could do is to abort with an error message because we don't propagate such error conditions in our API calls.
This would make things easiers for some users and I don't think it's a problem to throw an irrecoverable error in this case. Hopefully, we'll never see multi-gigabyte Markdown documents, but that's what people probably said about XML documents in the 90s.
Switching to size_t
breaks the API, of course, but I'd say better sooner than later. This also opens up the possibility to support 64-bit string buffers later without an API change.
I can offer to work on this and also fix all 64-bit to 32-bit conversions while I'm at it.
In addition to PR #2, we also have the problem that the strbuf code doesn't check for integer overflow of the buffer size.
Thanks, @nwellnhof, I agree that this seems a good approach.
Actually I think we may already be able to handle Markdown documents > 2 GB, even on 32 bits. We feed the parser in chunks of 4096 bytes, so for input we only need a buffer that big. Of course, if someone has a really huge code block or something, it could go over the limit. But I think we should be fine with ordinary documents.
[EDIT:] On second thought, the HTML renderer will overflow the buffer in such cases. Note, however, that one could in principle use a different renderer, which outputs stuff directly to a file instead of building up a buffer.
Yes, the parser should only be limited by the size of blocks. On 32-bit, it will run of memory after a couple of GBs, though.
The cmark_markdown_to_html function currently takes an
int
for thelen
parameter. Wouldn't it make more sense to use asize_t
? I'm currently having to do something like:...which seems a little contorted.