PoiScript / orgize

A Rust library for parsing org-mode files.
https://poiscript.github.io/orgize/
MIT License

Orgize validation fails when parsing certain unicode values #22

Closed calmofthestorm closed 7 months ago

calmofthestorm commented 4 years ago

In general I expect weird unicode values to get "interesting" results, but I'm going to report this since it results in a panic when debug_assertions are enabled.

Each of these characters, alone, as input, results in a panic in debug builds. I recommend running the example below with --release as otherwise calling parse will panic.

Up to you as to whether it's worth fixing. I saw you had a fuzz test in the source tree, so I assume crashes like this might be of interest, but I can also understand not wanting to go down the Unicode rabbit hole, and it's unclear to me how often these characters actually come up in real use.

The one or two I tested with org-element work correctly -- a headline containing them in the title is parsed correctly.

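fn main() {
    // Each of these characters, parsed alone, triggers the validation failure.
    let s = "\u{000b}\u{0085}\u{00a0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200a}\u{2028}\u{2029}\u{202f}\u{205f}\u{3000}";

    for (i, c) in s.chars().enumerate() {
        // Parse one character at a time; in debug builds this call panics.
        let org = orgize::Org::parse_string(c.to_string());
        // In release builds, run the validation explicitly and report the result.
        println!("Validation ok for {}: {}", i, org.validate().is_empty());
    }
}

PoiScript commented 4 years ago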

Thanks for reporting. Orgize automatically validates the parsed struct and panics if any error occurs. This validation is disabled in release mode to improve performance. As for the fuzz test, I believe it broke after I upgraded to the 2018 edition, and I keep forgetting to fix it.
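For readers unfamiliar with the pattern, here is a minimal sketch of validation that runs only in debug builds. It is not orgize's actual code: `Parsed`, `validate`, and `parse` are hypothetical names, and the single "empty title" rule is invented for illustration. The only real mechanism it relies on is the standard `debug_assert!` macro, whose condition is not evaluated when `debug_assertions` are off.

```rust
/// Hypothetical parse result, standing in for a real parsed document.
#[derive(Debug)]
struct Parsed {
    title: String,
}

/// Hypothetical validator: returns a list of problems, empty when all is well.
fn validate(parsed: &Parsed) -> Vec<String> {
    let mut errors = Vec::new();
    if parsed.title.trim().is_empty() {
        errors.push("title is empty after trimming".to_string());
    }
    errors
}

/// Hypothetical parser that self-checks its output in debug builds only.
fn parse(input: &str) -> Parsed {
    let parsed = Parsed {
        title: input.to_string(),
    };
    // Skipped entirely in release builds (e.g. `cargo run --release`).
    debug_assert!(
        validate(&parsed).is_empty(),
        "validation failed: {:?}",
        validate(&parsed)
    );
    parsed
}

fn main() {
    let parsed = parse("hello");
    println!("parsed title: {:?}", parsed.title);
    println!("errors: {:?}", validate(&parsed));
}
```

This mirrors the behaviour described in the report: in a release build the caller has to invoke the validator explicitly to see the errors.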

PoiScript commented 4 years ago

Oh, I see. I only check for ASCII whitespace in some functions, but str::trim actually removes both ASCII and Unicode whitespace.
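The mismatch can be reproduced with the standard library alone. The sketch below uses the characters from the original report; the constant `REPORTED` and the helper `definitions_disagree` are names made up for this example. It only shows that `char::is_ascii_whitespace` and `char::is_whitespace` give different answers for these characters, while `str::trim` follows the Unicode definition.

```rust
/// The characters from the report above.
const REPORTED: &str = "\u{000b}\u{0085}\u{00a0}\u{1680}\u{2000}\u{2001}\u{2002}\u{2003}\u{2004}\u{2005}\u{2006}\u{2007}\u{2008}\u{2009}\u{200a}\u{2028}\u{2029}\u{202f}\u{205f}\u{3000}";

/// Returns true when the ASCII and Unicode whitespace checks disagree on `c`.
fn definitions_disagree(c: char) -> bool {
    c.is_whitespace() != c.is_ascii_whitespace()
}

fn main() {
    for c in REPORTED.chars() {
        // `str::trim` strips Unicode whitespace, so a lone character may vanish.
        let trimmed_away = c.to_string().trim().is_empty();
        println!(
            "U+{:04X}: ascii={} unicode={} trimmed_away={} disagree={}",
            c as u32,
            c.is_ascii_whitespace(),
            c.is_whitespace(),
            trimmed_away,
            definitions_disagree(c)
        );
    }
}
```

Note that U+000B (vertical tab) is in the list even though it is an ASCII character: Rust's ASCII whitespace check does not include it, but the Unicode check does.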

PoiScript commented 4 years ago

This was fixed by ba9c83c, but I decided to keep this issue open as a reminder and close it once we replace every u8::is_ascii_whitespace with char::is_whitespace.
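As a rough illustration of what that replacement looks like, here is a sketch with two hypothetical helpers, `is_blank_ascii` and `is_blank_unicode`. Neither is taken from the orgize source, and the actual change in ba9c83c may look quite different. The point is only that a check built on `char::is_whitespace` stays consistent with `str::trim`, while a byte-wise ASCII check does not.

```rust
/// Hypothetical "before": byte-wise check that only knows ASCII whitespace.
fn is_blank_ascii(line: &str) -> bool {
    line.bytes().all(|b| b.is_ascii_whitespace())
}

/// Hypothetical "after": char-wise check using the Unicode definition.
fn is_blank_unicode(line: &str) -> bool {
    line.chars().all(char::is_whitespace)
}

fn main() {
    let samples = [" \t", "\u{00a0}", "\u{000b}", "\u{3000}x"];
    for s in samples.iter() {
        println!(
            "{:?}: ascii_blank={} unicode_blank={} trim_is_empty={}",
            s,
            is_blank_ascii(s),
            is_blank_unicode(s),
            s.trim().is_empty()
        );
    }
}
```

For every sample, `is_blank_unicode` matches `s.trim().is_empty()`, whereas `is_blank_ascii` reports a lone no-break space or vertical tab as non-blank.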