ebarnard / rust-plist

A rusty plist parser.
MIT License
71 stars 42 forks source link

Add Ascii Reader #44

Closed fstephany closed 3 months ago

fstephany commented 4 years ago

As discussed in issue #42, here's a first pass on an ASCII reader. I haven't used the fuzzer. If you have direction on how to setup and test it, I would be glad to add id.

ebarnard commented 4 years ago

The fuzzer is fairly straightforward, we use cargo-fuzz. If you follow the (minimal) guide to get it installed and then create a new fuzz-target based on the xml_reader.rs target and run it that should just work. You can then put some valid ascii plists in the corpus folder to speed things up a bit if you want.

madig commented 4 years ago

Woah! I am interested in this. I'm working in the font design space and the dominant font design application https://glyphsapp.com uses a "dialect" of the old ASCII plist format for primary storage, with some adjustment here and there "to save bytes". I.e. it will leave quotes out if something is probably either an integer or string. @anthrotype created a Python/Cython implementation at https://github.com/fonttools/openstep-plist that handles these.

I am going to play around with this PR and see how far I can get.

fstephany commented 4 years ago

@madig That's another use case indeed ;) If you have some sample files for testing, it could be nice to check that the code is covering all the cases. If you want to port some unit tests from the Python lib, go ahead!

@ebarnard Thanks a lot for the comments. If you see further things to change do not hesitate. Even if it is code style conformance.

I'm not that sure how to bubble the IO errors from read_exact so I've just added another ErrorKind. If you had something clever in mind, I'll change my naive implementation.

madig commented 4 years ago

@fstephany: On it. I have test files over at https://github.com/madig/glyphsrw-rs/tree/master/data that I'm feeding into my toy implementation.

[Long code listing deleted because of irrelevance]

Edit: The optional interpretation of unquoted strings may need to optional maybe :thinking:

fstephany commented 4 years ago

Oh, this is much more involved than what I'm currently doing:

match Integer::from_str(&string_literal) {
       Ok(i) => Ok(Some(Event::Integer(i))),
       Err(_) => Ok(Some(Event::String(string_literal))),
}

I just accumulate the characters and then try to convert the resulting string to an integer. I do not support floats (actually, even Integer are not part of the Apple spec).

madig commented 4 years ago

Fair enough. This may be overkill for anything not Glyphs.

madig commented 4 years ago

(If there is interest in upstreaming the more complicated logic, maybe this guessing should be optional and off by default since it is not part of the spec.)

madig commented 4 years ago

(Ugh, ignore my code above. Your code is more robust: either it parses as an int (or maybe a float) or it doesn't and then is a string)

madig commented 4 years ago

Hm. Strings with escaped quotes aren't handled properly right now, example:

{
filter = "subCategory == \"Currency\"";
locked = 1;
name = currency;
position = "{-196, 760}";
}

Edit: Let's see if I can implement the tests at https://github.com/fonttools/openstep-plist/blob/master/tests/test_parser.py#L135-L152.

fstephany commented 4 years ago

@madig I've fixed it. Can you check the simple test case to be sure that it was the expected behavior ?

That makes me thing that I should add unit tests for invalid inputs. I don't know if I should completely reject malformed files (e.g., missing comma in a list) or be tolerant. I feel like this format has been abused by so many tools with their own conventions that it might be more effective to accept inputs even if they don't comply 100% with the format.

madig commented 4 years ago

Thanks! Can do at some point this week. Yeah, I want to look into tests for this at some point. There's more stuff like escape sequences (think "hello\012world") and some tricky handling of Unicode escapes that openstep-plist handles.

ebarnard commented 4 years ago

Dialects

This library aims to be minimal and focussed on parsing Apple spec Plists. This means that a Plist that follows Apple's spec should always parse per the spec.

I'm happy to support dialects/extensions so long as they do not affect the parsing of Apple spec Plists and have a simple implementation.

Extensions that require complex code to implement are, unfortunately, out of scope for this library. Any extension that requires a configuration option falls into this category.

@madig I'd be happy to support the dialect you've made for glyphsapp provided we can get it parsing without configuration options.

Malformed Files

@fstephany I'm not keen to make an effort to parse obviously malformed files as I don't want people to rely on how this library interprets them, which could change as bugs are fixed or new extensions added. I'd rather we work to explicitly support various common dialects (per the limitations above).

madig commented 4 years ago

@ebarnard thanks for the feedback. I understand that if I wanted any crazy parsing, I'd need to make my own library (I did not come up with that dialect, I just need to parse it :grin:). I think the code should be able to handle the Glyphs dialect by simply not interpreting strings as anything and instead leaving that to client code. I think that is how the specification intends it. What may be needed is slightly more complex parsing of strings like marked in https://github.com/fonttools/openstep-plist/blob/master/tests/test_parser.py#L135-L152. I can see if I can implement that when I get some time. Maybe a fuzzer could find even more corner cases :thinking:

ebarnard commented 4 years ago

@madig Those examples all look to be supported by Apple's parser as well: https://pewpewthespells.com/blog/dangers_of_ascii_plists.html. It's probably worth reading the whole of CFOldStylePlist.c to check exactly what escapes it supports as \ escaped as \\ is notable absent from the code listing on the blog post.

The only thing we would be unable to support is unpaired surrogates are these are not supported in UTF8 and therefore are not surpported by Rust's str or String. As I can't imagine a reason for using them it's probably fine to error in that case.

madig commented 4 years ago

The test should probably be

    #[test]
    fn escaped_sequences_in_strings() {
        let plist = r#"{
            key1 = "va\"lue";
            key2 = 'va"lue';
            key3 = "va\a\b\f\n\r\t\v\"\nlue";
            key4 = "a\012b";
            key5 = "\\UD83D\\UDCA9";
            key6 = "\UD83D\UDCA9";
        }"#;
        let cursor = Cursor::new(plist.as_bytes());
        let streaming_parser = AsciiReader::new(cursor);
        let events: Vec<Event> = streaming_parser.map(|e| e.unwrap()).collect();

        let comparison = &[
            StartDictionary(None),
            String("key1".to_owned()),
            String(r#"va"lue"#.to_owned()),
            String("key2".to_owned()),
            String(r#"va"lue"#.to_owned()),
            String("key3".to_owned()),
            String("va\u{07}\u{08}\u{0C}\n\r\t\u{0B}\"\nlue".to_owned()),
            String("key4".to_owned()),
            String("a\nb".to_owned()),
            String("key5".to_owned()),
            String("\\UD83D\\UDCA9".to_owned()),
            String("key6".to_owned()),
            String("💩".to_owned()),
            EndCollection,
        ];

        assert_eq!(events, comparison);
    }

Parses with openstep-plist:

{'key1': 'va"lue',
 'key2': 'va"lue',
 'key3': 'va\x07\x08\x0c\n\r\t\x0b"\nlue',
 'key4': 'a\nb',
 'key5': '\\UD83D\\UDCA9',
 'key6': '💩'}

Next step: implement the equivalent of https://github.com/opensource-apple/CF/blob/master/CFOldStylePList.c#L123-L171. Decoding table for values 128-255 at https://github.com/fonttools/openstep-plist/blob/master/src/openstep_plist/parser.pyx#L87-L106.

anthrotype commented 1 year ago

What is left for this PR to reach mergeable state, besides fixing the merge conflicts? Is @fstephany still interested in picking this one up?

ebarnard commented 1 year ago

I think the only thing missing was correct handling of escapes per @madig's comment, and some tests for those escapes.

fstephany commented 1 year ago

Oh, I forgot about this. Sorry for letting it slip :/ I've forgot about the details, @anthrotype go ahead if you need it!

steven-joruk commented 7 months ago

I'd like to see this feature land. @anthrotype @fstephany if you don't mind I'd be happy to finish it off, and if either of you have any branches with outstanding work I can take it from there, let me know.

steven-joruk commented 7 months ago

I still need to add escape handling and I'll go through all the previous comments to check there's nothing else, but wanted to point out two things. My branch here.

The commit that fixes one of the serde tests highlighted that documents that don't begin with exactly <?xml will be treated as ascii, which might not be desirable.

The master branch is broken due to denying warnings and a deprecation, I updated to swap_remove in my branch.

error: use of deprecated method `indexmap::IndexMap::<K, V, S>::remove`: `remove` disrupts the map order -- use `swap_remove` or `shift_remove` for explicit behavior.
  --> src/dictionary.rs:70:18
   |
70 |         self.map.remove(key)
steven-joruk commented 7 months ago

My branch has been updated to include the escape sequence handling and the unicode mappings. It passes the tests provided by madig.

I had to allow escaping \ (\\) because the test that parses netnewswire.pbxproj fails without it.

@ebarnard Would you prefer me to open a new PR? I don't think I'll be able to address any code review comments without assistance if we continue in this PR.

pronebird commented 7 months ago

It would be amazing to see this land. It would be very useful with some of apple CLIs like launchctl that return old style plists in stdout.

ebarnard commented 7 months ago

@steven-joruk If you can't push to this branch you should definitely open a new PR and I'll look at it afresh. I can't remember much of this one.

ebarnard commented 3 months ago

Closed via #136.