RazrFalcon / roxmltree

Represent an XML document as a read-only tree.
Apache License 2.0

.text() fails to read an element's text if it contains just a Compatibility Ideograph, or fails to read it correctly if it starts with one #3

Closed wareya closed 5 years ago

wareya commented 5 years ago

<character>
<literal>欄</literal>
<literal>欄</literal>
</character>

For some reason, .text() on the Node of the second tag fails to read the 欄 (U+F91D). U+F91D happens to be a compatibility codepoint for 欄 (U+6B04), so I dropped U+6B04 in there as well, in the first tag; U+6B04 doesn't cause the error. The location doesn't matter. This is a cut-down failure case I ran into while trying to get some data out of a 15 megabyte XML dictionary. There were no problems at all until this character, which is very close to the end of the file.

code:

use std::fs::File;
use std::io::Read;
use std::collections::HashMap;

extern crate roxmltree;

fn load_to_string(fname : &str) -> std::io::Result<String>
{
    let mut file = File::open(fname)?;
    let mut string = String::new();
    file.read_to_string(&mut string)?;
    return Ok(string);
}

fn main() -> Result<(), std::io::Error>
{
    let kanjidic = load_to_string("kanjidic2.xml")?;
    println!("{}", kanjidic);
    let mut mapping = HashMap::<String, i64>::new();
    match roxmltree::Document::parse(&kanjidic) {
        Ok(doc) =>
        {
            for character in doc.root().descendants().filter(|element| element.has_tag_name("character"))
            {
                for property in character.descendants().filter(|element| element.is_element())
                {
                    if property.has_tag_name("literal")
                    {
                        if let Some(text) = property.text()
                        {
                            //
                        }
                        else
                        {
                            panic!("literal at line {} position {} does not have recognizable text", property.node_pos().row, property.node_pos().col);
                        }
                    }
                }
            }
        }
        Err(e) =>
        {
            panic!("failed to parse: {:?}", e);
        }
    }

    Ok(())
}

wareya commented 5 years ago

Looks like it breaks on other compatibility ideographs too, like 蘭 and 卵.

If the character is preceded by ASCII, like " 卵", it finds the whole text. If it's followed by ASCII, like "卵 ", it fails to find the whole text and silently returns just the ASCII.
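Since a compatibility ideograph usually renders identically to its unified counterpart, it can help to dump the code points and UTF-8 bytes of a string when cutting down a case like this. The following is a minimal, std-only sketch; `is_cjk_compat_ideograph` and `describe` are hypothetical helper names, not part of roxmltree, and the check only covers the CJK Compatibility Ideographs block (U+F900 to U+FAFF).

```rust
/// True if `c` lies in the CJK Compatibility Ideographs block (U+F900..=U+FAFF).
/// Hypothetical helper for illustration; not a roxmltree API.
fn is_cjk_compat_ideograph(c: char) -> bool {
    ('\u{F900}'..='\u{FAFF}').contains(&c)
}

/// Describes each char of `text` as "U+XXXX [UTF-8 bytes in hex]",
/// appending " (compatibility)" for chars in the block checked above.
fn describe(text: &str) -> Vec<String> {
    text.chars()
        .map(|c| {
            // Encode the single char to UTF-8 to inspect its bytes.
            let mut buf = [0u8; 4];
            let encoded: &str = c.encode_utf8(&mut buf);
            let hex: Vec<String> = encoded.bytes().map(|b| format!("{:02X}", b)).collect();
            let tag = if is_cjk_compat_ideograph(c) { " (compatibility)" } else { "" };
            format!("U+{:04X} [{}]{}", c as u32, hex.join(" "), tag)
        })
        .collect()
}

fn main() {
    // The two <literal> values from the report, written as escapes
    // so that they cannot be confused visually.
    for line in describe("\u{6B04}\u{F91D}") {
        println!("{}", line);
    }
}
```

For the two characters from the report this prints `U+6B04 [E6 AC 84]` and `U+F91D [EF A4 9D] (compatibility)`.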

RazrFalcon commented 5 years ago

Hmm... libxml2 (via lxml) also parses it differently:

Document:
  - Element:
      tag_name: character
      children:
        - Text: "\n"
        - Element:
            tag_name: literal
            children:
              - Text: "\u6b04"
        - Element:
            tag_name: literal
            children:
              - Text: "\uf91d"
        - Text: "\n"

You can try it yourself using testing-tools/lxml-ast.py.

RazrFalcon commented 5 years ago

But it should be empty anyway...

RazrFalcon commented 5 years ago

D'oh! It's actually an xmlparser bug. 欄 (U+F91D) is encoded in UTF-8 as EF A4 9D, so it starts with 0xEF, which xmlparser treats as the start of a UTF-8 BOM (EF BB BF)... Will fix it soon.
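To illustrate the explanation above, here is a minimal, std-only sketch of how a first-byte-only BOM check can swallow a legitimate character. Both functions are hypothetical and written for this illustration; they are not the actual xmlparser code, and the "buggy" variant is only an assumption about the kind of check that would produce the reported behaviour.

```rust
/// Assumed shape of the bug (NOT the real xmlparser code): treat any text
/// whose first byte is 0xEF as starting with a BOM and skip three bytes.
fn skip_bom_first_byte_only(text: &str) -> &str {
    if text.as_bytes().first() == Some(&0xEF) {
        // In valid UTF-8, a leading 0xEF always starts a 3-byte sequence,
        // so index 3 is a char boundary and this slice cannot panic.
        &text[3..]
    } else {
        text
    }
}

/// Stricter check: strip only a real BOM, i.e. U+FEFF (bytes EF BB BF).
fn skip_bom(text: &str) -> &str {
    text.strip_prefix('\u{FEFF}').unwrap_or(text)
}

fn main() {
    let compat = "\u{F91D}"; // the compatibility ideograph from the report
    let expected: [u8; 3] = [0xEF, 0xA4, 0x9D];
    assert_eq!(compat.as_bytes(), &expected[..]);

    // The first-byte-only check drops the whole character...
    assert_eq!(skip_bom_first_byte_only(compat), "");
    // ...or leaves only the trailing ASCII, as described earlier in the thread.
    assert_eq!(skip_bom_first_byte_only("\u{F91C} "), " ");
    // Text that starts with ASCII is unaffected.
    assert_eq!(skip_bom_first_byte_only(" \u{F91C}"), " \u{F91C}");

    // The stricter check keeps the character and still strips a real BOM.
    assert_eq!(skip_bom(compat), compat);
    assert_eq!(skip_bom("\u{FEFF}<a/>"), "<a/>");
    println!("all checks passed");
}
```

Every code point from U+F000 to U+FFFF encodes with a leading 0xEF byte, which would explain why the other compatibility ideographs mentioned earlier in the thread behave the same way.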

RazrFalcon commented 5 years ago

Fixed via https://github.com/RazrFalcon/xmlparser/commit/409aedd1b4bf2d4b3695d9c6303ea241841b5de8

RazrFalcon commented 5 years ago

Published v0.2.0. It also has some breaking changes.