.text() fails to read an element's text if it contains just a Compatibility Ideograph, or fails to read it correctly if it starts with one

wareya commented 5 years ago

<character>
<literal>欄</literal>
<literal>欄</literal>
</character>

For some reason, .text() on the Node for the second tag fails to read the 欄. That character happens to be a compatibility codepoint for 欄, so I put 欄 in there as well; it doesn't cause the error. The location in the file doesn't matter. This is a cut-down failure case I ran into while trying to get some data out of a 15-megabyte XML dictionary: there were no problems at all until this character, which is very close to the end of the file.
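
Since the two characters render identically in most fonts, a small std-only sketch like the one below can help confirm which code point each <literal> actually holds. The specific code points (U+6B04 for the unified form, U+F91D for the compatibility form) are my assumption about the sample, and `code_points` and `in_cjk_compat_block` are hypothetical helper names. The block check is only approximate, because a few characters in U+F900..U+FAFF are not compatibility characters.

```rust
/// Format every char of `s` as a `U+XXXX` code point label.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// True if `c` lies in the CJK Compatibility Ideographs block (U+F900..=U+FAFF).
/// This is a block-range check only, not a full Unicode property lookup.
fn in_cjk_compat_block(c: char) -> bool {
    ('\u{F900}'..='\u{FAFF}').contains(&c)
}

fn main() {
    // Assumed code points for the two <literal> values in the sample.
    let unified = "\u{6B04}";
    let compat = "\u{F91D}";
    for text in [unified, compat].iter() {
        let flagged = text.chars().any(in_cjk_compat_block);
        println!("{:?} in compat block: {}", code_points(text), flagged);
    }
}
```

Running the same helpers over the text of each <literal> in the real file should show whether only the compatibility-block characters are affected.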

code:

use std::fs::File;
use std::io::Read;
use std::collections::HashMap;

extern crate roxmltree;

fn load_to_string(fname : &str) -> std::io::Result<String>
{
    let mut file = File::open(fname)?;
    let mut string = String::new();
    file.read_to_string(&mut string)?;
    return Ok(string);
}

fn main() -> Result<(), std::io::Error>
{
    let kanjidic = load_to_string("kanjidic2.xml")?;
    println!("{}", kanjidic);
    let mut mapping = HashMap::<String, i64>::new();
    match roxmltree::Document::parse(&kanjidic) {
        Ok(doc) =>
        {
            for character in doc.root().descendants().filter(|element| element.has_tag_name("character"))
            {
                for property in character.descendants().filter(|element| element.is_element())
                {
                    if property.has_tag_name("literal")
                    {
                        if let Some(text) = property.text()
                        {
                            // text was read successfully; nothing to do in this reduced case
                        }
                        else
                        {
                            panic!("literal at line {} position {} does not have recognizable text", property.node_pos().row, property.node_pos().col);
                        }
                    }
                }
            }
        }
        Err(e) =>
        {
            panic!("failed to parse: {:?}", e);
        }
    }

    Ok(())
}

wareya commented 5 years ago

Looks like it breaks on other compatibility ideographs too, like 蘭 and 卵.

If the character is preceded by ASCII, like " 卵", it finds the whole text. If it's followed by ASCII, like "卵 ", it fails to find the whole text and silently returns only the ASCII.
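
To make those three cases easy to retest, here is a rough std-only sketch that builds one small XML input per case. It assumes the character is U+F91C (the compatibility form of 卵), and `literal_xml` is a hypothetical helper, not part of roxmltree. The sketch only generates inputs and does not call the parser.

```rust
/// Wrap `text` in a <literal> inside a <character>, mirroring the sample above.
fn literal_xml(text: &str) -> String {
    format!("<character><literal>{}</literal></character>", text)
}

fn main() {
    // Assumption: U+F91C is the compatibility ideograph used in the comment above.
    let compat = '\u{F91C}';
    let cases = [
        ("alone", compat.to_string()),
        ("preceded by ASCII", format!(" {}", compat)),
        ("followed by ASCII", format!("{} ", compat)),
    ];
    for (label, text) in cases.iter() {
        // escape_unicode makes the look-alike character visible in the output.
        println!("{}: {} ({})", label, literal_xml(text), text.escape_unicode());
    }
}
```

Each generated string can then be passed to roxmltree::Document::parse and checked with .text() on the <literal> node, the same way the program above does it.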