KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

Unicode errors in corpus_token_model #10

Closed by dginev 6 years ago

dginev commented 6 years ago

The corpus token model run I am finishing exited prematurely with:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 459, error_len: Some(1) }', libcore/result.rs:916:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

It could be time to look into the pull requests and integrate #6.
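
For context, the panic message shows `unwrap()` being called on a failed UTF-8 conversion. The `valid_up_to` and `error_len` fields come from the standard library's `Utf8Error`. Below is a minimal sketch of decoding bytes without panicking, using only the Rust standard library. It is not the code from llamapun or rust-libxml; `decode_lenient` is a hypothetical helper name, and falling back to lossy decoding is an assumption about what behavior is acceptable for a corpus run.

```rust
use std::str;

/// Hypothetical helper (not part of llamapun or rust-libxml):
/// decode a byte slice as UTF-8 without panicking on invalid input.
fn decode_lenient(bytes: &[u8]) -> String {
    match str::from_utf8(bytes) {
        // Fast path: the input is already valid UTF-8.
        Ok(valid) => valid.to_owned(),
        Err(err) => {
            // Report the same fields that appear in the panic message,
            // then continue instead of aborting the whole run.
            eprintln!(
                "invalid UTF-8: valid_up_to={}, error_len={:?}",
                err.valid_up_to(),
                err.error_len()
            );
            // Assumption: replacing bad bytes with U+FFFD is acceptable here.
            String::from_utf8_lossy(bytes).into_owned()
        }
    }
}
```

Calling `decode_lenient(b"ab\xFFcd")` logs the error position and returns the string with the invalid byte replaced by U+FFFD, so a single malformed document would not end a long run. Whether lossy replacement or skipping the document is the better policy depends on the model being trained.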

dginev commented 6 years ago

This has been resolved, I believe mostly via the patches placed in rust-libxml. corpus_token_model now succeeds gracefully with the 08.2017 arXMLiv dataset.