Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Test including unicode characters #86

Closed bmschmidt closed 7 years ago

bmschmidt commented 8 years ago

There needs to be a basic test that includes Unicode characters. I just realized that most characters outside of ASCII were being silently dropped. Maybe a Hebrew Bible, or something.

theaidenlab commented 8 years ago

This would explain the issues we had trying to build the Sefaria bookworm.

Whadup commented 7 years ago

Maybe this is a good starting point? Each line in the downloaded quran.txt should be a document. http://tanzil.net/download

bmschmidt commented 7 years ago

Thanks for the bump here. That Quran's license forbids alteration, so I'm disinclined to chop it up for a test suite.

But I've added a test suite with just a little Arabic and a little Cherokee, so hopefully we can't accidentally lose Unicode support again.
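For readers who want to see the shape of such a check, here is a minimal sketch of a regression test that fails if non-ASCII text is silently dropped. It is not the test that was added to the repository: `tokenize`, `lost_characters`, and the sample strings are hypothetical stand-ins, and the regex tokenizer is only an assumed placeholder for whatever tokenizer is actually under test.

```python
import re

# Hypothetical stand-in for the tokenizer under test; not BookwormDB's actual API.
# For str patterns, \w matches Unicode letters and digits, not just ASCII.
WORD = re.compile(r"\w+")


def tokenize(text):
    """Split text into word tokens."""
    return WORD.findall(text)


def lost_characters(text):
    """Return the non-whitespace characters of `text` that appear in no token."""
    kept = set("".join(tokenize(text)))
    return {ch for ch in text if not ch.isspace() and ch not in kept}


# Illustrative samples only: short phrases without punctuation or diacritics.
SAMPLES = {
    "arabic": "السلام عليكم",
    "cherokee": "ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ",
}


def test_non_ascii_survives():
    for name, text in SAMPLES.items():
        # Every whitespace-separated word should come back as exactly one token.
        assert tokenize(text) == text.split(), name
        # No character should disappear on the way through the tokenizer.
        assert lost_characters(text) == set(), name
```

Swapping `tokenize` for the real tokenizer and running the function under a test runner such as pytest would surface the kind of silent loss described in this issue. Text with combining marks or punctuation would need looser assertions than the ones shown here.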