Open endolith opened 1 year ago
Yep, you're absolutely right. This should be granular on a per-file basis. I can look into auto-detecting the encoding, but that might be time-consuming to do for every file, and it might be error-prone. In any case, v0.2 should have better controls for customizing how Semantra works per file.
Documentation says
But different files have different encodings. A Chinese PDF is read correctly and its characters show up correctly, but a .txt file in the same folder that's encoded in GB2312 is garbled in both the search results and the file display.
Probably it should default to detecting the encoding of each file independently and then converting it internally to whatever the embedding expects (UTF-8?)
https://pypi.org/project/chardet/
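
For illustration, here is a rough sketch of what per-file detection could look like. It is not how Semantra works today. The names `guess_encoding`, `decode_bytes`, and `read_text_file` are hypothetical, and the `latin-1` fallback is just an assumption so that decoding never raises. It tries strict UTF-8 first, then asks chardet (if installed) for a guess, and returns a normal Python string that can be re-encoded as UTF-8 wherever needed.

```python
from pathlib import Path

try:
    import chardet  # optional third-party dependency: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(raw):
    """Return chardet's best guess for the bytes, or None if unavailable."""
    if chardet is None:
        return None
    return chardet.detect(raw)["encoding"]


def decode_bytes(raw, detect=guess_encoding, fallback="latin-1"):
    """Decode raw bytes to str, returning (text, encoding_used).

    Hypothetical helper: strict UTF-8 first, then the detector's guess,
    then a lossy fallback so the caller always gets some text back.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    guess = detect(raw)
    if guess:
        try:
            return raw.decode(guess), guess
        except (UnicodeDecodeError, LookupError):
            # The guess was wrong or names a codec Python does not know.
            pass

    return raw.decode(fallback, errors="replace"), fallback


def read_text_file(path):
    """Read one file with its own detected encoding (per-file, not global)."""
    text, _encoding = decode_bytes(Path(path).read_bytes())
    return text
```

Detection on short files can still guess wrong, so a per-file override would probably still be useful alongside something like this.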