freedmand / semantra

Multi-tool for semantic search
MIT License
2.49k stars, 139 forks

Auto-detect encoding? #50

Open endolith opened 1 year ago

endolith commented 1 year ago

Documentation says

  • --encoding: Encoding to use for reading text files [default: utf-8]

But different files have different encodings. A Chinese PDF is read correctly and its characters display correctly, but a .txt file in the same folder that is encoded in GB2312 comes out garbled in both the search results and the file display.

It should probably default to detecting the encoding of each file independently and then converting the text internally to whatever the embedding expects (UTF-8?).

https://pypi.org/project/chardet/
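
To illustrate the per-file idea, here is a minimal sketch that uses only the standard library. It is not Semantra's code, and it swaps statistical detection (what chardet does) for a simpler fallback chain: try a strict UTF-8 decode first, then GB18030, which is a superset of GB2312. The function name `read_text_auto` and the candidate list are hypothetical, chosen only for this example.

```python
from pathlib import Path

# Hypothetical candidate list (an assumption for this sketch, not a Semantra setting).
# Order matters: the first encoding that decodes without error wins.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def read_text_auto(path, candidates=CANDIDATE_ENCODINGS):
    """Read one file and return (text, encoding_used).

    Each candidate is tried with strict decoding, so a file only matches
    an encoding if every byte sequence in it is valid for that encoding.
    """
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing matched: fall back to UTF-8 and mark undecodable bytes
    # with U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Calling `read_text_auto` on each file in the folder returns Python strings regardless of the source encoding, so everything downstream can assume Unicode. A fallback chain like this only handles the encodings it lists, and a lenient single-byte encoding placed early in the list would match almost anything, so a detector such as chardet is the more general option when the set of encodings is not known in advance.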

freedmand commented 1 year ago

Yep, you're absolutely right. This should be granular on a per-file basis. I can look into auto-detecting the encoding, but that might be time-consuming for every file, and it might be error-prone. In any case, v0.2 should have better controls for customizing how Semantra works per file.