brianmario / charlock_holmes

Character encoding detection, brought to you by ICU
MIT License

Best practice for large files? #159

Open machty opened 3 years ago

machty commented 3 years ago

The charlock_holmes API seems to be string-centric, but if I have a 50 MB file that consists mostly of typical alphabetic/ASCII characters and has only a few non-ASCII characters to distinguish the encodings, what's the best way to detect the entire file's encoding without loading the whole 50 MB (or larger) file into memory?
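
One possible approach I've been considering is sketched below: sample the file instead of reading it all, keeping the first chunk plus any later chunks that actually contain non-ASCII bytes, then hand that bounded sample to `CharlockHolmes::EncodingDetector.detect`. The `detect_file_encoding` helper, its parameters, and the chunk/sample sizes are hypothetical and not part of the charlock_holmes API.

```ruby
require 'charlock_holmes'

# Hypothetical helper: build a small sample from a large file rather than
# reading the whole thing into memory. Keeps the first chunk plus any later
# chunks containing non-ASCII bytes, up to a cap, then runs the detector on
# that sample only.
def detect_file_encoding(path, chunk_size: 64 * 1024, max_sample: 1024 * 1024)
  sample = ''.b

  File.open(path, 'rb') do |io|
    first = io.read(chunk_size)
    sample << first if first

    while sample.bytesize < max_sample && (chunk = io.read(chunk_size))
      # Pure-ASCII chunks add no detection signal, so skip them.
      sample << chunk unless chunk.ascii_only?
    end
  end

  CharlockHolmes::EncodingDetector.detect(sample)
end

# puts detect_file_encoding('big_file.txt').inspect
# => e.g. {:type=>:text, :encoding=>"UTF-8", :confidence=>80}
```

One obvious caveat with this sketch is that chunk boundaries can split a multi-byte sequence in half, which could skew the detector's confidence, so I'm not sure whether something like this is reasonable or whether there's a recommended way to feed large files to the detector.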