brianmario / charlock_holmes

Character encoding detection, brought to you by ICU
MIT License

Ruby v2.7.7 conflicts with Charlock_holmes converting file from euc_kr (or CP949) to UTF-8 #168

Closed: erados closed this issue 10 months ago

erados commented 1 year ago

I'm encountering an error where U_FILE_ACCESS_ERROR is being returned.

I suspect that a change to file.c in Ruby v2.7.7 is the root cause. I didn't encounter any issues with Ruby versions prior to 2.7.7.

Please don't hesitate to reach out if you require additional information.

For your reference, I've attached a test file that showcases the issue. https://drive.google.com/file/d/12-Zs8KWlu5A5U-WcyMkdUvdj14Qhm6F6/view?usp=drive_link

erados commented 10 months ago

For those facing a similar issue: the underlying problem is that Charlock Holmes cannot always determine the encoding with full confidence, particularly when the file contains only a small amount of data.

To address this, I now choose the encoding based on two factors: the confidence level reported by Charlock Holmes and the language header that the user specifies.

Combining the two resolved the issue for me, and it keeps working when the file holds too little data for reliable detection.
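A minimal sketch of what such a decision rule could look like, not the code actually used in this issue. It assumes the detection result has the hash shape returned by `CharlockHolmes::EncodingDetector.detect` (`:encoding`, `:confidence`), and that the "language header" is a tag such as `ko` or `ko-KR`. The names `choose_encoding`, `FALLBACK_ENCODINGS`, and `MIN_CONFIDENCE`, the threshold of 80, and the fallback table are all hypothetical.

```ruby
# Hypothetical fallback table: language tag prefix -> legacy encoding to assume
# when detection is not confident enough. Extend it for your own data.
FALLBACK_ENCODINGS = {
  "ko" => "EUC-KR",
  "ja" => "Shift_JIS"
}.freeze

# Assumed threshold (CharlockHolmes reports confidence as 0..100); tune as needed.
MIN_CONFIDENCE = 80

# detection:     hash shaped like the result of
#                CharlockHolmes::EncodingDetector.detect, or nil
# language_hint: user-supplied language tag such as "ko" or "ko-KR", or nil
def choose_encoding(detection, language_hint, min_confidence: MIN_CONFIDENCE)
  detected   = detection && detection[:encoding]
  confidence = detection ? detection[:confidence].to_i : 0

  # Trust the detector when it is confident enough.
  return detected if detected && confidence >= min_confidence

  # Otherwise prefer the encoding implied by the language hint, then the
  # low-confidence guess, then UTF-8 as a last resort.
  language = language_hint.to_s.downcase[0, 2]
  FALLBACK_ENCODINGS.fetch(language, detected || "UTF-8")
end
```

With the gem installed, this could be wired up roughly as follows (the file name is a placeholder):

```ruby
require "charlock_holmes"

contents  = File.binread("sample.csv")
detection = CharlockHolmes::EncodingDetector.detect(contents)
encoding  = choose_encoding(detection, "ko-KR")
utf8      = CharlockHolmes::Converter.convert(contents, encoding, "UTF-8")
```

Keeping the decision in a plain function that takes a hash makes it easy to test without ICU or the gem present.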

Feel free to adopt this strategy if you encounter similar encoding-related challenges and please share yours too!