Open rstm-sf opened 4 years ago
In the Status Log, the following metrics are the same:
SBCS 0.8360017: [iso-8859-15] SBCS: 0.8360017 [iso-8859-15]
SBCS 0.8360017: [iso-8859-1] SBCS: 0.8360017 [iso-8859-1]
SBCS 0.8360017: [windows-1252] SBCS: 0.8360017 [windows-1252]
It corresponds to one language:
Identical values also appear in the log for other languages.
As I understand it, character sets this similar can easily produce identical statistics: https://en.wikipedia.org/wiki/ISO-8859-1#Similar_character_sets
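To make the similarity concrete, here is a small Python sketch (not part of the library) that lists the byte values that decode differently under ISO-8859-1 and ISO-8859-15, using Python's built-in `latin_1` and `iso8859_15` codecs. The helper name `differing_bytes` is made up for this illustration. A text that contains none of these bytes is byte-for-byte identical in both encodings, which would be consistent with the equal confidence values in the log.

```python
def differing_bytes(enc_a, enc_b):
    """Return the byte values that decode to different characters."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        if raw.decode(enc_a) != raw.decode(enc_b):
            diffs.append(value)
    return diffs


diffs = differing_bytes("latin_1", "iso8859_15")

# Show each differing byte with the character it maps to in each encoding.
for value in diffs:
    raw = bytes([value])
    print(f"0x{value:02X}: {raw.decode('latin_1')!r} vs {raw.decode('iso8859_15')!r}")
```

Only a handful of positions differ (for example 0xA4, the currency sign in ISO-8859-1 and the euro sign in ISO-8859-15), so a statistical model has very little to go on unless the input happens to use them.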
Can we come up with a workaround or will we have to do as in #80?
It seems that, to keep the option of narrowing down the encoding later, we need to change the API so that it returns a collection of results. That way we can return all the encodings that share the same confidence.
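A rough illustration of that idea, written in Python rather than the library's own language: instead of a single best match, the detector would return every candidate whose confidence ties with the top score. The `Candidate` type, the `top_candidates` function, and the tolerance value are hypothetical names for this sketch, not the library's API; the sample values are copied from the Status Log in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical result type: one charset with its confidence."""
    charset: str
    confidence: float


def top_candidates(candidates, tolerance=1e-6):
    """Return all candidates whose confidence ties with the best one."""
    if not candidates:
        return []
    best = max(c.confidence for c in candidates)
    # Keep input order so the caller sees candidates as the probers reported them.
    return [c for c in candidates if best - c.confidence <= tolerance]


# A few entries taken from the Status Log.
log_sample = [
    Candidate("iso-8859-15", 0.8360017),
    Candidate("iso-8859-1", 0.8360017),
    Candidate("windows-1252", 0.8360017),
    Candidate("iso-8859-13", 0.43422332),
    Candidate("iso-8859-1", 0.60846615),
]

tied = top_candidates(log_sample)
print([c.charset for c in tied])
```

With a result like this, the caller could apply its own preference order among the tied charsets instead of receiving whichever one happened to be probed first.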
So we could fix this with a breaking change?
As far as I remember, my last idea was to look at how the coefficients are compiled in order to make detection more accurate, but that does not seem to be an easy task.
The proposed option of returning similar encodings is only a possible workaround.
@rstm-sf could we fix this for 3.0?
Hello!
The encoding is detected as 'iso-8859-15' instead of 'iso-8859-1'.
Test file: iso-8859-1.txt from the uchardet tests.
Status Log
Get confidence:
-- new match found: confidence 0.01, index 0, charset windows-1251.
-- new match found: confidence 0.05902827, index 6, charset iso-8859-7.
-- new match found: confidence 0.067115635, index 13, charset tis-620.
-- new match found: confidence 0.3858822, index 15, charset iso-8859-1.
-- new match found: confidence 0.40375984, index 18, charset iso-8859-1.
-- new match found: confidence 0.41295946, index 21, charset iso-8859-2.
-- new match found: confidence 0.42356956, index 23, charset iso-8859-1.
-- new match found: confidence 0.8360017, index 32, charset iso-8859-15.
Get confidence done.
SBCS Group Prober --------begin status
SBCS 0.01: [windows-1251] SBCS: 0.01 [windows-1251]
SBCS 0.01: [koi8-r] SBCS: 0.01 [koi8-r]
SBCS 0.01: [iso-8859-5] SBCS: 0.01 [iso-8859-5]
SBCS 0.01: [x-mac-cyrillic] SBCS: 0.01 [x-mac-cyrillic]
SBCS 0.01: [ibm866] SBCS: 0.01 [ibm866]
SBCS 0.01: [ibm855] SBCS: 0.01 [ibm855]
SBCS 0.05902827: [iso-8859-7] SBCS: 0.05902827 [iso-8859-7]
SBCS 0.05902827: [windows-1253] SBCS: 0.05902827 [windows-1253]
SBCS 0.01: [iso-8859-5] SBCS: 0.01 [iso-8859-5]
SBCS 0: [windows-1251] SBCS: 0.00 [windows-1251]
SBCS 0: [windows-1255] HEB: 0 - 0 [Logical-Visual score]
SBCS 0: [windows-1255] SBCS: 0.00 [windows-1255]
SBCS 0: [windows-1255] SBCS: 0.00 [windows-1255]
SBCS 0.067115635: [tis-620] SBCS: 0.06711563 [tis-620]
SBCS 0.067115635: [iso-8859-11] SBCS: 0.06711563 [iso-8859-11]
SBCS 0.3858822: [iso-8859-1] SBCS: 0.3858822 [iso-8859-1]
SBCS 0.3858822: [iso-8859-15] SBCS: 0.3858822 [iso-8859-15]
SBCS 0.3858822: [windows-1252] SBCS: 0.3858822 [windows-1252]
SBCS 0.40375984: [iso-8859-1] SBCS: 0.4037598 [iso-8859-1]
SBCS 0.40375984: [iso-8859-15] SBCS: 0.4037598 [iso-8859-15]
SBCS 0.40375984: [windows-1252] SBCS: 0.4037598 [windows-1252]
SBCS 0.41295946: [iso-8859-2] SBCS: 0.4129595 [iso-8859-2]
SBCS 0.41295946: [windows-1250] SBCS: 0.4129595 [windows-1250]
SBCS 0.42356956: [iso-8859-1] SBCS: 0.4235696 [iso-8859-1]
SBCS 0.42356956: [windows-1252] SBCS: 0.4235696 [windows-1252]
SBCS 0.41898435: [iso-8859-3] SBCS: 0.4189844 [iso-8859-3]
SBCS 0.38790238: [iso-8859-3] SBCS: 0.3879024 [iso-8859-3]
SBCS 0.38790238: [iso-8859-9] SBCS: 0.3879024 [iso-8859-9]
SBCS inactive: [iso-8859-6] (i.e. confidence is too low).
SBCS 0: [windows-1256] SBCS: 0.00 [windows-1256]
SBCS 0.16577692: [viscii] SBCS: 0.1657769 [viscii]
SBCS 0.18163893: [windows-1258] SBCS: 0.1816389 [windows-1258]
SBCS 0.8360017: [iso-8859-15] SBCS: 0.8360017 [iso-8859-15]
SBCS 0.8360017: [iso-8859-1] SBCS: 0.8360017 [iso-8859-1]
SBCS 0.8360017: [windows-1252] SBCS: 0.8360017 [windows-1252]
SBCS 0.43422332: [iso-8859-13] SBCS: 0.4342233 [iso-8859-13]
SBCS 0.40545458: [iso-8859-10] SBCS: 0.4054546 [iso-8859-10]
SBCS 0.40545458: [iso-8859-4] SBCS: 0.4054546 [iso-8859-4]
SBCS 0.42485002: [iso-8859-13] SBCS: 0.42485 [iso-8859-13]
SBCS 0.42485002: [iso-8859-10] SBCS: 0.42485 [iso-8859-10]
SBCS 0.42485002: [iso-8859-4] SBCS: 0.42485 [iso-8859-4]
SBCS 0.366608: [iso-8859-1] SBCS: 0.366608 [iso-8859-1]
SBCS 0.366608: [iso-8859-9] SBCS: 0.366608 [iso-8859-9]
SBCS 0.366608: [iso-8859-15] SBCS: 0.366608 [iso-8859-15]
SBCS 0.366608: [windows-1252] SBCS: 0.366608 [windows-1252]
SBCS 0.36032423: [iso-8859-3] SBCS: 0.3603242 [iso-8859-3]
SBCS 0.3647504: [windows-1250] SBCS: 0.3647504 [windows-1250]
SBCS 0.3647504: [iso-8859-2] SBCS: 0.3647504 [iso-8859-2]
SBCS 0.42094523: [MAC-CENTRALEUROPE] SBCS: 0.4209452 [MAC-CENTRALEUROPE]
SBCS 0.40236503: [ibm852] SBCS: 0.402365 [ibm852]
SBCS 0.32631624: [windows-1250] SBCS: 0.3263162 [windows-1250]
SBCS 0.32631624: [iso-8859-2] SBCS: 0.3263162 [iso-8859-2]
SBCS 0.40557358: [MAC-CENTRALEUROPE] SBCS: 0.4055736 [MAC-CENTRALEUROPE]
SBCS 0.36612508: [ibm852] SBCS: 0.3661251 [ibm852]
SBCS 0.35397846: [windows-1250] SBCS: 0.3539785 [windows-1250]
SBCS 0.35397846: [iso-8859-2] SBCS: 0.3539785 [iso-8859-2]
SBCS 0.41416448: [iso-8859-13] SBCS: 0.4141645 [iso-8859-13]
SBCS 0.33398414: [iso-8859-16] SBCS: 0.3339841 [iso-8859-16]
SBCS 0.3964395: [MAC-CENTRALEUROPE] SBCS: 0.3964395 [MAC-CENTRALEUROPE]
SBCS 0.43202174: [ibm852] SBCS: 0.4320217 [ibm852]
SBCS 0.42139196: [iso-8859-1] SBCS: 0.421392 [iso-8859-1]
SBCS 0.42139196: [iso-8859-4] SBCS: 0.421392 [iso-8859-4]
SBCS 0.42139196: [iso-8859-9] SBCS: 0.421392 [iso-8859-9]
SBCS 0.42139196: [iso-8859-13] SBCS: 0.421392 [iso-8859-13]
SBCS 0.42139196: [iso-8859-15] SBCS: 0.421392 [iso-8859-15]
SBCS 0.42139196: [windows-1252] SBCS: 0.421392 [windows-1252]
SBCS 0.42121872: [iso-8859-1] SBCS: 0.4212187 [iso-8859-1]
SBCS 0.42121872: [iso-8859-3] SBCS: 0.4212187 [iso-8859-3]
SBCS 0.42121872: [iso-8859-9] SBCS: 0.4212187 [iso-8859-9]
SBCS 0.42121872: [iso-8859-15] SBCS: 0.4212187 [iso-8859-15]
SBCS 0.42121872: [windows-1252] SBCS: 0.4212187 [windows-1252]
SBCS 0.36684126: [windows-1250] SBCS: 0.3668413 [windows-1250]
SBCS 0.36684126: [iso-8859-2] SBCS: 0.3668413 [iso-8859-2]
SBCS 0.40297794: [iso-8859-13] SBCS: 0.4029779 [iso-8859-13]
SBCS 0.37994418: [iso-8859-16] SBCS: 0.3799442 [iso-8859-16]
SBCS 0.40297794: [MAC-CENTRALEUROPE] SBCS: 0.4029779 [MAC-CENTRALEUROPE]
SBCS 0.4339976: [ibm852] SBCS: 0.4339976 [ibm852]
SBCS 0.42192674: [windows-1252] SBCS: 0.4219267 [windows-1252]
SBCS 0.42192674: [windows-1257] SBCS: 0.4219267 [windows-1257]
SBCS 0.42192674: [iso-8859-4] SBCS: 0.4219267 [iso-8859-4]
SBCS 0.42192674: [iso-8859-13] SBCS: 0.4219267 [iso-8859-13]
SBCS 0.42192674: [iso-8859-15] SBCS: 0.4219267 [iso-8859-15]
SBCS 0.38324198: [iso-8859-1] SBCS: 0.383242 [iso-8859-1]
SBCS 0.38324198: [iso-8859-9] SBCS: 0.383242 [iso-8859-9]
SBCS 0.38324198: [iso-8859-15] SBCS: 0.383242 [iso-8859-15]
SBCS 0.38324198: [windows-1252] SBCS: 0.383242 [windows-1252]
SBCS 0.40346685: [windows-1250] SBCS: 0.4034669 [windows-1250]
SBCS 0.40346685: [iso-8859-2] SBCS: 0.4034669 [iso-8859-2]
SBCS 0.40346685: [iso-8859-16] SBCS: 0.4034669 [iso-8859-16]
SBCS 0.4482638: [ibm852] SBCS: 0.4482638 [ibm852]
SBCS 0.4214702: [windows-1250] SBCS: 0.4214702 [windows-1250]
SBCS 0.4214702: [iso-8859-2] SBCS: 0.4214702 [iso-8859-2]
SBCS 0.4214702: [iso-8859-16] SBCS: 0.4214702 [iso-8859-16]
SBCS 0.4214702: [MAC-CENTRALEUROPE] SBCS: 0.4214702 [MAC-CENTRALEUROPE]
SBCS 0.4533166: [ibm852] SBCS: 0.4533166 [ibm852]
SBCS 0.60846615: [iso-8859-1] SBCS: 0.6084661 [iso-8859-1]
SBCS 0.60846615: [iso-8859-4] SBCS: 0.6084661 [iso-8859-4]
SBCS 0.60846615: [iso-8859-9] SBCS: 0.6084661 [iso-8859-9]
SBCS 0.60846615: [iso-8859-15] SBCS: 0.6084661 [iso-8859-15]
SBCS 0.60846615: [windows-1252] SBCS: 0.6084661 [windows-1252]
SBCS Group found best match [iso-8859-15] confidence 0.8360017.