Open konstantinblaesi opened 6 years ago
@dachev Can you explain why there is such a difference? Is it safe to use CLD on Linux in production?
Same problem. For example, "Черепашка" is detected on Windows, but there is an error on Linux. Is it possible to fix this problem?
@vartemkin Can you try the latest version (2.10.0)? There is a new option called bestEffort that might help.
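A minimal sketch of how the option could be used as a fallback: try a normal detection first and retry with bestEffort only if that fails. The helper name `detectWithFallback` and the `usedBestEffort` field are hypothetical; the `detect` argument is assumed to behave like `cld.detect(text, options)`, returning a promise that rejects when no language is identified. Whether this resolves the Linux failure is not confirmed here.

```javascript
// Hypothetical helper: strict detection first, bestEffort as a fallback.
// `detect` is assumed to behave like cld.detect(text, options).
async function detectWithFallback(detect, text) {
  try {
    const result = await detect(text);
    return { ...result, usedBestEffort: false };
  } catch (strictError) {
    try {
      // bestEffort asks for an approximate answer instead of a failure.
      const result = await detect(text, { bestEffort: true });
      return { ...result, usedBestEffort: true };
    } catch (bestEffortError) {
      // Neither attempt identified a language; return an empty result
      // instead of throwing so callers can handle both cases uniformly.
      return { reliable: false, languages: [], chunks: [], usedBestEffort: true };
    }
  }
}
```

With the real module this would be called as `detectWithFallback((t, o) => require('cld').detect(t, o), 'Мегамикс')`. Keeping the detector as a parameter makes it easy to compare both platforms with the same wrapper.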
Yes, I'm already testing with that version. I modified the C++ code to add the verbose flag:
...
if (input->httpHint.length() > 0) {
  hints.content_language_hint = input->httpHint.c_str();
}
// Always enable CLD2's verbose scoring output for debugging.
int flags = CLD2::kCLDFlagVerbose;
if (input->bestEffort) {
  flags |= CLD2::kCLDFlagBestEffort;
}
// Dump the raw input bytes; %d prints each char as a signed value, with no separator.
printf("\n");
const char * cc = (const char*)input->bytes.c_str();
for (int i=0; i<input->numBytes; i++) printf("%d",cc[i]);
printf("\n");
CLD2::ExtDetectLanguageSummary(...
Windows 10:
$ node 1.js
-48-100-48-75-48-77-48-80-48-68-48-72-48-70-47-127
<br>ScoreOneScriptSpan(Cyrl,18) ' ╨╝╨╡╨│╨░╨╝╨╕╨║╤Б '<br>
Hitbuffer[) <br>DumpHitBuffer[Cyrl, next_base/delta/distinct 2, 0, 0)<br>
Q[0]1,28463,╨╝╨╡╨│ <br>
Q[1]9,2836,╨╝╨╕╨║ <br>
<br>
Linear[) <br>DumpLinearBuffer[3)<br>
[0]1,Q=00000400,╨╝╨╡╨│<br>
[1]1,Q=0704350c,╨╝╨╡╨│<br>
[2]9,Q=07040aa9,╨╝╨╕╨║<br>
[3]18,U=00000000, <br>
<br>
DumpChunkStart[1]<br>
[0]0
[1]3
<br>
<br>ScoreOneChunk[0..3) <br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] ru.9 bg.7 17B 3# Cyrl 36Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
<br>SharpenBoundaries<br>
<br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] ru.9 bg.7 17B 3# Cyrl 36Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
RUSSIAN (ru) (94%)
Ubuntu:
ubuntu@ubuntu-desktop:~/Desktop/test$ node 1.js
-48-100-48-75-48-77-48-80-48-68-48-72-48-70-47-127
<br>ScoreOneScriptSpan(Cyrl,18) ' мегамикс '<br>
Hitbuffer[) <br>DumpHitBuffer[Cyrl, next_base/delta/distinct 2, 0, 0)<br>
Q[0]1,31373,мег <br>
Q[1]9,30711,мик <br>
<br>
Linear[) <br>DumpLinearBuffer[3)<br>
[0]1,Q=00000400,мег<br>
[1]1,Q=3500151b,мег<br>
[2]9,Q=07040aa9,мик<br>
[3]18,U=00000000, <br>
<br>
DumpChunkStart[1]<br>
[0]0
[1]3
<br>
<br>ScoreOneChunk[0..3) <br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] un.7 tg.7 17B 3# Cyrl 0Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
<br>SharpenBoundaries<br>
<br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] un.7 tg.7 17B 3# Cyrl 0Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
{ reliable: false, textBytes: 18, languages: [], chunks: [] }
/home/ubuntu/Desktop/test/node_modules/cld/index.js:77
throw new Error('Failed to identify language');
^
Error: Failed to identify language
at Object.detect (/home/ubuntu/Desktop/test/node_modules/cld/index.js:77:15)
at async main (/home/ubuntu/Desktop/test/1.js:3:16)
Node.js v20.17.0
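Both runs print the same byte dump, so the input handed to CLD2 appears to be identical on the two platforms. As a sanity check, the dump can be decoded back into text. This is a sketch with a hypothetical helper name, `decodeSignedByteDump`. It assumes every byte is 0x80 or above, so each value is negative and the minus sign can act as a separator; a dump containing ASCII bytes would be ambiguous.

```javascript
// The dump printed by both the Windows and the Ubuntu run.
const dump = '-48-100-48-75-48-77-48-80-48-68-48-72-48-70-47-127';

// Hypothetical helper: turn a dump of signed decimal bytes back into text.
// Assumption: all values are negative, so "-" marks the start of each byte.
function decodeSignedByteDump(text) {
  const values = (text.match(/-\d+/g) || []).map(Number);
  // Map signed char values (-128..-1) back to unsigned bytes (128..255).
  const bytes = Buffer.from(values.map((v) => v & 0xff));
  return { hex: bytes.toString('hex'), utf8: bytes.toString('utf8') };
}

const decoded = decodeSignedByteDump(dump);
console.log(decoded.hex);  // d09cd0b5d0b3d0b0d0bcd0b8d0bad181
console.log(decoded.utf8); // Мегамикс
```

The decoded text matches the lowercased span ' мегамикс ' that CLD2 reports in ScoreOneScriptSpan on both systems, which suggests the divergence happens during scoring rather than in the input encoding.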
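@dachev The problem starts at this line of the Ubuntu output, where the chunk is scored as un.7 tg.7 instead of the ru.9 bg.7 seen on Windows: [0] 1 lin[0] un.7 tg.7 17B 3# Cyrl 0Rd 100Rs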
I know that short snippets are likely to fail language detection, but I found it confusing that the snippet was detected as "en" on Windows, but failed on Linux with the error message "Failed to identify language". Do you have any idea why the cld2 behaviour is not consistent across platforms?