dachev / node-cld

Language detection for Javascript (Node). Based on the CLD2 (Compact Language Detector) library from Google.
Apache License 2.0

Windows and Linux show different results with short snippets #44

Open konstantinblaesi opened 6 years ago

konstantinblaesi commented 6 years ago

I know that short snippets are likely to fail language detection, but I found it confusing that the snippet

Best practices

was detected as en on Windows, but failed on Linux with the error message "Failed to identify language". Do you have any idea why the CLD2 behaviour is not consistent across platforms?

kibertoad commented 1 year ago

@dachev Can you explain why there is such a difference? Is it safe to use CLD on Linux in prod?

vartemkin commented 1 month ago

Same problem. For example, "Черепашка" is detected on Windows, but there is an error on Linux. Is it possible to fix this problem?

dachev commented 1 month ago

@vartemkin can you try the latest version (2.10.0)? There is a new option called bestEffort that might help.
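
A minimal sketch of how the option might be combined with a guard against the thrown error, so that a short snippet yields null instead of crashing the caller. It assumes the promise-based `detect(text, options)` call and the result shape shown in this thread (`reliable`, `languages`). `detectOrNull` and `fakeDetect` are hypothetical names introduced here. The detector is passed in so the sketch runs without the native module; in a real project it would presumably be `require('cld').detect`.

```javascript
// Sketch only: wraps a cld-style detector, enables bestEffort, and maps the
// "Failed to identify language" error to null. Not part of node-cld itself.
async function detectOrNull(detect, text) {
  try {
    const result = await detect(text, { bestEffort: true });
    const top = result.languages[0];
    // An empty languages array is treated the same as a failed detection.
    return top ? { code: top.code, reliable: result.reliable } : null;
  } catch (err) {
    if (err && err.message === 'Failed to identify language') return null;
    throw err; // anything else is unexpected and should surface
  }
}

// Hypothetical stand-in detector mimicking the two outcomes reported in this
// thread; the returned values are placeholders, not real CLD2 output.
async function fakeDetect(text, options) {
  if (text === 'Best practices') {
    return { reliable: true, languages: [{ name: 'ENGLISH', code: 'en' }], chunks: [] };
  }
  throw new Error('Failed to identify language');
}

detectOrNull(fakeDetect, 'Best practices').then((r) => console.log(r));
detectOrNull(fakeDetect, 'Черепашка').then((r) => console.log(r));
```

Whether bestEffort actually changes the Linux result for a given snippet still has to be checked against the real module; the wrapper only keeps a failed detection from turning into an unhandled exception.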

vartemkin commented 1 month ago

Yes, I'm already testing on that version. I modified the C++ code to add the verbose flag:

...
    if (input->httpHint.length() > 0) {
      hints.content_language_hint = input->httpHint.c_str();
    }
    int flags = CLD2::kCLDFlagVerbose;
    if (input->bestEffort) {
      flags |= CLD2::kCLDFlagBestEffort;
    }

    // Debug: dump the raw input bytes as decimal char values (no separator).
    printf("\n");
    const char * cc = (const char*)input->bytes.c_str();
    for (int i=0; i<input->numBytes; i++) printf("%d",cc[i]);
    printf("\n");

    CLD2::ExtDetectLanguageSummary(...

windows 10:

$ node 1.js

-48-100-48-75-48-77-48-80-48-68-48-72-48-70-47-127
<br>ScoreOneScriptSpan(Cyrl,18) ' ╨╝╨╡╨│╨░╨╝╨╕╨║╤Б '<br>
Hitbuffer[) <br>DumpHitBuffer[Cyrl, next_base/delta/distinct 2, 0, 0)<br>
Q[0]1,28463,╨╝╨╡╨│ <br>
Q[1]9,2836,╨╝╨╕╨║ <br>
<br>
Linear[) <br>DumpLinearBuffer[3)<br>
[0]1,Q=00000400,╨╝╨╡╨│<br>
[1]1,Q=0704350c,╨╝╨╡╨│<br>
[2]9,Q=07040aa9,╨╝╨╕╨║<br>
[3]18,U=00000000,   <br>
<br>
DumpChunkStart[1]<br>
[0]0
[1]3
<br>
<br>ScoreOneChunk[0..3) <br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] ru.9 bg.7 17B 3# Cyrl 36Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
<br>SharpenBoundaries<br>
<br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] ru.9 bg.7 17B 3# Cyrl 36Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
RUSSIAN (ru) (94%)

ubuntu:

ubuntu@ubuntu-desktop:~/Desktop/test$ node 1.js

-48-100-48-75-48-77-48-80-48-68-48-72-48-70-47-127
<br>ScoreOneScriptSpan(Cyrl,18) ' мегамикс '<br>
Hitbuffer[) <br>DumpHitBuffer[Cyrl, next_base/delta/distinct 2, 0, 0)<br>
Q[0]1,31373,мег <br>
Q[1]9,30711,мик <br>
<br>
Linear[) <br>DumpLinearBuffer[3)<br>
[0]1,Q=00000400,мег<br>
[1]1,Q=3500151b,мег<br>
[2]9,Q=07040aa9,мик<br>
[3]18,U=00000000,   <br>
<br>
DumpChunkStart[1]<br>
[0]0
[1]3
<br>
<br>ScoreOneChunk[0..3) <br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] un.7 tg.7 17B 3# Cyrl 0Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
<br>SharpenBoundaries<br>
<br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] un.7 tg.7 17B 3# Cyrl 0Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
{ reliable: false, textBytes: 18, languages: [], chunks: [] }
/home/ubuntu/Desktop/test/node_modules/cld/index.js:77
        throw new Error('Failed to identify language');
              ^

Error: Failed to identify language
    at Object.detect (/home/ubuntu/Desktop/test/node_modules/cld/index.js:77:15)
    at async main (/home/ubuntu/Desktop/test/1.js:3:16)

Node.js v20.17.0

vartemkin commented 1 month ago

@dachev the problem starts at this line in the Linux output: [0] 1 lin[0] un.7 tg.7 17B 3# Cyrl 0Rd 100Rs (on Windows the same row reads ru.9 bg.7 with 36Rd).
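
For anyone comparing the two logs: the byte dump printed at the top of each run is identical on Windows and Ubuntu, while the first quadgram entry already differs (Q[0]1,28463 versus Q[0]1,31373). That suggests the input reaching CLD2 is the same and the divergence arises later, during scoring. A small Node sketch to decode the dump and confirm what was passed in is below. `decodeSignedDump` is a hypothetical helper, and splitting on "-" only works because every byte in this sample is non-ASCII and therefore printed as a negative number.

```javascript
// Decode the debug dump from the modified C++ code: char values printed with
// "%d" and no separator. Every value in this sample is negative, so each "-"
// marks the start of a new byte.
function decodeSignedDump(dump) {
  const values = dump.split('-').filter(Boolean).map((s) => -Number(s));
  // Map signed char values (-128..-1) back to unsigned bytes (128..255).
  const bytes = Uint8Array.from(values, (v) => v & 0xff);
  return Buffer.from(bytes).toString('utf8');
}

const dump = '-48-100-48-75-48-77-48-80-48-68-48-72-48-70-47-127';
const text = decodeSignedDump(dump);
console.log(text, Buffer.byteLength(text, 'utf8'));
```

The dump decodes to "Мегамикс" (16 bytes of UTF-8), which matches the lowercased ' мегамикс ' span shown in both verbose logs.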