aboSamoor / pycld2

Apache License 2.0
165 stars 63 forks source link

Is it ok to pass str to detect(), or just bytes? #18

Closed bsolomon1124 closed 5 years ago

bsolomon1124 commented 5 years ago

From the docstring from detect():

utf8Bytes: The text to detect, encoded as UTF-8 bytes (required). If this is not valid UTF-8, then an cld2.error is raised.

However, in reality, the function also seems to accept str:

>>> cld2.detect("hello there, this is English!")
(True, 29, (('ENGLISH', 'en', 96, 1426.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))

Can you please clarify this ambiguity?

Update: I found this disclaimer in https://github.com/mikemccand/chromium-compact-language-detector:

NOTE: you must pass only valid UTF-8 bytes to the detect function, otherwise you can hit segmentation fault or get incorrect results.

It might be reasonable to do an isinstance() check before allowing str, within detect().


Info:

>>> import sys; sys.version_info
sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
bsolomon1124 commented 5 years ago

Update: here is a partial answer.

In pycldmodule.cc,

detect(PyObject *self, PyObject *args, PyObject *kwArgs) {
  char *bytes;
  int numBytes;
  // ...
  if (!PyArg_ParseTupleAndKeywords(args, kwArgs, "s#|izzzziiiiiiii",
                                   (char **) kwList,
                                   &bytes, &numBytes
                                   // ...
  // ...

As documented here, the use of s# for utf8Bytes allows str or read-only bytes-like object. So, that explains why the Python function itself does not complain when str is passed to it, even though the pycld2 documentation instructs to pass bytes.

If you want to only accept bytes, you would probably want y*. ("This is the recommended way to accept binary data.") But see below as to why that's not actually want you would want here.

Now, to the second part of the question: is it actually okay to pass str? The answer seems to be yes, because

Unicode objects are converted to C strings using 'utf-8' encoding.

In other words, whether a str or bytes is passed to detect(), the resulting variable is still a pointer to a C string and is encoded via utf-8 as CLD expects. (ExtDetectLanguageSummaryCheckUTF8() takes a const char* buffer, which is exactly what it is passed.)

bsolomon1124 commented 5 years ago

To summarize: it could be more clearly documented that detect() accepts either:

Simple example:

>>> import pycld2 as cld2
>>> s = "The recipe calls for precisely ¼ cup of flour"
>>> s_utf8 = s.encode("utf-8")
>>> s_l1 = s.encode("latin-1")
>>> cld2.detect(s, bestEffort=False)
(True, 45, (('ENGLISH', 'en', 97, 1117.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
>>> cld2.detect(s_utf8, bestEffort=False)
(True, 45, (('ENGLISH', 'en', 97, 1117.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
>>> cld2.detect(s_l1, bestEffort=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycld2.error: input contains invalid UTF-8 around byte 31 (of 45)
bsolomon1124 commented 5 years ago

Closed in e543fb6