Closed bsolomon1124 closed 5 years ago
Update: here is a partial answer.
In pycldmodule.cc:
detect(PyObject *self, PyObject *args, PyObject *kwArgs) {
char *bytes;
int numBytes;
// ...
if (!PyArg_ParseTupleAndKeywords(args, kwArgs, "s#|izzzziiiiiiii",
(char **) kwList,
&bytes, &numBytes
// ...
// ...
As documented here, the use of s# for utf8Bytes allows a str or a read-only bytes-like object. That explains why the Python function itself does not complain when a str is passed to it, even though the pycld2 documentation says to pass bytes.
If you wanted to accept only bytes, you would probably want y*. ("This is the recommended way to accept binary data.") But see below for why that is not actually what you would want here.
Now, to the second part of the question: is it actually okay to pass str? The answer seems to be yes, because
Unicode objects are converted to C strings using 'utf-8' encoding.
In other words, whether a str or bytes is passed to detect(), the resulting variable is still a pointer to a C string. For a str, that string is always UTF-8, as CLD expects; for bytes, it is whatever the caller encoded, so it must already be UTF-8. (ExtDetectLanguageSummaryCheckUTF8() takes a const char* buffer, which is exactly what it is passed.)
To summarize: it could be more clearly documented that detect() accepts either:
- str
- bytes that have been encoded with utf-8

Simple example:
>>> import pycld2 as cld2
>>> s = "The recipe calls for precisely ¼ cup of flour"
>>> s_utf8 = s.encode("utf-8")
>>> s_l1 = s.encode("latin-1")
>>> cld2.detect(s, bestEffort=False)
(True, 45, (('ENGLISH', 'en', 97, 1117.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
>>> cld2.detect(s_utf8, bestEffort=False)
(True, 45, (('ENGLISH', 'en', 97, 1117.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
>>> cld2.detect(s_l1, bestEffort=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycld2.error: input contains invalid UTF-8 around byte 31 (of 45)
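The byte offset in that error can be reproduced without pycld2. The sketch below uses only the standard library and assumes that CLD2's UTF-8 check rejects the same byte that Python's own decoder rejects; that is consistent with the traceback above but is not taken from the CLD2 source.

```python
s = "The recipe calls for precisely ¼ cup of flour"

s_utf8 = s.encode("utf-8")    # "¼" (U+00BC) becomes the two bytes C2 BC
s_l1 = s.encode("latin-1")    # "¼" becomes the single byte BC

# 45 characters, 46 UTF-8 bytes, 45 Latin-1 bytes
print(len(s), len(s_utf8), len(s_l1))

try:
    s_l1.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte that is not valid UTF-8
    bad_offset = exc.start
    print(bad_offset, hex(s_l1[bad_offset]))
```

A lone 0xBC is a UTF-8 continuation byte with no lead byte before it, which is why the Latin-1 encoding of the same text is rejected while the UTF-8 encoding is accepted.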
Closed in e543fb6
From the docstring of detect():
However, in reality, the function also seems to accept str:
Can you please clarify this ambiguity?
Update: I found this disclaimer in https://github.com/mikemccand/chromium-compact-language-detector:
It might be reasonable to do an isinstance() check before allowing str, within detect().
Info:
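For illustration, the isinstance() check suggested above could be sketched on the caller's side as follows. The helper name to_utf8_bytes is hypothetical and is not part of pycld2; it simply normalizes str or bytes input to UTF-8 bytes and fails early on anything else.

```python
def to_utf8_bytes(text):
    """Return UTF-8 bytes for str or bytes input (hypothetical helper)."""
    if isinstance(text, str):
        # Same encoding the s# format unit applies to str arguments.
        return text.encode("utf-8")
    if isinstance(text, (bytes, bytearray)):
        data = bytes(text)
        # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
        data.decode("utf-8")
        return data
    raise TypeError(f"expected str or bytes, got {type(text).__name__}")
```

A caller could then write cld2.detect(to_utf8_bytes(s)), so that badly encoded input fails with a standard Python exception before it reaches the extension module.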