Wow! So checking if there is a `'_type_marker'` string attribute is taking a ridiculous amount of time, which is reduced by just using a Python unicode string instead? That seems too good to be true. I'm not sure how this plays out on the C side, but I'd imagine you have to be very careful about managing the references.
Yeah, `PyObject_HasAttrString` first creates a new `str` `PyObject*` by calling `PyUnicode_FromString` under the hood (function linked here). Since we know what the string will be here, we can make the object in advance and reuse it to save a lot of time, especially since this function sees so much traffic. I don't know what the memory implications are, but I can at least confirm there's no obvious leak (no ballooning memory usage) like in my initial attempts.
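Here's a minimal sketch of the idea (module scaffolding heavily simplified; `has_type_marker` is an illustrative helper, not pymongo's actual function):

```c
#include <Python.h>

/* Build the "_type_marker" str once at module init and reuse it, instead of
 * letting PyObject_HasAttrString() rebuild it on every call. */
static PyObject* TYPEMARKERSTR = NULL;

static int has_type_marker(PyObject* object) {
    /* PyObject_HasAttr takes the cached str; no per-call allocation. */
    return PyObject_HasAttr(object, TYPEMARKERSTR);
}

static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT, "_cbson", NULL, -1, NULL,
};

PyMODINIT_FUNC PyInit__cbson(void) {
    TYPEMARKERSTR = PyUnicode_FromString("_type_marker");
    if (TYPEMARKERSTR == NULL) {
        return NULL;
    }
    return PyModule_Create(&moduledef);
}
```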
Oh, interesting, so it has just been creating that new Python string every time instead of reusing it. Yeah, that seems like a very clean fix.
I'm also not really familiar with C, but this makes sense to me. Surprising that it makes such a big difference!
Oops, did not mean to close haha
I'm splitting hairs at this point, but there is another section of frequently used code where they use `snprintf` to convert a counter `int` inside a `for` loop to a string. If they just cached these somehow, there's another 30-40% performance improvement to be had. The following code illustrates what I mean. Of course, we don't know what `items` is until runtime and it could be greater than `size`, so a more sophisticated caching mechanism is needed.
```c
#define SIZE 400000  // a compile-time constant, since a static array needs a constant size

static char str_cache[SIZE][16];

static int function(int items) {
    int i;
    for (i = 0; i < items; i++) {
        // replace the following:
        // char name[16];
        // snprintf(name, sizeof(name), "%d", i);
        char *name = str_cache[i];
        ... // do stuff with name
    }
}

int main() {
    int i;
    for (i = 0; i < SIZE; i++) {
        snprintf(str_cache[i], sizeof(str_cache[i]), "%d", i);
    }
    // call function() a lot here
}
```
Great detective work!
Using the Python C API requires baby-sitting many subtle requirements. It's nice when we can let Cython generate that code or at least generate code to crib from.
AFAIK, the `TYPEMARKERSTR` is OK. It's just one `str` instance that can leak if the `cbson` module can/does get unloaded. The proper fix might be to keep it in the `cbson` module object rather than a static variable, but the existing `_type_marker()` function doesn't have a reference to the module.

Note: `PyObject_HasAttrString()` has a fast path that calls the C method `(*Py_TYPE(v)->tp_getattr)(v, (char*)name)` without having to convert the name to a Python `str`. See `_PyObject_LookupAttr()`.

Thus an alternative and potentially more comprehensive optimization would be to make the BSON extension types implement `tp_getattr`. (Caveat: `tp_getattr` is deprecated and apparently no longer implemented by the built-in Python classes. That doesn't matter if `_type_marker(PyObject* object)` is nearly always called on the BSON extension classes.) -- Scratch that idea. The BSON classes are ordinary Python classes, not C extension classes.
So it's totally worth reporting this PyMongo issue.
Maybe also mention that further down the `_type_marker()` function is some pointless code:

```c
/*
 * Py(Long|Int)_AsLong returns -1 for error but -1 is a valid value
 * so we call PyErr_Occurred to differentiate.
 */
if (type == -1 && PyErr_Occurred()) {
    return -1; /* [return -1 instead of -1] */
}
...
return type;
```
Thanks Jerry! I've just reported the issue, and it can be viewed at this link. What do you think about the other optimization? The performance gain is slimmer there, and it comes at the cost of higher memory usage.
Nice!
The other optimization idea has good potential.
One idea is to change `snprintf(... "%d" ...)` to `itoa(i)` so it doesn't have to interpret a format string every time. It turns out that `itoa` is not in the C standard, so they're better off copying a well-tested implementation.

If the same integer values repeat often enough, a cache is a good approach. E.g. memoization, where each cache "entry" has a (number, string) pair. To convert the value `i`, it'd reuse or else update cache entry `i % n` (for a cache of `n` entries).
FYI the C language tries to pretend that an array is equivalent to a pointer, except when it isn't such as inside another array or a struct. So I'd probably define the string buffers as an array of structs, where each struct holds a char array. The input numbers could go in those structs or in a parallel array.
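For illustration, such structs might look like this (my names and sizes; just a sketch of the direct-mapped cache idea):

```c
#include <stdio.h>

#define CACHE_SIZE 4096

/* Each entry pairs an input number with its decimal string. */
struct cached_str {
    long number;
    char text[24];
};

static struct cached_str cache[CACHE_SIZE];

static const char* int_to_str_cached(long i) {
    struct cached_str* entry = &cache[(unsigned long)i % CACHE_SIZE];
    /* The text[0] check handles the all-zero initial state, where entry 0
     * would otherwise falsely claim to already cache the value 0. */
    if (entry->number != i || entry->text[0] == '\0') {
        entry->number = i;
        snprintf(entry->text, sizeof(entry->text), "%ld", i);
    }
    return entry->text;
}
```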
The `itoa` implementation you linked worked great! For a 2000-second simulation, my pre-built array solution took 244 seconds, PyMongo's current `INT2STRING` macro took 276 seconds, and your faster `itoa` took 250 seconds. My profiling shows that the `insert_one` call is over 2x faster with the optimized `itoa`.

I made a separate PyMongo issue for this that can be viewed here. Thanks again for all the input guys!
Nicely done on the issue reports!
They have lots of open issues and might process a PR much sooner.
In the meantime we could fork the mongo-python-driver repo and put it on PyPI. Or monkey-patch the library to load a different `bson._cbson` extension.
Note: The `bson` module implements a `SON` class which seems to do a lot of work just to provide a dict that keeps its keys in insertion order. As of Python 3.7 -- the oldest supported version of Python -- the built-in `dict` class keeps its keys in insertion order.

I don't know how much impact this has on vivarium-ecoli, but the class can't compete with the built-in `dict` in speed and size. Besides running a bunch of interpreted Python code, `has_key()` is O(n) rather than O(1) since it has to scan a list, and `to_dict()` has to recursively transform nested elements.
You could monkey-patch the class to measure the performance impact:
```python
import bson
bson.SON = dict  # do this before any other module imports bson
```

```pycon
>>> import bson
>>> bson.SON()
SON([])
>>> bson.SON = dict  # verify the monkey-patch
>>> bson.SON()
{}
```
Thanks for the monkey-patch suggestion. In my testing, this results in an additional 15% improvement in PyMongo `insert_one` performance! Also, I didn't realize I could make PRs directly on the PyMongo repo until you pointed it out. Those are now up: https://github.com/mongodb/mongo-python-driver/pull/1221 https://github.com/mongodb/mongo-python-driver/pull/1219
Awesome!
A PR for the `SON` speedup could replace the contents of `son.py` with an updated docstring and `SON = dict`. The `setup.py` file for this pip already specifies `python_requires=">=3.7"`.
Learned something interesting while running their unit tests with `SON = dict`. Two dictionaries that contain the same data but in a different order are considered equal using `==`. This breaks one of their tests and might be an important feature of their `SON` implementation.
We can fix that. (Tests are good.)
```python
from __future__ import annotations  # allows the "-> SON" annotations below

from typing import Any, Iterator


class SON(dict):
    """SON data.

    A subclass of dict that provides a few extra niceties for dealing with SON
    and some Python 2 dict methods. SON provides an API similar to
    collections.OrderedDict, maintaining keys in insertion-order, which in
    Python 3.7+ is provided by dict.
    """

    def copy(self) -> SON:
        return SON(self)

    def has_key(self, key: Any) -> bool:
        return key in self

    def iterkeys(self) -> Iterator:
        return iter(self.keys())

    def itervalues(self) -> Iterator:
        return iter(self.values())

    def __eq__(self, other: Any) -> bool:
        """Comparison to another SON is insertion-order-sensitive while
        comparison to another mapping is order-insensitive.
        """
        if isinstance(other, SON):
            return list(self.items()) == list(other.items())
        return super().__eq__(other)

    def __ne__(self, other: Any) -> bool:
        return not self == other

    def to_dict(self) -> dict:
        return dict(self)
```
- This keeps `SON` semantics including `__eq__()` and the obsolete Python 2 dict methods like `has_key()`.
- `to_dict()` is way faster by not recursively copying/transforming its contents since this `SON` and `dict` are more substitutable, but we should find out if they rely on that behavior.
- It leaves out the `__deepcopy__` method since there's no such Python "dunder" special method. They probably just meant to make it a private method, which should be spelled `_deepcopy`.
- We could add `_Key = TypeVar("_Key")` and such for static typing, but I don't expect it'd catch any problems.

I've gotten feedback on both of my PRs. It looks like there are licensing issues surrounding the MySQL code I adapted. Not sure how to proceed.
I also got great advice on the `_type_marker` PR about adding my new `PyObject` to the `module_state`, where it would be automatically freed by the `_cbson_clear` function on teardown. However, this proved more challenging to implement than I anticipated. Can you look over my code on that PR @1fish2?

Now that I've rerun my sims a couple of times, I think our sim runtimes with or without monkey-patching the `SON` class are within the margin of error. Implementing it in PyMongo is also tricky because, to support Python versions before 3.9, I'd need to add `from __future__ import annotations` at the top of every file which uses `SON` as a subscriptable type hint to enable PEP 563. You were right to suspect that the recursive nature of `to_dict` may be important. There's a test for that which the new `SON` fails.
This is very challenging.
Looking into the `_type_marker` PR: this doc aims to explain module state. We might need to find clearer documentation or test if a sample Cython extension generates C code that can handle module state. I read stuff on subinterpreters and some of PEPs 489, 554, 573, and 3121. It's a tarpit that the Python maintainers are striving to fix.
I'll mention some of the issues, then some alternative ideas.
Problems:
- `PyModule_GetState(PyObject *module)` has minimal documentation, but the argument name and the source code imply it takes a module object, not an instance of a class in that module, even if that value is sometimes named `self`. See `PyState_FindModule` below for a workaround.
- Even if `GETSTATE(self)` works on an instance of a class in that module, these `_cbsonmodule.c` functions use the `_type_marker` attribute to test if the given object is one of its own instances. Until a function successfully discovers that a given object is a cbsonmodule instance, the `GETSTATE(self)` could point to some other module, so we can't use it.
- `_cbson_clear()` is part of GC cycle detection. The notion that it's called to deallocate module state (and that `Py_CLEAR()` doesn't deallocate) was a bum steer. Rather, the `PyModuleDef` `m_free` slot is for deallocating module state. We could add an `m_free` function and set that slot in the `_cbsonmodule.c` module.
- The module doesn't set `PyModuleDef.m_slots`, which means it doesn't do multi-phase init, which I think implies it's not "expected to support sub-interpreters" and doesn't get some of the newer run-time support in the above-mentioned PEPs to access module state. If it doesn't support sub-interpreters, storing the `str` object in a C static variable should be fine.

Alternative ideas:
- `_type_marker()` could create the `str` once per call, pass it to both `PyObject_HasAttr()` and `PyObject_GetAttr()`, then call `DECREF`. No module state tarpit. (See the sketch after this list.)
- `PyUnicode_FromString()` is counting up the string's chars then decoding variable-byte UTF-8 characters. We can instead call `PyUnicode_DecodeASCII(TYPE_MARKER_NAME, TYPE_MARKER_LEN, NULL)` where we've statically computed the string length (12):

  ```c
  static const char TYPE_MARKER_NAME[] = "_type_marker";
  static const size_t TYPE_MARKER_LEN = sizeof(TYPE_MARKER_NAME) - 1;  /* 12 */
  ```

- `_type_marker()` can find the module by calling `PyState_FindModule()` with the address of the module's `PyModuleDef`. It can then call `GETSTATE(module)` to access module state. This will be slower than fetching a C static variable, enough slower that it might not be faster than the two no-state speedups above.
- We could store the name on the module itself rather than in `module_state` by calling one of the support functions like `PyModule_AddStringConstant(module, "_type_marker", "_type_marker")`. But I don't see how this saves time since fetching the constant seems to require calling `PyDict_GetItemString(PyModule_GetDict(module), "_type_marker")`.
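A sketch of that first idea, loosely modeled on the existing `_type_marker()` (abridged):

```c
/* Create the str once per call, use it for both lookups, then release it.
 * No static variable, no module state. */
static long _type_marker(PyObject* object) {
    long type = 0;
    PyObject* name = PyUnicode_FromString("_type_marker");
    if (name == NULL) {
        return -1;
    }
    if (PyObject_HasAttr(object, name)) {
        PyObject* value = PyObject_GetAttr(object, name);
        if (value == NULL) {
            Py_DECREF(name);
            return -1;
        }
        type = PyLong_AsLong(value);
        Py_DECREF(value);
    }
    Py_DECREF(name);
    return type;
}
```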
I see that you wrote a replacement for the MySQL code to avoid the licensing issues. This is terrific, especially that you included a unit test that checks boundary cases! Your implementation mostly looks good to me, but there are details and cautions. C is tricky!
- `extern void itoa(long int num, char* str)`: make it `extern`, not `static`. `static` means private to each compilation unit (compiled .o file); `extern` means linked between compilation units. The .c file that implements an extern function must include the .h file which defines its function prototype, otherwise C cannot do the type checking.
- Don't compare against MySQL's `int10_to_str` if they're concerned about the license. Testing results against `snprintf` is good.
- `-num` for sure risks UB. `-(unsigned long)num` might work, but I see that the MySQL code does what we can write more concisely as `0UL - (unsigned long)num`. (`unsigned long` is the same as `unsigned long int`.) See the sketch after this list.
- Write `absNum = (unsigned long)num;` not `absNum = num;` to avoid warnings from some compilers.
- Whether `char` is signed or unsigned depends on the compiler. This code might be OK there, but I'd explicitly cast `(char)(digit + '0')` to at least avoid warnings from some compilers.
- Beware `digit = absNum % 10` when `digit` is an `int` and `absNum` is an `unsigned long`, both because of the signed/unsigned difference and the size difference.
- Document the requirements on `itoa`'s `str` argument memory.
- Does the buffer size also appear in `test_int2str.c`? Use a `const` or a macro for that size everywhere to ensure they stay consistent.
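To make the unsigned-negation point concrete, here's a sketch (my own illustration, not the PR's code; the caller must supply a large-enough buffer):

```c
/* Convert a long to decimal without negating a signed value: negating
 * LONG_MIN as a signed long is UB, so compute the magnitude unsigned. */
static void long_to_str_sketch(long num, char* str) {
    unsigned long absNum = (unsigned long)num;
    char* p = str;

    if (num < 0) {
        *p++ = '-';
        absNum = 0UL - absNum;  /* well-defined even for LONG_MIN */
    }

    /* Emit digits in reverse order, then reverse them in place. */
    char* digits = p;
    do {
        *p++ = (char)(absNum % 10UL + '0');
        absNum /= 10UL;
    } while (absNum != 0UL);
    *p = '\0';

    for (char* q = p - 1; digits < q; digits++, q--) {
        char tmp = *digits;
        *digits = *q;
        *q = tmp;
    }
}
```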
Since monkey-patching the `SON` class doesn't make a significant speedup, I totally agree about dropping it.
FYI `from __future__ import annotations` is still on hold in Python 3.11. It might never become the future.

Using a string literal as a type declaration avoids that, makes the type declaration a single object (thus saving load time like "future" annotations were supposed to do), handles forward references (as in a method referring to its own class, `"SON[_Key, _Value]"`), and avoids needing to import definitions just for types.
On second thought, it really should protect from buffer overrun. Apparently the `INT2STRING` macro was always called with a character array, so it can use `sizeof` to get the buffer size to do this. And it should maintain the existing interface to the degree possible, not that the macro stated its interface. `snprintf()` would return an int, but the platform-specific macro definitions could differ on that, and it's not needed if the void function compiles everywhere that pymongo calls it.

`itoa()` is a common built-in function and it doesn't take a buffer size, so the new function should not clash with that name. You might pick something other than what I wrote below.
So in the .h file:

```c
#define INT2STRING(buffer, i) long_to_decimal((buffer), sizeof((buffer)), (i))

/* Converts an integer to its string representation in decimal notation. */
extern void long_to_decimal(char* buffer, size_t size, long num);
```

and the .c file:

```c
#include "_cbsonmodule.h"

extern void long_to_decimal(char* buffer, size_t size, long num) {
    assert(size >= 20); // or some other way to fail? snprintf() does something more whacky
    ... // the rest as it was,
    ... // but do the type cast(s) as needed for buffer possibly containing signed chars, e.g.
    unsigned char* str = (unsigned char*)buffer;
    // ...
}
```
The test function shouldn't need to clear the buffers since this function shouldn't care what's in the output buffer going in. If I'm wrong about that, `memset(str_1, 0, sizeof(str_1))` is safer than `memset(str_1, 0, strlen(str_1))` since `strlen` will scan for a terminating `'\0'` char, and if it goes past the end of the buffer, `memset()` will cause havoc.
My `_type_marker` PR just got merged in. Regarding the concerns you raised, both `_cbson` and `_cmessage` call `PyModule_Create` with a `PyModuleDef` (here and here) in their `PyInit_{name}` functions. In its `PyModuleDef` struct, `_cbson_clear` is placed in the `m_clear` slot. Is this not good enough to ensure that no memory is leaked? I don't quite understand the difference between `m_clear` and `m_free`. If I'm understanding the documentation correctly, `Py_CLEAR` does not deallocate, but it should decrement the reference count of its argument and set it to `NULL`, at which point the Python GC can work its magic. For functions that are registered in the method table (`_CBSONMethods` and `_CMessageMethods`), `METH_VARARGS` ensures that the first parameter `self` for module functions is the module object. I think all functions outside the method table that expect a module object as the first argument are only called by functions that are included in the table and can pass on their reference to the module.
The tricky part for me was figuring out `PyArg_ParseTuple` format strings (`O&` for object conversion is a cool idea, but my final PR had to scrap it so `convert_codec_options` could accept a module object as its first argument) and figuring out that the `_cmessage` module (which calls `convert_codec_options` in `_cbson` and must therefore have a reference to the `_cbson` module) already has a reference to the `_cbson` module in its own module state.
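For anyone unfamiliar with `O&` (I was): the converter is a C function that `PyArg_ParseTuple` calls with the argument object and a destination pointer. A generic sketch with hypothetical names (pymongo's real `convert_codec_options` differs, since it now takes the module as well):

```c
/* O& converter: returns 1 on success; 0 with an exception set on failure. */
static int convert_options_sketch(PyObject* obj, void* result) {
    if (obj == Py_None) {
        PyErr_SetString(PyExc_TypeError, "options must not be None");
        return 0;
    }
    *(PyObject**)result = obj;  /* stores a borrowed reference */
    return 1;
}

/* Usage inside a METH_VARARGS function:
 *   PyObject* dict;
 *   PyObject* options;
 *   if (!PyArg_ParseTuple(args, "OO&", &dict, convert_options_sketch, &options))
 *       return NULL;
 */
```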
Thanks for the `INT2STRING` suggestions! I'll make those changes right away. The more I learn about C, the more I appreciate how good I've had it all these years with Python. The maintainers also want me to include a unit test in Python for my code. Do you know how to do that? I had intended for my `test_int2str.c` script to be temporary, just to convince them that my code works, because I don't know how to integrate it into their preexisting testing harness.
Now that I've done some more reading, here's my understanding of the difference between `m_clear` and `m_free`. `m_clear` is only called when the GC detects a reference cycle, and its purpose is to break these cycles by decrementing reference counts and setting pointers to `NULL`. As noted in the documentation, Python strings are contained objects that are never part of a reference cycle, so the line that I added to `_cbson_clear` was unnecessary.

All objects of type `PyModule_Type` perform deallocation by calling this general `module_dealloc` destructor. If some special behavior is necessary to deallocate memory, it can be included in the `m_free` function, which is called here in that destructor. The destructor then frees the memory allocated for the module state. This part confuses me because the `_cbson` and `_cmessage` module states contain `PyObject*` pointers. If no reference cycle is detected and `m_clear` is not called, does that mean that these objects never get their reference counts decremented and their underlying memory will never be freed, even though their `PyObject*` pointer has been deallocated? If so, would it fix the code to also register `_cbson_clear` (which calls `Py_CLEAR` for every member of the `module_state` struct) as the `m_free` function?

This is old documentation, but it's interesting that in this sample code no `m_free` is defined. The sample code in PEP 3121 does the same thing, with a comment that `m_free` is not needed since everything is done in `m_clear`.
Congrats on the code merge!
- About the `LLONG_MIN` comment: beware that the size of C integer types can vary between compilers and target CPUs. The argument is typed `long` (typically 32 bits), so I assume that's correct vs. the `LLONG_MIN` comment, which would imply `long long` (typically 64 bits). So:

  ```c
  assert(size >= 12 && sizeof(num) <= 4);
  ```

- `m_traverse` is called for a GC traversal, which sounds like it starts a mark-and-sweep pass to detect cycles, and `m_clear` would clean up after the pass. I hadn't found this tp_traverse doc that you referenced.
- On `m_free`, from the `m_traverse` doc: Python doesn't always call `m_clear` before deallocating the extension, in particular when the ref count goes to 0 (no cycles). That means the extension type should have an `m_free` function, which could just call `m_clear`. Freeing/clearing the str object could then be in either one of those two functions. However, the existing pymongo doesn't do it that way, nor do the examples you pointed to, so maybe the docs are wrong. I didn't manage to find a trustworthy example.
- The `'\0'` character is called `NUL` in ASCII, as distinct from a `null` pointer. You can also write `0` or `(char)0`.

On testing `itoa`:

- Test boundary values like `LONG_MIN`, `LONG_MAX`, `LONG_MIN + 1`, and `LONG_MAX - 1`. It could prefill the buffer with no zero bytes (e.g. `memset(buffer, 1, sizeof(buffer))`) to ensure `itoa()` puts in the NUL byte. Compare all the string results against `snprintf` or Python `str(i)`.
- Can the unit test compile and run the C test via `subprocess.run()`? It could detect success/failure via a process return code. See `wholecell/utils/filepath.py` in wcEcoli for `subprocess.run()` examples.
- Alternatively, add a C function `_cbson_itoa(PyObject* self, PyObject* args)` by adding another `PyMethodDef` to `_CBSONMethods`, much like the linkage for `_cbson_dict_to_bson()`. It looks like it can use the simpler `METH_O` calling convention, which takes the `self` parameter and 1 argument. `_cbson_itoa()` would call `itoa()` to convert a given number into a stack-allocated string buffer, then convert the buffer to a Python `str` object, and return that (see the sketch after this list).
- Then test_bson could do all the rest of the unit test work by calling `_cbson_itoa()` and checking the result against `str(i)`.
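A sketch of that glue function, assuming the `itoa(long, char*)` signature discussed above:

```c
/* Expose itoa() to Python unit tests via the METH_O calling convention. */
static PyObject* _cbson_itoa(PyObject* self, PyObject* arg) {
    char buffer[32];  /* plenty for a 64-bit decimal, sign, and NUL */
    long num = PyLong_AsLong(arg);
    if (num == -1 && PyErr_Occurred()) {
        return NULL;
    }
    itoa(num, buffer);
    return PyUnicode_FromString(buffer);
}

/* Added to the method table alongside the existing entries:
 *   {"_itoa", _cbson_itoa, METH_O, "Convert an int to str via itoa."},
 */
```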
It took a lot of trial and error, but I finally managed to get Python to use the `m_free` function reliably upon interpreter shutdown (confirmed by `printf` statements in the relevant C functions). Here are my takeaways:

- For most of my attempts, the `_cbson` reference count would not hit zero, and the `m_free`/`m_clear` functions were never called.
- `m_clear` is still murky to me after all this. If you import the `bson` module by itself, `m_clear` is called during shutdown before `m_free`. However, if you import `pymongo`, the `m_free` for `_cmessage` is immediately followed by the `m_free` for `_cbson`, with neither `m_clear` function being called.
- Modules essentially never get unloaded before interpreter shutdown (even after calling `del` on every user-accessible reference, the internal reference count is still non-zero). Unless someone is writing crazy bootleg C extensions that themselves import PyMongo, the `m_free`/`m_clear` functions only seem to matter at shutdown. Very interesting exploration, but I don't think it's worth submitting a PR for.
Just updated my `int2str` PR branch. Here's a summary of the key changes I made:

- `long` vs. `long long`: I think objects of type `Py_ssize_t` (the input type used in PyMongo) can be as large as 64 bits because they seem to be a signed version of `size_t`, which tops out at `unsigned long long` on certain systems. Thus, I changed the input into a `long long` and every temporary integral value into an `unsigned long long`.
- I added an assert to the `long_to_str` function to ensure that the buffer has a size of at least 21, which is just enough space to store the 19 digits of `LLONG_MIN`, a negative sign, and the null terminator.
- I tried compiling and running the C test via `subprocess.run()` but can't seem to make the required compilation step robust. It currently works on my local machine but not on GitHub Actions, where I get errors about `python-config` not existing. Do you have any ideas on how I could make this work?

The `long long` arg type sounds good. Then the function name should be `long_long_to_str()` or `decimal_to_str()` or something.

The `python-config` pip could be just the first of many needed pips. I suggest the other route: add another `PyMethodDef` to `_CBSONMethods` to call the number-to-string function and put most of the unit test code in `test_bson.py`. The Mongo folks are more likely to be happy with that since it's much easier to maintain and experiment with the test.
(A good rule with `try/except` is to almost never use a bare `except:` since it will catch cases like `GeneratorExit`, `KeyboardInterrupt`, and `SystemExit`. Generally, catch specific exception types. Catch the broad class `Exception` if really needed, but print or log the traceback so it doesn't bury surprises.)

That test-runner code could also move into `main()`, since it's not a unit test that has to run on every environment and every release to check that things are still working.

On the `m_free` code: given your experimental results and the incomplete docs, my take on the semantics is:

- `m_clear` should clear refs that might be part of a cycle.
- `m_free` should free all refs. It can call `m_clear` to do most of the work, as in the sketch below.
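A sketch of that division of labor (illustrative names; assumes the module state holds the cached str):

```c
struct module_state {
    PyObject* type_marker_str;  /* the cached "_type_marker" str */
};
#define GETSTATE(m) ((struct module_state*)PyModule_GetState(m))

/* m_clear: drop refs that might participate in a reference cycle. */
static int _cbson_clear(PyObject* module) {
    Py_CLEAR(GETSTATE(module)->type_marker_str);
    return 0;
}

/* m_free: release everything; here m_clear already does all the work. */
static void _cbson_free(void* module) {
    _cbson_clear((PyObject*)module);
}
```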
Great idea. Thanks so much for all your help over this past week! I've just put up what will hopefully be the last commit on this final PR. Closing unless something else crops up.
Cool!
BTW you'll need to update the `requirements.txt` file when the new pymongo release gets to PyPI.
```c
extern void long_long_to_str(long long num, char* str) {
    // Buffer should fit a 64-bit signed integer
    assert(sizeof(str) > 20);
    ...
```

^^^ Alas, that assertion checks the size of a pointer, which will be 4 or 8 bytes.

This should work:

```c
extern void long_long_to_str(long long num, char* str, size_t size);
#define INT2STRING(buffer, i) long_long_to_str((i), (buffer), sizeof(buffer))
```

(Or name the macro differently if the test still needs `INT2STRING()`.)

```c
extern void long_long_to_str(long long num, char* str, size_t size) {
    // Buffer should fit a 64-bit signed integer
    assert(size > 20);
    ...
```
Oops, good catch. Looks like `assert` doesn't actually cause my test to fail. I've updated the test with better exception handling, and it appears to work when I try to break it (e.g. give it too small a buffer or forcibly change one of the strings to make them unequal).
Random flex: Internal testing by the PyMongo folks confirmed that my `_type_marker` patch does, in fact, yield the encoding/insert performance gains I observed in practice.

> Wow this change improves the performance of bson encoding by 130% according to our TestDeepEncoding benchmark and improves the overall performance inserting large documents (TestLargeDocInsertOne/TestLargeDocBulkInsert) by 75%
🥇
BTW is there any performance difference between `absNum % 10ULL`, `absNum % 10UL`, and `absNum % 10U`? The machine code might vary between those. Could try using a tool like `objdump` to view the object code.
Amazing @thalassemia!
> BTW is there any performance difference between `absNum % 10ULL`, `absNum % 10UL`, and `absNum % 10U`? The machine code might vary between those. Could try using a tool like `objdump` to view the object code.
This is late, but I finally got around to this. On my lab desktop (high-end Intel chip with GCC 11.3.0), the machine code does not change at all regardless of whether I use U, UL, or ULL.
My other PR just got merged, and I'm kinda hoping they'll have more concrete performance numbers to share for that one as well.
Seems the `INT2STRING` inefficiency really only matters for large array fields, which our model emits a lot of. Still makes a huge difference in that specific case, which is great to see.
Fantastic!
@prismofeverything mentioned rethinking how Vivarium reads & writes numpy arrays to Mongo. Could it let `ndarray.tobytes()` convert an array to a `bytes` object in one native method call, write that blob to Mongo, and use `numpy.frombuffer()` to convert the blob back to an array? (It might need to add the array shape.)
If other Vivarium processes need to read portions of a large array, the writer could potentially partition it into smaller arrays to write or readers could read the entire array and get a view onto a portion of it or extract a range of bytes and convert that to an array.
That's an interesting idea. It would certainly improve serialization performance, but I'm not sure if it's worth the additional complexity. I also like the fact that our current model outputs are, barring the need to extract certain metadata from `simData.pkl` for interpretation, entirely language agnostic. Could be useful in case Julia ever does become a true competitor to Python.
Yes, that's a wild idea and I'm unsure about its tradeoffs.
But I'm betting on Mojo, not Julia.
Mojo is only ≈v0.1 so far, but it has very smart goals and foundations. It will be a superset of Python, part of the Python ecosystem, so it can run existing Python code and libraries. With some added language features, it can compile to fast machine code using MLIR (Multi-Level Intermediate Representation) to target the ever-growing range of hardware accelerators (vectorization, multi-core, GPU, TPU), potentially even faster than C and Rust code. When ready, moving from Python to Mojo will be easier than Python 2 to 3.
Chris Lattner is a major force behind LLVM/Clang, Swift (to enable incrementally replacing and modernizing Objective C), MLIR, and Mojo.
Oh yeah, I recently watched a funny video about Mojo and found it super promising as well! I'm excited to see where Lattner and his team take it in the coming years.
After my Numpy optimizations, we spend about 35% of our simulation runtime purely in the `insert_one` function of PyMongo. Using line-by-line profiling of C extensions in py-spy, I discovered that the majority of that time was spent on this line, which is called for every value in every document to be written to the database.

I never formally learned C or Python's C API, but after some trial and error, I made the pictured tweaks to the `_type_marker` function and added a single line to the `PyInit__cbson` function: `TYPEMARKERSTR = PyUnicode_FromString("_type_marker");`.

With these changes alone, the time spent in the `insert_one` function decreased by nearly 60%, to about 15% of the total simulation runtime. I was wondering if there are any significant downsides to these changes. If not, I could try submitting an issue to the PyMongo JIRA and see what happens.