CovertLab / vEcoli

Whole cell model of E. coli implemented with Vivarium
https://covertlab.github.io/vEcoli/
MIT License

Inefficiencies in PyMongo Driver #195

Closed thalassemia closed 1 year ago

thalassemia commented 1 year ago

After my Numpy optimizations, we spend about 35% of our simulation runtime purely in the insert_one function of PyMongo. Using line-by-line profiling of C extensions in py-spy, I discovered that the majority of that time was spent on this line, which is called for every value in every document to be written to the database.

I never formally learned C or Python's C API, but after some trial and error, I made the pictured tweaks to the _type_marker function and added a single line to the PyInit__cbson function: TYPEMARKERSTR = PyUnicode_FromString("_type_marker"); (screenshot attached)

With these changes alone, the time spent in the insert_one function decreased by nearly 60% to about 15% of the total simulation runtime. I was wondering if there are any significant downsides to these changes. If not, I could try submitting an issue to the PyMongo JIRA and see what happens.

eagmon commented 1 year ago

Wow! So checking if there is a '_type_marker' string attribute is taking a ridiculous amount of time, which is reduced by just using a Python unicode string instead? That seems too good to be true. I'm not sure how this influences the C, but I'd imagine you have to be very careful about managing the references.

thalassemia commented 1 year ago

Yeah, PyObject_HasAttrString first creates a new str PyObject* by calling PyUnicode_FromString under the hood (function linked here). Since we know what the string will be here, we can make the object in advance and reuse it to save a lot of time, especially since this function sees so much traffic. I don't know what the memory implications are but I can at least confirm there's no obvious leak (no ballooning memory usage) like in my initial attempts.
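To make the idea concrete, here is a minimal sketch of the kind of change being described, based on the names mentioned above (TYPEMARKERSTR, PyInit__cbson); this is an illustration of the caching technique, not the exact PyMongo patch:

```c
/* Cache the attribute-name string once at module init instead of
 * letting PyObject_HasAttrString rebuild it on every call. */
static PyObject* TYPEMARKERSTR;  /* created once in PyInit__cbson:      */
                                 /* TYPEMARKERSTR =                      */
                                 /*     PyUnicode_FromString("_type_marker"); */

static long _type_marker(PyObject* object) {
    long type = 0;
    /* Before: PyObject_HasAttrString(object, "_type_marker"),
     * which allocates a fresh str PyObject* per call.
     * After: reuse the cached str object. */
    if (PyObject_HasAttr(object, TYPEMARKERSTR)) {
        PyObject* type_marker = PyObject_GetAttr(object, TYPEMARKERSTR);
        if (type_marker == NULL) {
            return -1;
        }
        type = PyLong_AsLong(type_marker);
        Py_DECREF(type_marker);
        if (type == -1 && PyErr_Occurred()) {
            return -1;
        }
    }
    return type;
}
```

PyObject_HasAttr and PyObject_GetAttr take an already-constructed str object, which is what makes reusing TYPEMARKERSTR possible.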

eagmon commented 1 year ago

oh, interesting, so it has been just creating that new Python string every time instead of reusing it. Yeah, that seems like a very clean fix.

Robotato commented 1 year ago

I'm also not really familiar with C, but this makes sense to me. Surprising that it makes such a big difference!

Robotato commented 1 year ago

Oops, did not mean to close haha

thalassemia commented 1 year ago

I'm splitting hairs at this point, but there is another section of frequently used code where they use snprintf to convert a counter int to a string inside a for loop. If they just cached these strings somehow, there's another 30-40% performance improvement to be had. The following code illustrates what I mean. Of course, we don't know what items is until runtime, and it could be greater than the cache size, so a more sophisticated caching mechanism is needed.

#include <stdio.h>

#define SIZE 400000  // a static array length must be a compile-time constant
static char str_cache[SIZE][16];

static void function(int items) {
    int i;
    for (i = 0; i < items; i++) {
        // replace the following
        // char name[16];
        // snprintf(name, sizeof(name), "%d", i);
        char *name = str_cache[i];
        // ... do stuff with name
    }
}

int main() {
    int i;
    for (i = 0; i < SIZE; i++) {
        snprintf(str_cache[i], sizeof(str_cache[i]), "%d", i);
    }
    // call function() a lot here
}
1fish2 commented 1 year ago

Great detective work!

Using the Python C API requires baby-sitting many subtle requirements. It's nice when we can let Cython generate that code or at least generate code to crib from.

AFAIK, the TYPEMARKERSTR is OK. It's just one str instance that can leak if the cbson module can/does get unloaded. The proper fix might be to keep it in the cbson module object rather than a static variable, but the existing _type_marker() function doesn't have a reference to the module.

Note: PyObject_HasAttrString() has a fast path that calls the C method (*Py_TYPE(v)->tp_getattr)(v, (char*)name) without having to convert the name to a Python str. See _PyObject_LookupAttr().

Thus an alternative and potentially more comprehensive optimization would be to make the BSON extension types implement tp_getattr. (Caveat: tp_getattr is deprecated and apparently no longer implemented by the built-in Python classes. That doesn't matter if _type_marker(PyObject* object) is nearly always called on the BSON extension classes.) -- Scratch that idea. The BSON classes (e.g. Int64 and its _type_marker attribute) are ordinary Python classes, not C extension classes.

So it's totally worth reporting this PyMongo issue.

Maybe also mention that further down the _type_marker() function is some pointless code:

        /*
         * Py(Long|Int)_AsLong returns -1 for error but -1 is a valid value
         * so we call PyErr_Occurred to differentiate.
         */
        if (type == -1 && PyErr_Occurred()) {
            return -1;  // pointless: the function falls through to "return type;", which is also -1
        }
...
return type;
thalassemia commented 1 year ago

Thanks Jerry! I've just reported the issue, and it can be viewed at this link. What do you think about the other optimization? The performance gain is slimmer there, and it comes at the cost of higher memory usage.

1fish2 commented 1 year ago

Nice!

The other optimization idea has good potential.

One idea is to change snprintf(... "%d" ...) to itoa(i) so it doesn't have to interpret a format string every time. It turns out that itoa is not in the C standard so they're better off copying a well-tested implementation.

If the same integer values repeat often enough, a cache is a good approach. E.g. memoization, where each cache "entry" has a (number, string) pair. To convert the value i, it'd reuse or else update cache entry i % n (for a cache of n entries).

FYI the C language tries to pretend that an array is equivalent to a pointer, except when it isn't such as inside another array or a struct. So I'd probably define the string buffers as an array of structs, where each struct holds a char array. The input numbers could go in those structs or in a parallel array.
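The memoization scheme above can be sketched in a few lines; the cache size and function name here are illustrative, not anything from PyMongo:

```c
#include <stdio.h>
#include <string.h>

/* Fixed-size memoization cache of (number, string) pairs, indexed by
 * i % CACHE_SIZE. A miss overwrites the slot; a hit reuses the stored
 * string without calling snprintf again. */
#define CACHE_SIZE 1024

typedef struct {
    long num;       /* the cached integer */
    char str[24];   /* its decimal representation; "" means slot unused */
} cache_entry;

static cache_entry cache[CACHE_SIZE];  /* zero-initialized */

const char* int_to_str_cached(long i) {
    cache_entry* e = &cache[(unsigned long)i % CACHE_SIZE];
    if (e->str[0] == '\0' || e->num != i) {
        /* miss: fill (or overwrite) this entry */
        e->num = i;
        snprintf(e->str, sizeof(e->str), "%ld", i);
    }
    return e->str;
}
```

Each struct holds the number alongside its char array, per the array-of-structs suggestion below.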

thalassemia commented 1 year ago

The itoa implementation you linked worked great! For a 2000-second simulation, my pre-built array solution took 244 seconds, PyMongo's current INT2STRING macro took 276 seconds, and your faster itoa took 250 seconds. My profiling shows that the insert_one call is over 2x faster with the optimized itoa.

I made a separate PyMongo issue for this that can be viewed here. Thanks again for all the input guys!

1fish2 commented 1 year ago

Nicely done on the issue reports!

They have lots of open issues and might process a PR much sooner.

In the meantime we could fork the mongo-python-driver repo and put it on PyPI. Or monkey-patch the library to load a different bson._cbson extension.

Note: The bson module implements a SON class which seems to do a lot of work just to provide a dict that keeps its keys in insertion order. As of Python 3.7 -- the oldest supported version of Python -- the built-in dict class keeps its keys in insertion-order.

I don't know how much impact this has on vivarium-ecoli, but the class can't compete with the built-in dict in speed and size. Besides running a bunch of interpreted Python code, has_key() is O(n) rather than O(1) since it has to scan a list, and to_dict() has to recursively transform nested elements.

You could monkey-patch the class to measure the performance impact:

import bson
bson.SON = dict  # do this before any other module imports bson

Verifying in a REPL:

>>> import bson
>>> bson.SON()
SON([])
>>> bson.SON = dict  # apply the monkey-patch
>>> bson.SON()
{}
thalassemia commented 1 year ago

Thanks for the monkey-patch suggestion. In my testing, this results in an additional 15% improvement in PyMongo insert_one performance! Also, I didn't realize I could make PRs directly on the PyMongo repo until you pointed it out. Those are now up: https://github.com/mongodb/mongo-python-driver/pull/1221 https://github.com/mongodb/mongo-python-driver/pull/1219

1fish2 commented 1 year ago

Awesome!

A PR for the SON speedup could replace the contents of son.py with an updated docstring and SON = dict. The setup.py file already specifies python_requires=">=3.7".

thalassemia commented 1 year ago

Learned something interesting while running their unit tests with SON = dict. Two dictionaries that contain the same data but in a different order are considered equal using ==. This breaks one of their tests and might be an important feature of their SON implementation.

1fish2 commented 1 year ago

We can fix that. (Tests are good.)

from __future__ import annotations  # lets "-> SON" appear before the class is fully defined
from typing import Any, Iterator

class SON(dict):
    """SON data.

    A subclass of dict that provides a few extra niceties for dealing with SON
    and some Python 2 dict methods. SON provides an API similar to
    collections.OrderedDict, maintaining keys in insertion-order, which in
    Python 3.7+ is provided by dict.
    """

    def copy(self) -> SON:
        return SON(self)

    def has_key(self, key: Any) -> bool:
        return key in self

    def iterkeys(self) -> Iterator:
        return iter(self.keys())

    def itervalues(self) -> Iterator:
        return iter(self.values())

    def __eq__(self, other: Any) -> bool:
        """Comparison to another SON is insertion-order-sensitive while
        comparison to another mapping is order-insensitive.
        """
        if isinstance(other, SON):
            return list(self.items()) == list(other.items())
        return super().__eq__(other)

    def __ne__(self, other: Any) -> bool:
        return not self == other

    def to_dict(self) -> dict:
        return dict(self)
thalassemia commented 1 year ago

I've gotten feedback on both of my PRs. It looks like there are licensing issues surrounding the MySQL code I adapted. Not sure how to proceed.

I also got great advice on the _type_marker PR about adding my new PyObject to the module_state, where it would be automatically freed by the _cbson_clear function on teardown. However, this proved more challenging to implement than I anticipated. Can you look over my code on that PR @1fish2?

Now that I've rerun my sims a couple of times, I think our sim runtimes with or without monkey-patching the SON class are within the margin of error. Implementing it in PyMongo is also tricky because to support Python versions before 3.9, I'd need to add from __future__ import annotations at the top of every file which uses SON as a subscriptable type hint to enable PEP 563. You were right to suspect that the recursive nature of to_dict may be important. There's a test for that which the new SON fails.

1fish2 commented 1 year ago

This is very challenging.

Looking into the _type_marker PR, this doc aims to explain module state. We might need to find clearer documentation or test if a sample Cython extension generates C code that can handle module state. I read stuff on subinterpreters and some of PEPs 489, 554, 573, and 3121. It's a tarpit that the Python maintainers are striving to fix.

I'll mention some of the issues, then some alternative ideas.

Problems:

Alternative ideas:

1fish2 commented 1 year ago

I see that you wrote a replacement for the MySQL code to avoid the licensing issues. This is terrific and esp. that you included a unit test that checks boundary cases! Your implementation mostly looks good to me but there are details and cautions. C is tricky!

1fish2 commented 1 year ago

Since monkey-patching the SON class doesn't make a significant speedup, then I totally agree about dropping it.

FYI __future__ annotations is still on hold in Python 3.11. It might never become the future.

Using a string literal as a type annotation avoids that: it makes the annotation a single object (thus saving load time like "future" annotations were supposed to do), handles forward references (as in a method referring to its own class, "SON[_Key, _Value]"), and avoids needing to import definitions just for types.
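A minimal illustration of the string-literal approach, using a hypothetical SON-like dict subclass (not PyMongo's actual code):

```python
from typing import Any


class SON(dict):
    # The string literal "SON" is a forward reference: it is never
    # evaluated at class-definition time, so no __future__ import or
    # PEP 563 support is required on any supported Python version.
    def copy(self) -> "SON":
        return SON(self)

    def has_key(self, key: Any) -> bool:
        return key in self
```

The annotation is stored as the plain string "SON" in __annotations__, so nothing needs to exist at definition time.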

1fish2 commented 1 year ago

On second thought, it really should protect against buffer overrun. Apparently INT2STRING was called with a character array, so the macro can use sizeof to get the buffer size for this. And it should maintain the existing interface to the degree possible (not that the macro ever stated its interface). snprintf() would return an int, but the platform-specific macro definitions could differ on that, and a return value isn't needed if a void function compiles everywhere that pymongo calls it.

itoa() is a common built-in function that doesn't take a buffer size, so this function should use a different name to avoid clashing with it. You might pick something other than what I wrote below.

So in the .h file:

#define INT2STRING(buffer, i) long_to_decimal((buffer), sizeof((buffer)), (i))

/* Converts an integer to its string representation in decimal notation. */
extern void long_to_decimal(char* buffer, size_t size, long num);

and the .c file:

#include "_cbsonmodule.h"

extern void long_to_decimal(char* buffer, size_t size, long num) {
  assert(size >= 20);  // or some other way to fail? snprintf() does something more whacky

  ... // the rest as it was,
  ... // but do the type cast(s) as needed for buffer containing possibly signed chars, e.g.
  unsigned char* str = (unsigned char*)buffer;
  // ...
}

The test function shouldn't need to clear the buffers since this function shouldn't care what's in the output buffer going in. If I'm wrong about that, memset(str_1, 0, sizeof(str_1)) is safer than memset(str_1, 0, strlen(str_1)), since strlen will scan for a terminating \0 char, and if it goes past the end of the buffer, memset() will cause havoc.

thalassemia commented 1 year ago

My _type_marker PR just got merged in. Regarding the concerns you raised, both _cbson and _cmessage call PyModule_Create with a PyModuleDef (here and here) in their PyInit_{name} functions. In its PyModuleDef struct, _cbson_clear is placed in the m_clear slot. Is this not good enough to ensure that no memory is leaked? I don't quite understand the difference between m_clear and m_free. If I'm understanding the documentation correctly, Py_CLEAR does not deallocate, but it does decrement the reference count of its argument and set it to NULL, at which point the Python GC can work its magic.

For functions that are registered in the method table (_CBSONMethods and _CMessageMethods), METH_VARARGS ensures that the first parameter self for module functions is the module object. I think all functions outside the method table that expect a module object as the first argument are only called by functions that are included in the table and can pass on their reference to the module.

The tricky part for me was figuring out PyArg_ParseTuple format strings (O& for object conversion is a cool idea, but my final PR had to scrap it so convert_codec_options could accept a module object as its first argument) and figuring out that the _cmessage module (which calls convert_codec_options in _cbson and must therefore have a reference to the _cbson module) already has a reference to the _cbson module in its own module state.

thalassemia commented 1 year ago

Thanks for the INT2STRING suggestions! I'll make those changes right away. The more I learn about C, the more I appreciate how good I've had it all these years with Python. The maintainers also want me to include a unit test in Python for my code. Do you know how to do that? I had intended for my test_int2str.c script to be temporary just to convince them that my code works, because I don't know how to integrate it into their preexisting testing harness.

thalassemia commented 1 year ago

Now that I've done some more reading, here's my understanding of the difference between m_clear and m_free. m_clear is only called when the GC detects a reference cycle, and its purpose is to break those cycles by decrementing reference counts and setting pointers to NULL. As noted in the documentation, Python strings hold no references to other objects and so can never be part of a reference cycle, which means the line that I added to _cbson_clear was unnecessary.

All objects of type PyModule_Type perform deallocation by calling this general module_dealloc destructor. If some special behavior is necessary to deallocate memory, it can be included in the m_free function, which is called here in that destructor. The destructor then frees the memory allocated for the module state. This part confuses me because the _cbson and _cmessage module states contain PyObject* pointers. If no reference cycle is detected and m_clear is never called, does that mean these objects never get their reference counts decremented, and their underlying memory is never freed even though the PyObject* pointers themselves have been deallocated? If so, would registering a function like _cbson_clear (one that calls Py_CLEAR on every member of the module_state struct) as m_free as well fix the code?

This is old documentation, but it's interesting that in this sample code no m_free is defined. The sample code in PEP 3121 does the same thing with a comment that m_free is not needed since everything is done in m_clear.
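A hedged sketch of how the two slots could cooperate, with illustrative names rather than PyMongo's actual module code; the idea is to reuse the clearing logic from m_free so cached objects are released even when no GC cycle ever triggers m_clear:

```c
typedef struct {
    PyObject* type_marker_str;  /* e.g. a cached str like TYPEMARKERSTR */
} module_state;

static int module_clear(PyObject* m) {
    module_state* state = (module_state*)PyModule_GetState(m);
    Py_CLEAR(state->type_marker_str);  /* DECREF and set pointer to NULL */
    return 0;
}

static void module_free(void* m) {
    /* m_free receives the module as void*; delegate to module_clear so
     * the reference is dropped at teardown too. Py_CLEAR's NULL-ing
     * makes a second call after m_clear harmless. */
    module_clear((PyObject*)m);
}

static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT,
    "example",             /* m_name */
    NULL,                  /* m_doc */
    sizeof(module_state),  /* m_size */
    NULL,                  /* m_methods */
    NULL,                  /* m_slots */
    NULL,                  /* m_traverse */
    module_clear,          /* m_clear */
    module_free,           /* m_free */
};
```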

1fish2 commented 1 year ago

Congrats on the code merge!

1fish2 commented 1 year ago

On testing itoa:

thalassemia commented 1 year ago

It took a lot of trial and error, but I finally managed to get Python to use the m_free function reliably upon interpreter shutdown (confirmed by printf statements in the relevant C functions). Here are my takeaways:

Very interesting exploration, but I don't think it's worth submitting a PR for.

thalassemia commented 1 year ago

Just updated my int2str PR branch. Here's a summary of the key changes I made:

1fish2 commented 1 year ago
thalassemia commented 1 year ago

Great idea. Thanks so much for all your help over this past week! I've just put up what will hopefully be the last commit on this final PR. Closing unless something else crops up.

1fish2 commented 1 year ago

Cool!

BTW you'll need to update the requirements.txt file when the new pymongo release gets to PyPI.

extern void long_long_to_str(long long num, char* str) {
    // Buffer should fit 64-bit signed integer
    assert(sizeof(str) > 20);
    ...

^^^ Alas, that assertion checks the size of a pointer, which will be 4 or 8 bytes.

This should work:

extern void long_long_to_str(long long num, char* str, size_t size);
#define INT2STRING(buffer, i) long_long_to_str((i), (buffer), sizeof(buffer))

(Or name the macro differently if the test still needs INT2STRING().)

extern void long_long_to_str(long long num, char* str, size_t size) {
    // Buffer should fit 64-bit signed integer
    assert(size > 20);
    ...
thalassemia commented 1 year ago

Oops good catch. Looks like assert doesn't actually cause my test to fail. I've updated the test with better exception handling, and it appears to work when I try to break it (e.g. give it too small a buffer or forcibly change one of the strings to make them unequal).

thalassemia commented 1 year ago

Random flex: Internal testing by the PyMongo folks confirmed that my _type_marker patch does, in fact, yield the encoding/insert performance gains I observed in practice.

1fish2 commented 1 year ago

Wow this change improves the performance of bson encoding by 130% according to our TestDeepEncoding benchmark and improves the overall performance inserting large documents (TestLargeDocInsertOne/TestLargeDocBulkInsert) by 75%

🥇

BTW is there any performance difference between absNum % 10ULL, absNum % 10UL, and absNum % 10U? The machine code might vary between those. Could try using a tool like objdump to view the object code.

eagmon commented 1 year ago

Amazing @thalassemia!

thalassemia commented 1 year ago

BTW is there any performance difference between absNum % 10ULL, absNum % 10UL, and absNum % 10U? The machine code might vary between those. Could try using a tool like objdump to view the object code.

This is late, but I finally got around to this. On my lab desktop (high-end Intel chip with GCC 11.3.0), the machine code does not change at all regardless of whether I use U, UL, or ULL.

My other PR just got merged, and I'm kinda hoping they'll have more concrete performance numbers to share for that one as well.

thalassemia commented 1 year ago

Seems the INT2STRING inefficiency really only matters for large array fields, which our model emits a lot of. Still makes a huge difference in that specific case, which is great to see.

1fish2 commented 1 year ago

Fantastic!

@prismofeverything mentioned rethinking how Vivarium reads & writes numpy arrays to Mongo. Could it let ndarray.tobytes() convert an array to a bytes in one native method call, write that blob to Mongo, and use numpy.frombuffer() to convert the blob back to an array? (It might need to add the array shape.)

If other Vivarium processes need to read portions of a large array, the writer could potentially partition it into smaller arrays to write or readers could read the entire array and get a view onto a portion of it or extract a range of bytes and convert that to an array.
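A rough sketch of the tobytes()/frombuffer() round trip described above, with hypothetical helper names; the dtype and shape travel alongside the blob so the array can be reconstructed:

```python
import numpy as np


def array_to_blob(arr):
    """Serialize an ndarray to a raw bytes blob plus the metadata
    (dtype, shape) needed to reconstruct it, e.g. for a Mongo document."""
    return {"data": arr.tobytes(), "dtype": str(arr.dtype), "shape": arr.shape}


def blob_to_array(blob):
    """Rebuild the ndarray from the blob. frombuffer returns a read-only
    view over the bytes; copy() it if mutation is needed."""
    return np.frombuffer(blob["data"], dtype=blob["dtype"]).reshape(blob["shape"])
```

Both conversions happen in a single native call each, which is the appeal over per-element serialization.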

thalassemia commented 1 year ago

That's an interesting idea. It would certainly improve serialization performance, but I'm not sure if it's worth the additional complexity. I also like the fact that our current model outputs are, barring the need to extract certain metadata from simData.pkl for interpretation, entirely language agnostic. Could be useful in case Julia ever does become a true competitor to Python.

1fish2 commented 1 year ago

Yes, that's a wild idea and I'm unsure about its tradeoffs.

But I'm betting on Mojo, not Julia.

Mojo is only ≈v0.1 so far, but it has very smart goals and foundations. It will be a superset of Python, part of the Python ecosystem, so it can run existing Python code and libraries. With some added language features, it can compile to fast machine code using MLIR (Multi-Level Intermediate Representation) to target the ever growing range of hardware accelerators (vectorization, multi-core, GPU, TPU), thus faster than C and Rust code. When ready, moving from Python to Mojo will be easier than Python 2 to 3.

Chris Lattner is a major force behind LLVM/Clang, Swift (to enable incrementally replacing and modernizing Objective C), MLIR, and Mojo.

thalassemia commented 1 year ago

Oh yeah, I recently watched a funny video about Mojo and found it super promising as well! I'm excited to see where Lattner and his team take it in the coming years.