CovertLab / vEcoli

Whole cell model of E. coli implemented with Vivarium
https://covertlab.github.io/vEcoli/
MIT License

Inefficiencies in PyMongo Driver #195

Closed thalassemia closed 1 year ago

thalassemia commented 1 year ago

After my Numpy optimizations, we spend about 35% of our simulation runtime purely in the insert_one function of PyMongo. Using line-by-line profiling of C extensions in py-spy, I discovered that the majority of that time was spent on this line, which is called for every value in every document to be written to the database.

I never formally learned C or Python's C API, but after some trial and error, I made the pictured tweaks to the _type_marker function and added a single line to the PyInit__cbson function: TYPEMARKERSTR = PyUnicode_FromString("_type_marker"); (screenshot attached)

With these changes alone, the time spent in the insert_one function decreased by nearly 60% to about 15% of the total simulation runtime. I was wondering if there are any significant downsides to these changes. If not, I could try submitting an issue to the PyMongo JIRA and see what happens.

eagmon commented 1 year ago

Wow! So checking if there is a '_type_marker' string attribute is taking a ridiculous amount of time, which is reduced by just using a Python unicode string instead? That seems too good to be true. I'm not sure how this influences the C, but I'd imagine you have to be very careful about managing the references.

thalassemia commented 1 year ago

Yeah, PyObject_HasAttrString first creates a new str PyObject* by calling PyUnicode_FromString under the hood (function linked here). Since we know what the string will be here, we can make the object in advance and reuse it to save a lot of time, especially since this function sees so much traffic. I don't know what the memory implications are but I can at least confirm there's no obvious leak (no ballooning memory usage) like in my initial attempts.
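To make the idea concrete, here is a minimal sketch of the kind of change being described, based on the names mentioned above (TYPEMARKERSTR, PyInit__cbson); this is an illustration of the caching technique, not the exact PyMongo patch:

```c
/* Cache the attribute-name string once at module init instead of
 * letting PyObject_HasAttrString rebuild it on every call. */
static PyObject* TYPEMARKERSTR;  /* created once in PyInit__cbson:      */
                                 /* TYPEMARKERSTR =                      */
                                 /*     PyUnicode_FromString("_type_marker"); */

static long _type_marker(PyObject* object) {
    long type = 0;
    /* Before: PyObject_HasAttrString(object, "_type_marker"),
     * which allocates a fresh str PyObject* per call.
     * After: reuse the cached str object. */
    if (PyObject_HasAttr(object, TYPEMARKERSTR)) {
        PyObject* type_marker = PyObject_GetAttr(object, TYPEMARKERSTR);
        if (type_marker == NULL) {
            return -1;
        }
        type = PyLong_AsLong(type_marker);
        Py_DECREF(type_marker);
        if (type == -1 && PyErr_Occurred()) {
            return -1;
        }
    }
    return type;
}
```

PyObject_HasAttr and PyObject_GetAttr take an already-constructed str object, which is what makes reusing TYPEMARKERSTR possible.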

eagmon commented 1 year ago

oh, interesting, so it has been just creating that new Python string every time instead of reusing it. Yeah, that seems like a very clean fix.

Robotato commented 1 year ago

I'm also not really familiar with C, but this makes sense to me. Surprising that it makes such a big difference!

Robotato commented 1 year ago

Oops, did not mean to close haha

thalassemia commented 1 year ago

I'm splitting hairs at this point, but there is another section of frequently used code where they use snprintf to convert a counter int to a string inside a for loop. If they just cached these strings somehow, there's another 30-40% performance improvement to be had. The following code illustrates what I mean. Of course, we don't know what items is until runtime, and it could be greater than the cache size, so a more sophisticated caching mechanism is needed.

#include <stdio.h>

#define SIZE 400000  // a static array length must be a compile-time constant
static char str_cache[SIZE][16];

static void function(int items) {
    int i;
    for (i = 0; i < items; i++) {
        // replace the following
        // char name[16];
        // snprintf(name, sizeof(name), "%d", i);
        char *name = str_cache[i];
        // ... do stuff with name
    }
}

int main() {
    int i;
    for (i = 0; i < SIZE; i++) {
        snprintf(str_cache[i], sizeof(str_cache[i]), "%d", i);
    }
    // call function() a lot here
}
1fish2 commented 1 year ago

Great detective work!

Using the Python C API requires baby-sitting many subtle requirements. It's nice when we can let Cython generate that code or at least generate code to crib from.

AFAIK, the TYPEMARKERSTR is OK. It's just one str instance that can leak if the cbson module can/does get unloaded. The proper fix might be to keep it in the cbson module object rather than a static variable, but the existing _type_marker() function doesn't have a reference to the module.

Note: PyObject_HasAttrString() has a fast path that calls the C method (*Py_TYPE(v)->tp_getattr)(v, (char*)name) without having to convert the name to a Python str. See _PyObject_LookupAttr().

Thus an alternative and potentially more comprehensive optimization would be to make the BSON extension types implement tp_getattr. (Caveat: tp_getattr is deprecated and apparently no longer implemented by the built-in Python classes. That doesn't matter if _type_marker(PyObject* object) is nearly always called on the BSON extension classes.) -- Scratch that idea. The BSON classes (e.g. Int64 and its _type_marker attribute) are ordinary Python classes, not C extension classes.

So it's totally worth reporting this PyMongo issue.

Maybe also mention that further down the _type_marker() function is some pointless code:

        /*
         * Py(Long|Int)_AsLong returns -1 for error but -1 is a valid value
         * so we call PyErr_Occurred to differentiate.
         */
        if (type == -1 && PyErr_Occurred()) {
            return -1;  // pointless: the function falls through to "return type;", which is also -1
        }
...
return type;
thalassemia commented 1 year ago

Thanks Jerry! I've just reported the issue, and it can be viewed at this link. What do you think about the other optimization? The performance gain is slimmer there, and it comes at the cost of higher memory usage.

1fish2 commented 1 year ago

Nice!

The other optimization idea has good potential.

One idea is to change snprintf(... "%d" ...) to itoa(i) so it doesn't have to interpret a format string every time. It turns out that itoa is not in the C standard so they're better off copying a well-tested implementation.

If the same integer values repeat often enough, a cache is a good approach. E.g. memoization, where each cache "entry" has a (number, string) pair. To convert the value i, it'd reuse or else update cache entry i % n (for a cache of n entries).

FYI the C language tries to pretend that an array is equivalent to a pointer, except when it isn't such as inside another array or a struct. So I'd probably define the string buffers as an array of structs, where each struct holds a char array. The input numbers could go in those structs or in a parallel array.
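The memoization scheme above can be sketched in a few lines; the cache size and function name here are illustrative, not anything from PyMongo:

```c
#include <stdio.h>
#include <string.h>

/* Fixed-size memoization cache of (number, string) pairs, indexed by
 * i % CACHE_SIZE. A miss overwrites the slot; a hit reuses the stored
 * string without calling snprintf again. */
#define CACHE_SIZE 1024

typedef struct {
    long num;       /* the cached integer */
    char str[24];   /* its decimal representation; "" means slot unused */
} cache_entry;

static cache_entry cache[CACHE_SIZE];  /* zero-initialized */

const char* int_to_str_cached(long i) {
    cache_entry* e = &cache[(unsigned long)i % CACHE_SIZE];
    if (e->str[0] == '\0' || e->num != i) {
        /* miss: fill (or overwrite) this entry */
        e->num = i;
        snprintf(e->str, sizeof(e->str), "%ld", i);
    }
    return e->str;
}
```

Each struct holds the number alongside its char array, per the array-of-structs suggestion below.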

thalassemia commented 1 year ago

The itoa implementation you linked worked great! For a 2000-second simulation, my pre-built array solution took 244 seconds, PyMongo's current INT2STRING macro took 276 seconds, and your faster itoa took 250 seconds. My profiling shows that the insert_one call is over 2x faster with the optimized itoa.

I made a separate PyMongo issue for this that can be viewed here. Thanks again for all the input guys!

1fish2 commented 1 year ago

Nicely done on the issue reports!

They have lots of open issues and might process a PR much sooner.

In the meantime we could fork the mongo-python-driver repo and put it on PyPI. Or monkey-patch the library to load a different bson._cbson extension.

Note: The bson module implements a SON class which seems to do a lot of work just to provide a dict that keeps its keys in insertion order. As of Python 3.7 -- the oldest supported version of Python -- the built-in dict class keeps its keys in insertion-order.

I don't know how much impact this has on vivarium-ecoli, but the class can't compete with the built-in dict in speed and size. Besides running a bunch of interpreted Python code, has_key() is O(n) rather than O(1) since it has to scan a list, and to_dict() has to recursively transform nested elements.

You could monkey-patch the class to measure the performance impact:

import bson
bson.SON = dict  # do this before any other module imports bson

Verifying in a REPL:

>>> import bson
>>> bson.SON()
SON([])
>>> bson.SON = dict  # apply the monkey-patch
>>> bson.SON()
{}
thalassemia commented 1 year ago

Thanks for the monkey-patch suggestion. In my testing, this results in an additional 15% improvement in PyMongo insert_one performance! Also, I didn't realize I could make PRs directly on the PyMongo repo until you pointed it out. Those are now up: https://github.com/mongodb/mongo-python-driver/pull/1221 https://github.com/mongodb/mongo-python-driver/pull/1219

1fish2 commented 1 year ago

Awesome!

A PR for the SON speedup could replace the contents of son.py with an updated docstring and SON = dict. The setup.py file already specifies python_requires=">=3.7".

thalassemia commented 1 year ago

Learned something interesting while running their unit tests with SON = dict. Two dictionaries that contain the same data but in a different order are considered equal using ==. This breaks one of their tests and might be an important feature of their SON implementation.

1fish2 commented 1 year ago

We can fix that. (Tests are good.)

from __future__ import annotations  # lets "-> SON" appear before the class is fully defined
from typing import Any, Iterator

class SON(dict):
    """SON data.

    A subclass of dict that provides a few extra niceties for dealing with SON
    and some Python 2 dict methods. SON provides an API similar to
    collections.OrderedDict, maintaining keys in insertion-order, which in
    Python 3.7+ is provided by dict.
    """

    def copy(self) -> SON:
        return SON(self)

    def has_key(self, key: Any) -> bool:
        return key in self

    def iterkeys(self) -> Iterator:
        return iter(self.keys())

    def itervalues(self) -> Iterator:
        return iter(self.values())

    def __eq__(self, other: Any) -> bool:
        """Comparison to another SON is insertion-order-sensitive while
        comparison to another mapping is order-insensitive.
        """
        if isinstance(other, SON):
            return list(self.items()) == list(other.items())
        return super().__eq__(other)

    def __ne__(self, other: Any) -> bool:
        return not self == other

    def to_dict(self) -> dict:
        return dict(self)
thalassemia commented 1 year ago

I've gotten feedback on both of my PRs. It looks like there are licensing issues surrounding the MySQL code I adapted. Not sure how to proceed.

I also got great advice on the _type_marker PR about adding my new PyObject to the module_state, where it would be automatically freed by the _cbson_clear function on teardown. However, this proved more challenging to implement than I anticipated. Can you look over my code on that PR @1fish2?

Now that I've rerun my sims a couple of times, I think our sim runtimes with or without monkey-patching the SON class are within the margin of error. Implementing it in PyMongo is also tricky because to support Python versions before 3.9, I'd need to add from __future__ import annotations at the top of every file which uses SON as a subscriptable type hint to enable PEP 563. You were right to suspect that the recursive nature of to_dict may be important. There's a test for that which the new SON fails.

1fish2 commented 1 year ago

This is very challenging.

Looking into the _type_marker PR, this doc aims to explain module state. We might need to find clearer documentation or test if a sample Cython extension generates C code that can handle module state. I read stuff on subinterpreters and some of PEPs 489, 554, 573, and 3121. It's a tarpit that the Python maintainers are striving to fix.

I'll mention some of the issues, then some alternative ideas.

Problems:

Alternative ideas:

1fish2 commented 1 year ago

I see that you wrote a replacement for the MySQL code to avoid the licensing issues. This is terrific and esp. that you included a unit test that checks boundary cases! Your implementation mostly looks good to me but there are details and cautions. C is tricky!

1fish2 commented 1 year ago

Since monkey-patching the SON class doesn't make a significant speedup, then I totally agree about dropping it.

FYI __future__ annotations is still on hold in Python 3.11. It might never become the future.

Using a string literal as a type annotation avoids that: it makes the annotation a single object (thus saving load time like "future" annotations were supposed to do), handles forward references (as in a method referring to its own class, "SON[_Key, _Value]"), and avoids needing to import definitions just for types.
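A minimal illustration of the string-literal approach, using a hypothetical SON-like dict subclass (not PyMongo's actual code):

```python
from typing import Any


class SON(dict):
    # The string literal "SON" is a forward reference: it is never
    # evaluated at class-definition time, so no __future__ import or
    # PEP 563 support is required on any supported Python version.
    def copy(self) -> "SON":
        return SON(self)

    def has_key(self, key: Any) -> bool:
        return key in self
```

The annotation is stored as the plain string "SON" in __annotations__, so nothing needs to exist at definition time.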

1fish2 commented 1 year ago

On second thought, it really should protect against buffer overrun. Apparently INT2STRING was called with a character array, so the macro can use sizeof to get the buffer size for this. And it should maintain the existing interface to the degree possible (not that the macro ever stated its interface). snprintf() would return an int, but the platform-specific macro definitions could differ on that, and a return value isn't needed if a void function compiles everywhere that pymongo calls it.

itoa() is a common built-in function that doesn't take a buffer size, so this function should use a different name to avoid clashing with it. You might pick something other than what I wrote below.

So in the .h file:

#define INT2STRING(buffer, i) long_to_decimal((buffer), sizeof((buffer)), (i))

/* Converts an integer to its string representation in decimal notation. */
extern void long_to_decimal(char* buffer, size_t size, long num);

and the .c file:

#include "_cbsonmodule.h"

extern void long_to_decimal(char* buffer, size_t size, long num) {
  assert(size >= 20);  // or some other way to fail? snprintf() does something more whacky

  ... // the rest as it was,
  ... // but do the type cast(s) as needed for buffer containing possibly signed chars, e.g.
  unsigned char* str = (unsigned char*)buffer;
  // ...
}

The test function shouldn't need to clear the buffers since this function shouldn't care what's in the output buffer going in. If I'm wrong about that, memset(str_1, 0, sizeof(str_1)) is safer than memset(str_1, 0, strlen(str_1)), since strlen will scan for a terminating \0 char, and if it goes past the end of the buffer, memset() will cause havoc.

thalassemia commented 1 year ago

My _type_marker PR just got merged in. Regarding the concerns you raised, both _cbson and _cmessage call PyModule_Create with a PyModuleDef (here and here) in their PyInit_{name} functions. In its PyModuleDef struct, _cbson_clear is placed in the m_clear slot. Is this not good enough to ensure that no memory is leaked? I don't quite understand the difference between m_clear and m_free. If I'm understanding the documentation correctly, Py_CLEAR does not deallocate, but it does decrement the reference count of its argument and set it to NULL, at which point the Python GC can work its magic.

For functions that are registered in the method table (_CBSONMethods and _CMessageMethods), METH_VARARGS ensures that the first parameter self for module functions is the module object. I think all functions outside the method table that expect a module object as the first argument are only called by functions that are included in the table and can pass on their reference to the module.

The tricky part for me was figuring out PyArg_ParseTuple format strings (O& for object conversion is a cool idea, but my final PR had to scrap it so convert_codec_options could accept a module object as its first argument) and figuring out that the _cmessage module (which calls convert_codec_options in _cbson and must therefore have a reference to the _cbson module) already has a reference to the _cbson module in its own module state.

thalassemia commented 1 year ago

Thanks for the INT2STRING suggestions! I'll make those changes right away. The more I learn about C, the more I appreciate how good I've had it all these years with Python. The maintainers also want me to include a unit test in Python for my code. Do you know how to do that? I had intended for my test_int2str.c script to be temporary just to convince them that my code works, because I don't know how to integrate it into their preexisting testing harness.

thalassemia commented 1 year ago

Now that I've done some more reading, here's my understanding of the difference between m_clear and m_free. m_clear is only called when the GC detects a reference cycle, and its purpose is to break those cycles by decrementing reference counts and setting pointers to NULL. As noted in the documentation, Python strings hold no references to other objects and so can never be part of a reference cycle, which means the line that I added to _cbson_clear was unnecessary.

All objects of type PyModule_Type perform deallocation by calling this general module_dealloc destructor. If some special behavior is necessary to deallocate memory, it can be included in the m_free function, which is called here in that destructor. The destructor then frees the memory allocated for the module state. This part confuses me because the _cbson and _cmessage module states contain PyObject* pointers. If no reference cycle is detected and m_clear is never called, does that mean these objects never get their reference counts decremented, and their underlying memory is never freed even though the PyObject* pointers themselves have been deallocated? If so, would registering a function like _cbson_clear (one that calls Py_CLEAR on every member of the module_state struct) as m_free as well fix the code?

This is old documentation, but it's interesting that in this sample code no m_free is defined. The sample code in PEP 3121 does the same thing with a comment that m_free is not needed since everything is done in m_clear.
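A hedged sketch of how the two slots could cooperate, with illustrative names rather than PyMongo's actual module code; the idea is to reuse the clearing logic from m_free so cached objects are released even when no GC cycle ever triggers m_clear:

```c
typedef struct {
    PyObject* type_marker_str;  /* e.g. a cached str like TYPEMARKERSTR */
} module_state;

static int module_clear(PyObject* m) {
    module_state* state = (module_state*)PyModule_GetState(m);
    Py_CLEAR(state->type_marker_str);  /* DECREF and set pointer to NULL */
    return 0;
}

static void module_free(void* m) {
    /* m_free receives the module as void*; delegate to module_clear so
     * the reference is dropped at teardown too. Py_CLEAR's NULL-ing
     * makes a second call after m_clear harmless. */
    module_clear((PyObject*)m);
}

static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT,
    "example",             /* m_name */
    NULL,                  /* m_doc */
    sizeof(module_state),  /* m_size */
    NULL,                  /* m_methods */
    NULL,                  /* m_slots */
    NULL,                  /* m_traverse */
    module_clear,          /* m_clear */
    module_free,           /* m_free */
};
```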

1fish2 commented 1 year ago

Congrats on the code merge!

1fish2 commented 1 year ago

On testing itoa:

thalassemia commented 1 year ago

It took a lot of trial and error, but I finally managed to get Python to use the m_free function reliably upon interpreter shutdown (confirmed by printf statements in the relevant C functions). Here are my takeaways:

Very interesting exploration, but I don't think it's worth submitting a PR for.

thalassemia commented 1 year ago

Just updated my int2str PR branch. Here's a summary of the key changes I made:

1fish2 commented 1 year ago
thalassemia commented 1 year ago

Great idea. Thanks so much for all your help over this past week! I've just put up what will hopefully be the last commit on this final PR. Closing unless something else crops up.

1fish2 commented 1 year ago

Cool!

BTW you'll need to update the requirements.txt file when the new pymongo release gets to PyPI.

extern void long_long_to_str(long long num, char* str) {
    // Buffer should fit 64-bit signed integer
    assert(sizeof(str) > 20);
    ...

^^^ Alas, that assertion checks the size of a pointer, which will be 4 or 8 bytes.

This should work:

extern void long_long_to_str(long long num, char* str, size_t size);
#define INT2STRING(buffer, i) long_long_to_str((i), (buffer), sizeof(buffer))

(Or name the macro differently if the test still needs INT2STRING().)

extern void long_long_to_str(long long num, char* str, size_t size) {
    // Buffer should fit 64-bit signed integer
    assert(size > 20);
    ...
thalassemia commented 1 year ago

Oops good catch. Looks like assert doesn't actually cause my test to fail. I've updated the test with better exception handling, and it appears to work when I try to break it (e.g. give it too small a buffer or forcibly change one of the strings to make them unequal).

thalassemia commented 1 year ago

Random flex: Internal testing by the PyMongo folks confirmed that my _type_marker patch does, in fact, yield the encoding/insert performance gains I observed in practice.

1fish2 commented 1 year ago

Wow this change improves the performance of bson encoding by 130% according to our TestDeepEncoding benchmark and improves the overall performance inserting large documents (TestLargeDocInsertOne/TestLargeDocBulkInsert) by 75%

🥇

BTW is there any performance difference between absNum % 10ULL, absNum % 10UL, and absNum % 10U? The machine code might vary between those. Could try using a tool like objdump to view the object code.

eagmon commented 1 year ago

Amazing @thalassemia!

thalassemia commented 1 year ago

BTW is there any performance difference between absNum % 10ULL, absNum % 10UL, and absNum % 10U? The machine code might vary between those. Could try using a tool like objdump to view the object code.

This is late, but I finally got around to this. On my lab desktop (high-end Intel chip with GCC 11.3.0), the machine code does not change at all regardless of whether I use U, UL, or ULL.

My other PR just got merged, and I'm kinda hoping they'll have more concrete performance numbers to share for that one as well.

thalassemia commented 1 year ago

Seems the INT2STRING inefficiency really only matters for large array fields, which our model emits a lot of. Still makes a huge difference in that specific case, which is great to see.

1fish2 commented 1 year ago

Fantastic!

@prismofeverything mentioned rethinking how Vivarium reads & writes numpy arrays to Mongo. Could it let ndarray.tobytes() convert an array to a bytes in one native method call, write that blob to Mongo, and use numpy.frombuffer() to convert the blob back to an array? (It might need to add the array shape.)

If other Vivarium processes need to read portions of a large array, the writer could potentially partition it into smaller arrays to write or readers could read the entire array and get a view onto a portion of it or extract a range of bytes and convert that to an array.
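A rough sketch of the tobytes()/frombuffer() round trip described above, with hypothetical helper names; the dtype and shape travel alongside the blob so the array can be reconstructed:

```python
import numpy as np


def array_to_blob(arr):
    """Serialize an ndarray to a raw bytes blob plus the metadata
    (dtype, shape) needed to reconstruct it, e.g. for a Mongo document."""
    return {"data": arr.tobytes(), "dtype": str(arr.dtype), "shape": arr.shape}


def blob_to_array(blob):
    """Rebuild the ndarray from the blob. frombuffer returns a read-only
    view over the bytes; copy() it if mutation is needed."""
    return np.frombuffer(blob["data"], dtype=blob["dtype"]).reshape(blob["shape"])
```

Both conversions happen in a single native call each, which is the appeal over per-element serialization.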

thalassemia commented 1 year ago

That's an interesting idea. It would certainly improve serialization performance, but I'm not sure if it's worth the additional complexity. I also like the fact that our current model outputs are, barring the need to extract certain metadata from simData.pkl for interpretation, entirely language agnostic. Could be useful in case Julia ever does become a true competitor to Python.

1fish2 commented 1 year ago

Yes, that's a wild idea and I'm unsure about its tradeoffs.

But I'm betting on Mojo, not Julia.

Mojo is only ≈v0.1 so far, but it has very smart goals and foundations. It will be a superset of Python, part of the Python ecosystem, so it can run existing Python code and libraries. With some added language features, it can compile to fast machine code using MLIR (Multi-Level Intermediate Representation) to target the ever growing range of hardware accelerators (vectorization, multi-core, GPU, TPU), thus faster than C and Rust code. When ready, moving from Python to Mojo will be easier than Python 2 to 3.

Chris Lattner is a major force behind LLVM/Clang, Swift (to enable incrementally replacing and modernizing Objective C), MLIR, and Mojo.

thalassemia commented 1 year ago

Oh yeah, I recently watched a funny video about Mojo and found it super promising as well! I'm excited to see where Lattner and his team take it in the coming years.