danielplohmann / mcrit

The MinHash-based Code Relationship & Investigation Toolkit (MCRIT) is a framework created to simplify the application of the MinHash algorithm in the context of code similarity.

MongoDB may throw an overflow error #37

Closed yankovs closed 1 year ago

yankovs commented 1 year ago

Hey!

A recurring MongoDB-related error pops up in the mcrit server logs from time to time when running an indexing process that submits files to mcrit.

I'm not sure whether this is a MongoDB issue or an issue in mcrit, but it seems to be related to the ID generation done in mcrit. Could some field in the metadata saved to MongoDB be bigger than the 8-byte integer limit in BSON?
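For illustration, a minimal pymongo sketch that triggers the same OverflowError (this is not from the original setup; the database/collection names and the value are placeholders):

```python
# Sketch: pymongo/BSON only has a signed 64-bit integer type, so inserting
# an integer >= 2**63 raises OverflowError. Names below are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["test_db"]["test_collection"]

doc = {"base_addr": 0xFFFFFFFF00000000}  # does not fit into BSON int64

collection.insert_one(doc)  # OverflowError: MongoDB can only handle up to 8-byte ints
```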

mcrit-server                 | 2023-09-03 04:57:12 [FALCON] [ERROR] POST /samples => Traceback (most recent call last):
mcrit-server                 |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 188, in _dbInsert
mcrit-server                 |     insert_result = self._database[collection].insert_one(self._toBinary(data))
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 671, in insert_one
mcrit-server                 |     self._insert_one(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 611, in _insert_one
mcrit-server                 |     self.__database.client._retryable_write(acknowledged, _insert_command, session)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1568, in _retryable_write
mcrit-server                 |     return self._retry_with_session(retryable, func, s, None)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1413, in _retry_with_session
mcrit-server                 |     return self._retry_internal(retryable, func, session, bulk)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/_csot.py", line 108, in csot_wrapper
mcrit-server                 |     return func(self, *args, **kwargs)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/mongo_client.py", line 1460, in _retry_internal
mcrit-server                 |     return func(session, conn, retryable)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/collection.py", line 599, in _insert_command
mcrit-server                 |     result = conn.command(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/helpers.py", line 315, in inner
mcrit-server                 |     return func(*args, **kwargs)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/pool.py", line 960, in command
mcrit-server                 |     self._raise_connection_failure(error)
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/pool.py", line 932, in command
mcrit-server                 |     return command(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/network.py", line 150, in command
mcrit-server                 |     request_id, msg, size, max_doc_size = message._op_msg(
mcrit-server                 |   File "/usr/local/lib/python3.8/dist-packages/pymongo/message.py", line 765, in _op_msg
mcrit-server                 |     return _op_msg_uncompressed(flags, command, identifier, docs, opts)
mcrit-server                 | OverflowError: MongoDB can only handle up to 8-byte ints
mcrit-server                 |
mcrit-server                 |
mcrit-server                 | During handling of the above exception, another exception occurred:
mcrit-server                 |
mcrit-server                 |
mcrit-server                 | Traceback (most recent call last):
mcrit-server                 |   File "falcon/app.py", line 365, in falcon.app.App.__call__
mcrit-server                 |   File "/opt/mcrit/mcrit/server/utils.py", line 51, in wrapper
mcrit-server                 |     func(*args, **kwargs)
mcrit-server                 |   File "/opt/mcrit/mcrit/server/SampleResource.py", line 126, in on_post_collection
mcrit-server                 |     summary = self.index.addReportJson(req.media, username=username)
mcrit-server                 |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 280, in addReportJson
mcrit-server                 |     return self.addReport(report, calculate_hashes=calculate_hashes, calculate_matches=calculate_matches, username=username)
mcrit-server                 |   File "/opt/mcrit/mcrit/index/MinHashIndex.py", line 265, in addReport
mcrit-server                 |     sample_entry = self._storage.addSmdaReport(smda_report)
mcrit-server                 |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 585, in addSmdaReport
mcrit-server                 |     self._dbInsert("samples", sample_entry.toDict())
mcrit-server                 |   File "/opt/mcrit/mcrit/storage/MongoDbStorage.py", line 197, in _dbInsert
mcrit-server                 |     raise ValueError("Database insert failed.")
mcrit-server                 | ValueError: Database insert failed.
danielplohmann commented 1 year ago

Hey! Thanks for highlighting this. What's interesting is that this never happened to me across 10k+ files, which has me pretty confused right now. Looking at the serialization of the SampleEntry, there are hardly any fields that I could imagine exceeding an 8-byte integer. If I had to guess, it might be some uint/int issue, but without further information it's really hard to figure out. I'll have a look and improve the overall logging so that we get some more telemetry and can identify a reproducible case, which I can then investigate and hopefully fix. I'll ping you here once MCRIT is updated with said extended logging.

yankovs commented 1 year ago

We're at 60k files and more than 50 million functions, pushing this system to its limits, it seems 🦔.

Thank you for the quick response, I'll be glad to help if I can regarding this issue

danielplohmann commented 1 year ago

Oh, haha, I see. :D What specs are you running this on (cores, RAM, disk)? And which parts appear to become especially slow?

I mean, the default configuration given in the docker repo is pretty much tailored to the envisioned use case (up to ~10k files, i.e. manually curated data sets), but I would expect it to still work fine for a low multiple of that.

As you seem to be aiming for more than an order of magnitude beyond that, I would still expect that changing some of the parameters yields tolerable performance. The biggest impact should come from increasing the default lower matching threshold (MinHashConfig.MINHASH_MATCHING_THRESHOLD) to something like 70 and lowering the bands used in MinHash indexing (StorageConfig._default_storage_bands) to something like {4: 10} or {5: 10} - this requires re-indexing though. A sketch of what such a change could look like follows below.
Both should help increase the matching speed and reduce the output report size. There are some more parameters that are currently not exposed in the configs that could also help when bigger hardware is available.
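A minimal sketch of the suggested tuning, assuming the two config classes can be imported and adjusted like this (the import path and application of the values are assumptions, not the confirmed mcrit API):

```python
# Hedged sketch of the suggested tuning; the import path and the way these
# values take effect are assumptions about mcrit's config layer.
from mcrit.config import MinHashConfig, StorageConfig  # assumed module path

# Raise the lower matching threshold so fewer (and stronger) matches are kept.
MinHashConfig.MINHASH_MATCHING_THRESHOLD = 70

# Use fewer bands for MinHash indexing; changing this requires re-indexing.
StorageConfig._default_storage_bands = {4: 10}  # or {5: 10}
```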

yankovs commented 1 year ago

So, currently our setup is three EC2 instances: one for MongoDB alone (4 cores/16 GB/disk that should be fast enough), one for the server plus 15 worker replicas (16 cores/64 GB), and a pretty small one for nginx/web.

The first immediate thing I noticed is that with a lot (10+) of submitter processes (submitting files to be indexed in mcrit), the default number of threads in waitress (8) was not enough. This made the web interface basically unusable while submitting files; even simple queries would take a long time. Passing threads=10 to the serve method in waitress solved this.
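For reference, a minimal sketch of that change; the WSGI app, host, and port are placeholders, only the threads parameter matters here:

```python
# Placeholder WSGI app; in mcrit this would be the Falcon application.
from waitress import serve

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

# The relevant part: raise the worker thread count above waitress' default.
serve(app, host="0.0.0.0", port=8000, threads=10)
```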

The next pretty immediate thing is that some parts of the code don't scale well. For example, in the statistics part of the web interface, the underlying operation is to run countDocuments on samples, families, etc. This works well when the numbers are relatively small, but with tens of millions of functions, functions.countDocuments() just takes a very, very long time. In that case, functions.stats().count or functions.estimatedDocumentCount() is immediate but not as accurate (from my understanding). Another approach is to keep a counts document that is linked to the function/sample collections and is updated on every insert to them (see the sketch below).
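A hedged illustration of both alternatives in pymongo terms (the collection names "functions" and "counters" and the field names are assumptions for illustration, not mcrit's actual schema):

```python
# Illustration of the two alternatives mentioned above; collection and field
# names are assumptions, not mcrit's actual schema.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mcrit"]

# Exact but slow on tens of millions of documents.
exact = db["functions"].count_documents({})

# Near-instant estimate taken from collection metadata; may be slightly off.
estimate = db["functions"].estimated_document_count()

# Counter-document approach: increment a dedicated counter on every insert
# into the functions collection, then read that counter for statistics.
db["counters"].update_one({"_id": "functions"}, {"$inc": {"count": 1}}, upsert=True)
count = db["counters"].find_one({"_id": "functions"})["count"]
```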

Also, it seems like the workers don't quite run at full capacity. The 15 workers we have don't really work 100% in parallel, even though there are more than 15 updateMinHashForSample jobs queued up. That could be related to the config options you mentioned.

danielplohmann commented 1 year ago

Sounds like a nice setup!

> The first immediate thing I noticed is that with a lot (10+) of submitter processes (submitting files to be indexed in mcrit), the default number of threads in waitress (8) was not enough. This made the web interface basically unusable while submitting files; even simple queries would take a long time. Passing threads=10 to the serve method in waitress solved this.

Do you mean this one? https://github.com/danielplohmann/mcrit/blob/2ba6f0861f60fdca183c5e2ebbb483216e8c8e5c/mcrit/__main__.py#L27

Then I will make threads a configurable parameter as well, sounds very reasonable.

> The next pretty immediate thing is that some parts of the code don't scale well. For example, in the statistics part of the web interface, the underlying operation is to run countDocuments on samples, families, etc. This works well when the numbers are relatively small, but with tens of millions of functions, functions.countDocuments() just takes a very, very long time. In that case, functions.stats().count or functions.estimatedDocumentCount() is immediate but not as accurate (from my understanding). Another approach is to keep a counts document that is linked to the function/sample collections and is updated on every insert to them.

Funnily enough, we noticed that at some point as well and had introduced internal counters in the family collection. I just noticed that we never updated the method that delivers the statistics after this. This was just addressed in mcrit 1.0.19, pushed an hour ago. ;) Yes indeed, this already dropped the response time for that API call on my instance from 60 sec to <5 sec. I'm not sure what to do about the aggregation of PicHashes though. Those will remain a slow component of the statistics, but they are also pretty useful for estimating how many "unique" functions are in the DB.
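For illustration, such a distinct count over PicHashes could be expressed as an aggregation roughly like the following (the "functions" collection and "pichash" field names are assumptions about the schema, not confirmed):

```python
# Hedged sketch: counting distinct PicHashes via an aggregation pipeline.
# Collection and field names ("functions", "pichash") are assumptions.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mcrit"]

pipeline = [
    {"$group": {"_id": "$pichash"}},   # collapse functions to unique PicHashes
    {"$count": "unique_pichashes"},    # count the remaining groups
]
result = list(db["functions"].aggregate(pipeline, allowDiskUse=True))
print(result)  # e.g. [{"unique_pichashes": 1234567}]
```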

> Also, it seems like the workers don't quite run at full capacity. The 15 workers we have don't really work 100% in parallel, even though there are more than 15 updateMinHashForSample jobs queued up. That could be related to the config options you mentioned.

This may also be related to how the queue works; I would need to replicate it with a bigger setup myself to get some introspection into that. Normally, both MinHash calculation and matching should be fully parallelized.

danielplohmann commented 1 year ago

Oh, and the original issue with the mongo error above is indeed related to int64/uint64, or rather to BSON not being able to store uint64 values. My guess is that you have some binaries/SMDA reports where the base address is above 0x7FFFFFFFFFFFFFFF. I will look into converting such problematic values, at least for the purpose of database storage.

UPDATE: I was able to replicate the issue with a crafted SMDA report. The original issue is now fixed in the just-published mcrit 1.0.20. Potentially large values (base addresses, function offsets) are now converted to two's complement for storage to achieve BSON int64 compatibility.
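A minimal sketch of such a reinterpretation (not necessarily the exact code used in mcrit 1.0.20):

```python
# Sketch of a uint64 <-> int64 two's-complement reinterpretation for BSON
# storage; not necessarily the exact implementation used in mcrit 1.0.20.

def uint64_to_int64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed so it fits BSON int64."""
    return value - 0x10000000000000000 if value > 0x7FFFFFFFFFFFFFFF else value

def int64_to_uint64(value: int) -> int:
    """Reverse the reinterpretation when reading the value back."""
    return value + 0x10000000000000000 if value < 0 else value

assert uint64_to_int64(0xFFFFFFFF00000000) == -4294967296
assert int64_to_uint64(-4294967296) == 0xFFFFFFFF00000000
```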