cuckoosandbox / cuckoo

Cuckoo Sandbox is an automated dynamic malware analysis system
http://www.cuckoosandbox.org

Cuckoo fails to store in standard MongoDB with certain reports #358

Open SwissKid opened 10 years ago

SwissKid commented 10 years ago

2014-07-28 10:23:16,041 [lib.cuckoo.core.plugins] ERROR: Failed to run the reporting module "MongoDB":
Traceback (most recent call last):
  File "/home/cuckoo/cuckoo/lib/cuckoo/core/plugins.py", line 499, in process
    current.run(self.results)
  File "/home/cuckoo/cuckoo/modules/reporting/mongodb.py", line 195, in run
    self.db.analysis.save(report)
  File "/usr/lib/python2.7/dist-packages/pymongo/collection.py", line 228, in save
    return self.insert(to_save, manipulate, safe, **kwargs)
  File "/usr/lib/python2.7/dist-packages/pymongo/collection.py", line 306, in insert
    continue_on_error, self.__uuid_subtype), safe)
  File "/usr/lib/python2.7/dist-packages/pymongo/connection.py", line 732, in _send_message
    (request_id, data) = self.__check_bson_size(message)
  File "/usr/lib/python2.7/dist-packages/pymongo/connection.py", line 709, in __check_bson_size
    (max_doc_size, self.__max_bson_size))
InvalidDocument: BSON document too large (17837322 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.

Suggested fix: it might be possible to fix this using GridFS (maybe? I just saw it in a StackExchange post) or some other limiter, or by recompiling MongoDB with a patch to increase this limit.

Either way, a check should be put in place to prevent this error from occurring.
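As a rough illustration of such a check, here is a minimal sketch of measuring the BSON size of the report before handing it to pymongo; the `save_report` helper, the fallback stub, and the exact integration point in the reporting module are illustrative assumptions, not Cuckoo's actual code.

```python
# Minimal sketch of a pre-save size check (not the actual Cuckoo code);
# assumes pymongo/bson and a `report` dict built by the reporting module.
import bson

MONGO_MAX_BSON = 16 * 1024 * 1024  # 16 MB server-side document limit

def save_report(db, report):
    # Encode once to learn the real BSON size before the server rejects it.
    size = len(bson.BSON.encode(report))
    if size >= MONGO_MAX_BSON:
        # Fall back to a small stub document instead of losing the whole analysis.
        report = {
            "info": report.get("info", {}),
            "oversized": True,
            "original_size": size,
        }
    db.analysis.save(report)
```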

SwissKid commented 10 years ago

Additional Note: Getting this error with bba4e9627554fef3476b1bea0d52763c442e75e50c9584bcfb012aabf203f05a from malwr.com, so theoretically could be reproduced on any system with that same malware.

botherder commented 10 years ago

Yes, that happens if the report is too big to be stored inside Mongo. We tried to minimize that by splitting the behavioral section, but sometimes other sections might be too big as well.

SwissKid commented 10 years ago

I believe it's the Memory Analysis section with this sample. Might want to split that as well?

jekil commented 10 years ago

Is it possible for you to identify the section that is "too big", or to share the sample with us? Is it bba4e9627554fef3476b1bea0d52763c442e75e50c9584bcfb012aabf203f05a on malwr?

SwissKid commented 10 years ago

It should be that one on malwr, since that's where I got it. It triggers when volatility is turned on with default settings. I also have yara rules in place, but I doubt those are tipping it over.

SwissKid commented 10 years ago

I can include any config files you're interested in, or a listing of any directories.

botherder commented 10 years ago

Yes, that makes sense. Some sections of the Volatility report can be massive at times. I encountered a similar issue before.

rep commented 9 years ago

I retried analyzing the mentioned sample and everything worked like a charm, including Mongo reporting with Volatility processing (memory analysis) enabled.

I suspect that this was related to even bigger behavior logs on this file with full networking / endpoints at the time. Indeed the report is quite huge right now but for me at least it fits in mongo.

Overall, the statement is: while we know that reports can get too big, there's not much we can do about it right now without changing the semantics of our database storage scheme.

My suggestion long-term would be to store volatility results in a separate collection, linking back to the analysis report. That should improve the situation for most samples. However this is breaking backwards compat and thus won't make it into this release.
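As a rough illustration of that long-term suggestion, a minimal sketch of splitting the volatility results into a separate collection with a back-reference could look like the following; the collection and field names are assumptions, not Cuckoo's actual schema.

```python
# Rough sketch of the suggested scheme: keep the volatility output in its own
# collection and leave only a reference in the main analysis document.
def save_split(db, report):
    memory = report.pop("memory", None)
    if memory is not None:
        # Store the bulky volatility results separately...
        memory_id = db.memory.insert({"results": memory})
        # ...and keep only a back-reference in the analysis document.
        report["memory_id"] = memory_id
    return db.analysis.insert(report)
```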

begosch commented 9 years ago

I have encountered this without volatility enabled. Why can't we just store reports in mongodb using GridFS?

jekil commented 9 years ago

@begosch because it would be useless; file system storage would be better at that point. Can you please share the sample (even privately)?

kcchu commented 9 years ago

FYI, I bumped into the same issue with 74678a11c3d3fe69718289fbb95ec3fe734347e5ec2a8f0c9ecf1b9a179cd89c

I still have the analysis output on my disk, which is 458M in total. Please let me know if it is useful to you.

2015-05-31 00:22:05,855 [lib.cuckoo.core.plugins] ERROR: Failed to run the reporting module "MongoDB":
Traceback (most recent call last):
  File "/var/local/cuckoo/cuckoo/lib/cuckoo/core/plugins.py", line 505, in process
    current.run(self.results)
  File "/var/local/cuckoo/cuckoo/modules/reporting/mongodb.py", line 215, in run
    self.db.analysis.save(report)
  File "/var/local/cuckoo/py/local/lib/python2.7/site-packages/pymongo/collection.py", line 285, in save
    return self.insert(to_save, manipulate, safe, check_keys, **kwargs)
  File "/var/local/cuckoo/py/local/lib/python2.7/site-packages/pymongo/collection.py", line 409, in insert
    gen(), check_keys, self.uuid_subtype, client)
DocumentTooLarge: BSON document too large (17321912 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.

KillerInstinct commented 9 years ago

The above two errors are basically unhandleable in Cuckoo's current state. It really requires a rewrite of some of the processing modules. I wrote up some code to help debug these (and optionally delete overly large keys); I suggested in IRC that it be implemented upstream to help debug which processing modules are at fault, especially for custom installs/modules. I specifically observed this with both the behavior and volatility processing modules. I personally use the deletion feature because I prefer some data over no data. :)

Feel free to reference / use / rip: https://github.com/KillerInstinct/cuckoo-modified/commit/ac8ecf8bcae5ff9d47629d0def92658cf86e644f
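The linked commit is the real implementation; purely as an independent sketch of the same idea, something along these lines would measure each top-level report section and drop the ones that cannot fit (the helper name and logging are illustrative):

```python
# Independent sketch of the "drop the oversized keys" workaround described
# above (see the linked commit for the real implementation).
import bson
import logging

log = logging.getLogger(__name__)
MONGO_MAX_BSON = 16 * 1024 * 1024

def shrink_report(report, limit=MONGO_MAX_BSON):
    # Measure each top-level section; wrap the value in a dict so non-document
    # values (lists, strings) can still be BSON-encoded for measurement.
    for key in list(report.keys()):
        size = len(bson.BSON.encode({key: report[key]}))
        if size >= limit:
            log.warning("Dropping oversized report section %r (%d bytes)", key, size)
            del report[key]
    return report
```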

KillerInstinct commented 9 years ago

Again, some data is lost, as opposed to all data for an analysis. In Cuckoo's current state it's one or the other. The best solution in my opinion would be to modify the processing modules so they don't allocate >16MB per key. There's no need to migrate to another data store for a problem that happens relatively rarely with most modern malware samples.

jbremer commented 9 years ago

This issue has been resolved, right? Or should we also split up other parts of the report a bit more (just like we split behavioral logs into pieces)?

jbremer commented 8 years ago

Haven't run into this issue for quite a while, so going to assume it's fixed for now.

GelosSnake commented 8 years ago

I got this error today: "BSON document too large (18064026 bytes) - the connected server supports BSON document sizes up to 16777216 bytes." Seems like the same issue. Related sample SHA256 hash: 1ccc286d33d3fec1853e8f4c17eb7faea390725a8cfe03d23944eedc5bf8d58c
https://malwr.com/submission/status/N2I4ZThmOWRlODZlNDAyNmIwNjNhYjkzYWI3NjQ0ZTI/ https://malwr.com/submission/status/ZmFkN2YyMzE2OGZjNDZkNTk5MGIyYjVmMjAxYjZiNTU/

doomedraven commented 8 years ago

I got it today with 2.0-dev.

Any possible solution? I checked this: http://stackoverflow.com/a/25553887

ERROR:lib.cuckoo.core.plugins:Failed to run the reporting module "MongoDB":
Traceback (most recent call last):
  File "/opt/cuckoo/utils/../lib/cuckoo/core/plugins.py", line 506, in process
    current.run(self.results)
  File "/opt/cuckoo/utils/../modules/reporting/mongodb.py", line 227, in run
    self.db.analysis.save(report)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 2182, in save
    check_keys, manipulate, write_concern)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 530, in _insert
    check_keys, manipulate, write_concern, op_id, bypass_doc_val)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 512, in _insert_one
    check_keys=check_keys)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/pool.py", line 218, in command
    self._raise_connection_failure(error)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/pool.py", line 346, in _raise_connection_failure
    raise error
DocumentTooLarge: BSON document too large (18575204 bytes) - the connected server supports BSON document sizes up to 16793598 bytes.
jbremer commented 8 years ago

@doomedraven Could you share a report.json for this analysis?

doomedraven commented 8 years ago

@jbremer for sure, here you have it, 1gb json O_o

https://www.dropbox.com/s/q4l5zidzfhlmd50/report.json.zip?dl=1

kholbrook1303 commented 8 years ago

@jbremer I certainly don't want to flood this issue, but I am seeing this more frequently with ransomware (specifically Locky), likely due to the massive amount of file I/O.

Here is the JSON report: https://www.dropbox.com/s/agv77vutwnon4ro/report.json?dl=0

keithjjones commented 7 years ago

I also ran into this limit today.

Sample hash: 9b462800f1bef019d7ec00098682d3ea7fc60e6721555f616399228e4e3ad122 https://www.virustotal.com/en/file/9b462800f1bef019d7ec00098682d3ea7fc60e6721555f616399228e4e3ad122/analysis/

keithjjones commented 7 years ago

I'm actually seeing this issue all over the place. In most cases, it is ransomware. Is it worth reopening since this is still happening in 2-RC2?

jbremer commented 7 years ago

Ah, I can see that happening (if not for different reasons than before). Thanks for reporting @keithjjones. Let's reopen this issue, indeed.

keithjjones commented 7 years ago

@jbremer you're the best! Thanks!

KillerInstinct commented 7 years ago

The issue will never be solved unless all of the modules that write to the final JSON 'results' dict absolutely ensure that they do not exceed 16MB of memory per document (per module, or per key:value in the final JSON).

You have a couple of options (that come to mind for me at least):

1) You can split up various modules into 'sub modules' to be reorganized again in the web interface later. This requires code changes on both sides (module and web interface) and also doesn't give you a definite guarantee that a document won't be over 16MB.

2) Introduce some state manager which monitors the output of a module and decides if it needs to split the values into multiple keys (think summary-1, summary-2 if the summary key was 18MB). This would require a pretty significant overhaul of the scheduler, I'd imagine, with some minor changes to the Web UI. A rough sketch of this follows below.

This was much of the reason I wrote up the workaround for just removing problem keys. Since API logs are chunked, you can afford to lose summary (and in many cases memory) data instead of the entire analysis.
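As a rough sketch of option 2 above, splitting an oversized list-valued section into numbered keys (summary-1, summary-2, ...) could look roughly like the following; the function name, the chunking strategy, and the per-chunk budget are illustrative assumptions:

```python
# Illustrative sketch of option 2: split a list-valued section that exceeds
# the BSON limit into numbered keys (e.g. summary-1, summary-2, ...).
import bson

MONGO_MAX_BSON = 16 * 1024 * 1024

def split_section(report, key, max_bytes=MONGO_MAX_BSON // 2):
    items = report.pop(key, [])
    chunk, chunk_no = [], 1
    for item in items:
        chunk.append(item)
        # Re-measure the chunk; once it grows past the budget, flush it.
        if len(bson.BSON.encode({"c": chunk})) > max_bytes:
            report["%s-%d" % (key, chunk_no)] = chunk[:-1]
            chunk, chunk_no = [item], chunk_no + 1
    if chunk:
        report["%s-%d" % (key, chunk_no)] = chunk
    return report
```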

keithjjones commented 7 years ago

FWIW - I turned off "save memory" in the mongodb section and it didn't even come close to getting under the 16MB limit.

I'm not that familiar with the inner workings of the mongo storage methods in cuckoo, but couldn't the whole report be chopped up with gridfs if size is a problem? I'm sure you've already thought of that, but if not, I mention it for discussion.
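To make the GridFS idea concrete, a minimal sketch of storing the serialized report as a chunked GridFS file with a small searchable stub might look like this; the database name, stub layout, and helper are assumptions, not Cuckoo's actual reporting code:

```python
# Minimal sketch of the GridFS idea: serialize the full report and store it
# as a chunked file, keeping only a small searchable stub in `analysis`.
import json
import gridfs
from pymongo import MongoClient

def store_report_gridfs(report):
    db = MongoClient()["cuckoo"]          # connection details are assumptions
    fs = gridfs.GridFS(db)
    blob = json.dumps(report).encode("utf-8")
    file_id = fs.put(blob, filename="report.json")
    # Keep only a lightweight, queryable stub in the analysis collection.
    db.analysis.insert({"info": report.get("info", {}),
                        "report_file_id": file_id})
```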

jbremer commented 7 years ago

I was considering option 1 earlier, but there are plenty of direct backwards incompatibility issues related to that. I'll have to think about it a little bit. Regarding this specific issue - we could also do a workaround by filtering out all affected files from the ransomware encryption stage.

keithjjones commented 7 years ago

FYI - I'm seeing it on more than ransomware, but ransomware seems to be the most consistent.

KillerInstinct commented 7 years ago

If you store the report in GridFS then you can't search it at all. So for people who don't run ES, you'd never be able to load the analysis at all, as the web interface loads from Mongo. You could adapt it to load from GridFS, but then you can't search it (by searching for some tag or registry key or dropped file, etc.). It would also be a lot slower, as for a large analysis Mongo would have to 'rechunk' it together to display it in the web interface.

Also, it's not just ransomware. You could make a very simple program that creates 10 million random registry keys or files, and you'd run into this issue in every single analysis you detonated it in.

keithjjones commented 7 years ago

Isn't ES already a requirement if you want any type of searching?

KillerInstinct commented 7 years ago

Ahh right, forgot they switched to ES for searching to make real-time searching easier. It can be done with GridFS. I wrote some code and tested it out, and it works OK-ish. I believe that since the API calls are stored in ES, that would alleviate some of the delay. As far as ransomware goes, you'd still have a sluggish load, because in order to load from GridFS you have to generate the content again, e.g. connect to Mongo, fetch the GridFS object, and then load it into the view. This is time consuming, especially with large summaries. It would probably be better at that point to just power Django off the .json file and have an option to buffer the entire thing into memory, or stream it in section by section. Either way it would require a lot of change (to jbremer's point, likely breaking backwards compat) for what is really just a poor workaround for my option 2 in the earlier comment.
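To illustrate the load-time cost described above, here is a companion sketch of reading such a GridFS-backed report back for display; it assumes the stub layout from the storage sketch earlier in the thread:

```python
# Companion sketch: loading a GridFS-backed report for the web UI. The whole
# blob has to be reassembled and parsed before anything can be rendered,
# which is the load-time cost described above.
import json
import gridfs
from pymongo import MongoClient

def load_report(analysis_stub):
    db = MongoClient()["cuckoo"]          # connection details are assumptions
    fs = gridfs.GridFS(db)
    grid_out = fs.get(analysis_stub["report_file_id"])
    return json.loads(grid_out.read())    # full report back in memory at once
```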

keithjjones commented 7 years ago

What possible workarounds are there for when this happens? Is it possible to still pull a report.html/report.json remotely if the database is not populated? In my environment, users don't have file system level access to the Cuckoo instances to pull the reports for themselves.

keithjjones commented 7 years ago

I was reviewing the code here:

https://github.com/cuckoosandbox/cuckoo/blob/master/modules/reporting/mongodb.py#L253

It looks like the report is just plugged into the database as one of the last steps. I think trying to make every report be less than 16MB in every case is a much harder task than keeping the full report in a consistent storage facility. Right now, the behavior is chopped up into GridFS, it looks like whole files and pcaps are added to GridFS, and since ES is a requirement now to search, why not just add the report with GridFS as well? Wouldn't that be an easier fix? Is there a reason a report can't be entered into GridFS? Where would I go to check out the code that requires a report not to be in GridFS?

I'd like to lend a hand if possible. Right now I'm trying to get familiar with the large code base. I run into this storage issue almost every day now, multiple times per day, and I can't find a temporary Cuckoo workaround other than running the sample in another type of sandbox. Running another sandbox requires you to spread your time, programming, and resources instead of having them all dedicated to Cuckoo.

HufkensJeroen commented 7 years ago

I am running some crypto malware in Cuckoo. The issue we have is that we still run into the upload size limit for BSON documents. While looking for a solution I came across this thread. Has a solution or workaround been found for this?

Kind regards, Jeroen

netpacket commented 7 years ago

While running a Windows installer, I am running into the upload size limit for BSON documents. Could the behavioral analysis log size be modified in a config file? The option to turn it on or off is available.

doomedraven commented 7 years ago

It has a fixed 16MB limit.

netpacket commented 7 years ago

Thanks. I guess I'll need to run the analysis without behavioral analysis.

wroersma commented 6 years ago

This has been a major issue for a while, affecting dozens of files, with no proper fix yet...

netpacket commented 6 years ago

@doomedraven Do you think we have a config file that lets the user modify the size?

doomedraven commented 6 years ago

If you google it you will see that you can't change that. I have posted my mongodb.py reporting module with a fix many times; do search the issues.

netpacket commented 6 years ago

@doomedraven I did not realize this was a Mongo issue, not a Cuckoo Sandbox setup/infrastructure thing. Thanks.

doomedraven commented 6 years ago

No, that isn't related to Cuckoo. Well, Cuckoo does generate a lot of output sometimes, depending on the sample, but that is easy to fix. I have posted the fix so many times that I don't know why it still hasn't been merged when so many people complain, but you can easily fix it yourself.

netpacket commented 6 years ago

Gotcha. I have moved over to a PostgreSQL DB for now... so yeah, I can go into the file and make the change.

doomedraven commented 6 years ago

You can't use PostgreSQL for the web GUI.

netpacket commented 6 years ago

Wait, PostgreSQL is necessary if you want to have multiple VMs, correct? Then the web GUI is not properly hooked up to the new DB? This is sort of an infrastructure flaw if that's the case. Correct me if I am misunderstanding the Cuckoo infrastructure.

doomedraven commented 6 years ago

No, did you check the manual? The *SQL part is only used to manage tasks; Mongo is used for the web GUI.

netpacket commented 6 years ago

@doomedraven I totally brainfarted. Oops, the tasks live on a different DB, and MongoDB is for the web stuff.

doomedraven commented 6 years ago

read the manual

netpacket commented 6 years ago

I did. Tasks are on SQLite by default and the web GUI is on MongoDB. The issue lies with the MongoDB upload size. I think I get it. I upgraded the SQLite DB to a PostgreSQL DB. Thanks.

SparkyNZL commented 6 years ago

No, just don't use SQLite. I use MySQL with 15 machines in the pool and it works fine.
