Closed qclassified closed 5 years ago
Problem:
For large raw data (e.g. a MISP event with ~2000 attributes), processing fails with `KeyError: 'pop from an empty set'`, followed by `pymongo.errors.AutoReconnect: connection closed`.
Reason:
Each tahoe Instance creates its own backend = MongoBackend because of the `from tahoe import *` line at the top of proc.analytics.filters.filt_misp. The following class-level code block in tahoe.instance.Instance is not used:

```python
class Instance():
    backend = get_backend() if os.getenv("_MONGO_URL") else NoBackend()
```

because os.environ["_MONGO_URL"] is only set at the bottom of proc.analytics.filters.filt_misp, i.e. after the import has already run and os.getenv("_MONGO_URL") has returned None.
Instead, each Instance falls back to the first line of tahoe.instance.Instance.__init__:

```python
class Instance():
    backend = get_backend() if os.getenv("_MONGO_URL") else NoBackend()

    def __init__(self, **kwargs):
        if type(self.backend) == NoBackend and os.getenv("_MONGO_URL"):
            self.backend = get_backend()
        ...
```
However, this fallback is meant for testing only. It creates a separate backend = MongoBackend for every Instance; with ~2000 attributes alone (not counting objects, events, or sessions), this quickly overwhelms the MongoDB server by exceeding the maximum allowed connections in a pool (the maximum number of connections from the same machine).
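The pitfall can be reproduced without tahoe or MongoDB at all. Below is a minimal sketch, where NoBackend / MongoBackend are hypothetical stand-ins for tahoe's classes: because the class body runs at import time (before _MONGO_URL exists), every Instance takes the `__init__` fallback and builds its own backend.

```python
import os

os.environ.pop("_MONGO_URL", None)  # clean slate for the demo


class NoBackend:
    """Stand-in for tahoe's NoBackend (name taken from this issue)."""


class MongoBackend:
    """Stand-in for tahoe's MongoBackend; the real one opens a
    MongoDB connection pool per object."""


def get_backend():
    return MongoBackend()


# The class body executes at import time. _MONGO_URL is not set yet,
# so the shared class attribute becomes a NoBackend().
class Instance:
    backend = get_backend() if os.getenv("_MONGO_URL") else NoBackend()

    def __init__(self):
        # Testing fallback: this assigns an INSTANCE attribute, creating
        # a brand-new backend per Instance instead of upgrading the
        # shared class attribute.
        if type(self.backend) == NoBackend and os.getenv("_MONGO_URL"):
            self.backend = get_backend()


# _MONGO_URL is set only after the class definition, as in filt_misp:
os.environ["_MONGO_URL"] = "mongodb://localhost:27017"

a, b = Instance(), Instance()
print(a.backend is b.backend)           # False: one backend per Instance
print(type(Instance.backend).__name__)  # NoBackend: class attr never upgraded
```

With ~2000 attribute Instances, the real MongoBackend equivalent of this means ~2000 connection pools from one process.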
Temporary solution: put the following code block at the beginning of the script (the original had duplicate dict keys; the unused alternatives are commented out here):

```python
if __name__ == "__main__":
    config = {
        ## "mongo_url" : "mongodb://cybexp_user:CybExP_777@134.197.21.231:27017/?authSource=admin",
        ## "mongo_url" : "mongodb://134.197.21.231:27017/",
        "mongo_url" : "mongodb://localhost:27017",
        ## "analytics_db" : "tahoe_db",
        "analytics_db" : "tahoe_demo",
        "analytics_coll" : "instances"
    }
    os.environ["_MONGO_URL"] = config.pop("mongo_url")
    os.environ["_ANALYTICS_DB"] = config.pop("analytics_db", "tahoe_db")
    os.environ["_ANALYTICS_COLL"] = config.pop("analytics_coll", "instances")
```
and the following code block at the end of the script:

```python
if __name__ == "__main__":
    filt_misp()
```
Permanent (preferred) solution:
Don't run proc.analytics.filters.filt_misp (or any filt_*) directly; use ../analytics.py as the entry point. Then all tahoe Instances will share a common backend, processing will be much faster for smaller events, and large events will not overwhelm MongoDB.