ServerSelectionTimeoutError if no mongo service is running

ericdill commented 8 years ago

---------------------------------------------------------------------------
ServerSelectionTimeoutError               Traceback (most recent call last)
<ipython-input-8-85a74561894d> in <module>()
----> 1 uid = run_start(sf1, 'ixs', str(uuid.uuid4()))

<ipython-input-7-990d0341c391> in run_start(specscan, beamline_id, uid, **md)
      9     }
     10     run_start_dict.update(**md)
---> 11     return insert_run_start(**run_start_dict)

/home/edill/miniconda/envs/ixstools/lib/python3.5/site-packages/metadatastore/commands.py in inner(*args, **kwargs)
    118         port = int(conf.connection_config['port'])
    119         db_connect(database=database, host=host, port=port)
--> 120         return func(*args, **kwargs)
    121     return inner
    122 

/home/edill/miniconda/envs/ixstools/lib/python3.5/site-packages/metadatastore/commands.py in insert_run_start(time, scan_id, beamline_id, uid, owner, group, project, **kwargs)
    612                          project=project, **kwargs)
    613 
--> 614     run_start = run_start.save(validate=True, write_concern={"w": 1})
    615 
    616     _cache_run_start(run_start.to_mongo().to_dict())

/home/edill/miniconda/envs/ixstools/lib/python3.5/site-packages/mongoengine-0.10.5-py3.5.egg/mongoengine/document.py in save(self, force_insert, validate, clean, write_concern, cascade, cascade_kwargs, _refs, save_condition, **kwargs)
    314 
    315         try:
--> 316             collection = self._get_collection()
    317             if self._meta.get('auto_create_index', True):
    318                 self.ensure_indexes()

/home/edill/miniconda/envs/ixstools/lib/python3.5/site-packages/mongoengine-0.10.5-py3.5.egg/mongoengine/document.py in _get_collection(cls)
    207                 cls._collection = db[collection_name]
    208             if cls._meta.get('auto_create_index', True):
--> 209                 cls.ensure_indexes()
    210         return cls._collection
    211 

/home/edill/miniconda/envs/ixstools/lib/python3.5/site-packages/mongoengine-0.10.5-py3.5.egg/mongoengine/document.py in ensure_indexes(cls)
    763 
    764                 if IS_PYMONGO_3:
--> 765                     collection.create_index(fields, background=background, **opts)
    766                 else:
    767                     collection.ensure_index(fields, background=background,

/home/edill/miniconda/envs/ixstools/lib/python3.5/site-packages/pymongo/collection.py in create_index(self, keys, **kwargs)
   1378         keys = helpers._index_list(keys)
   1379         name = kwargs.setdefault("name", helpers._gen_index_name(keys))
-> 1380         self.__create_index(keys, kwargs)
   1381         return name
   1382 

/home/edill/miniconda/envs/ixstools/lib/python3.5/site-packages/pymongo/collection.py in __create_index(self, keys, index_options)
   1284         index.update(index_options)
   1285 
-> 1286         with self._socket_for_writes() as sock_info:
   1287             cmd = SON([('createIndexes', self.name), ('indexes', [index])])
   1288             try:

/home/edill/miniconda/envs/ixstools/lib/python3.5/contextlib.py in __enter__(self)
     57     def __enter__(self):
     58         try:
---> 59             return next(self.gen)
     60         except StopIteration:
     61             raise RuntimeError("generator didn't yield") from None

/home/edill/miniconda/envs/ixstools/lib/python3.5/site-packages/pymongo/mongo_client.py in _get_socket(self, selector)
    710     @contextlib.contextmanager
    711     def _get_socket(self, selector):
--> 712         server = self._get_topology().select_server(selector)
    713         try:
    714             with server.get_socket(self.__all_credentials) as sock_info:

/home/edill/miniconda/envs/ixstools/lib/python3.5/site-packages/pymongo/topology.py in select_server(self, selector, server_selection_timeout, address)
    140         return random.choice(self.select_servers(selector,
    141                                                  server_selection_timeout,
--> 142                                                  address))
    143 
    144     def select_server_by_address(self, address,

/home/edill/miniconda/envs/ixstools/lib/python3.5/site-packages/pymongo/topology.py in select_servers(self, selector, server_selection_timeout, address)
    116                 if server_timeout == 0 or now > end_time:
    117                     raise ServerSelectionTimeoutError(
--> 118                         self._error_message(selector))
    119 
    120                 self._ensure_opened()

ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused

@tacaswell suggests that we should catch this and re-raise it as something better. Could we detect this before it spins for ~60 seconds and raise immediately?

tacaswell commented 8 years ago

How would we catch a timeout immediately?

ericdill commented 8 years ago

Lol, I don't mean try to catch a timeout immediately. I mean check to make sure we are connected before trying to insert.
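A minimal sketch of such a pre-insert check, assuming a plain TCP probe is an acceptable stand-in for a real driver-level check. The name `mongo_is_reachable` is hypothetical, and the default host and port are taken from the traceback above. It uses only the standard library and returns within `timeout` seconds instead of waiting for the driver's server selection timeout:

```python
import socket


def mongo_is_reachable(host="localhost", port=27017, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    This only shows that something is listening on the port. It does not
    prove that the listener is a healthy mongod.
    """
    try:
        # create_connection raises OSError (for example ConnectionRefusedError
        # or a timeout) when the port cannot be reached in time.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A caller could run this right before an insert and raise a clear error when it returns False. It does not cover the case where the server goes away between the probe and the insert.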

cowanml commented 8 years ago

Absolutely! Unless you love getting calls at 2:45am that "bluesky crashed" :( catching stuff like this and providing concise, specific messages is essential for providing an acceptable and sustainable level of support.

Would be awesome to have tests for such common subsystem problems and clear error messages in the test/fuzz/ci stuff. Ideally, hearing "bluesky crashed" would be exciting/interesting... cuz you'd only hear that when a never-seen-before problem occurred.

-matt

ericdill commented 8 years ago

@tacaswell There are apparently a couple of different timeout options that are set by default when you instantiate a new MongoClient. They are:

socketTimeoutMS: (integer or None) Controls how long (in milliseconds) the 
                 driver will wait for a response after sending an ordinary 
                 (non-monitoring) database operation before concluding that a 
                 network error has occurred. Defaults to None (no timeout).
connectTimeoutMS: (integer or None) Controls how long (in milliseconds) the 
                  driver will wait during server monitoring when connecting a 
                  new socket to a server before concluding the server is 
                  unavailable. Defaults to 20000 (20 seconds).
serverSelectionTimeoutMS: (integer) Controls how long (in milliseconds) the 
                          driver will wait to find an available, appropriate 
                          server to carry out a database operation; while it 
                          is waiting, multiple server monitoring operations 
                          may be carried out, each controlled by 
                          connectTimeoutMS. Defaults to 30000 (30 seconds).
waitQueueTimeoutMS: (integer or None) How long (in milliseconds) a thread will 
                    wait for a socket from the pool if the pool has no free 
                    sockets. Defaults to None (no timeout).

That being said, mucking with these timeout values does not seem like a good idea. It is good to know that they exist so that we can modify them in the off chance that some beamline complains that their scan hangs for a while before it reports that it cannot insert any data. I am :-1: on mucking with these default values in the call to MongoClient in metadatastore.mds.MDS.
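If a beamline ever did need an override, one option is to carry it in the connection URI rather than change the defaults in code. The sketch below is only an illustration: `mongo_uri` is a hypothetical helper, not part of metadatastore, and it assumes the four option names listed above are also accepted as `mongodb://` URI options, which is how MongoDB connection strings spell them. The resulting string is what would be handed to MongoClient as its host argument.

```python
from urllib.parse import urlencode

# The timeout options quoted above; anything else is rejected.
ALLOWED_TIMEOUTS = {
    "socketTimeoutMS",
    "connectTimeoutMS",
    "serverSelectionTimeoutMS",
    "waitQueueTimeoutMS",
}


def mongo_uri(host="localhost", port=27017, **timeouts_ms):
    """Build a mongodb:// URI carrying per-deployment timeout overrides."""
    unknown = set(timeouts_ms) - ALLOWED_TIMEOUTS
    if unknown:
        raise ValueError("unknown timeout option(s): {}".format(sorted(unknown)))
    base = "mongodb://{}:{}/".format(host, port)
    if not timeouts_ms:
        # No overrides: the driver defaults apply.
        return base
    # Sort so the same overrides always produce the same URI.
    return base + "?" + urlencode(sorted(timeouts_ms.items()))


print(mongo_uri(serverSelectionTimeoutMS=5000))
# prints: mongodb://localhost:27017/?serverSelectionTimeoutMS=5000
```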

The only thing that I would want to do by default in metadatastore would be to catch the ServerSelectionTimeoutError somewhere and print out a message saying that metadatastore can't connect, and that whoever sees that error should check that mongodb is running on whatever server is configured to host it. Thoughts?
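A rough sketch of that catch-and-re-raise idea, written as a generic decorator so it runs without a database. `DatabaseUnreachableError` and `reraise_with_hint` are hypothetical names, not existing metadatastore API. In real use the exception type passed in would presumably be `pymongo.errors.ServerSelectionTimeoutError`, and the decorator would wrap functions such as `insert_run_start`.

```python
import functools


class DatabaseUnreachableError(RuntimeError):
    """Hypothetical exception carrying an operator-friendly message."""


def reraise_with_hint(exc_types, host, port):
    """Re-raise `exc_types` from the wrapped function with a hint about mongo.

    `exc_types` is an exception class or tuple of classes, as accepted by
    an `except` clause.
    """
    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exc_types as err:
                # `from err` keeps the original traceback attached as __cause__.
                raise DatabaseUnreachableError(
                    "metadatastore cannot connect to mongo at {}:{}. "
                    "Check that mongod is running on that server. "
                    "Original error: {}".format(host, port, err)
                ) from err
        return inner
    return decorator
```

Chaining with `from err` means the full pymongo traceback is still available for debugging, while the last line the user sees says what to check.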

tacaswell commented 8 years ago

The other behavior to worry about is what happens if the user has successfully connected (so we have a 'live' connection) but, between when it was connected and the next call, the mongo server falls over / is killed.

ericdill commented 8 years ago

> between when it was connected and the next call the mongo server falls over / is killed.

I guess there are a couple of things to consider in regards to what caused mongo to fall over.

1. The disk is full. IIRC we experienced this at HXN when their xspress3 detector server was full and filestore was barfing on weird errors.
2. The server where mongo is running fell over.
3. The network has totally crashed and nothing is accessible.

(1) is pretty easy to prevent if we have proper monitoring tools. I guess that is the responsibility of Petkus and crew. I'm not sure how we would automatically recover from such a situation.

(2) is preventable by having a redundant mongo service running somewhere so we don't lose data, but implementing redundant mongo services the right way sounds quite challenging.

(3) We are well and truly screwed and data collection cannot continue since EPICS isn't working either.

danielballan commented 7 years ago

No action for more than a year. Closing.

danielballan commented 7 years ago

(I will note that we have operationalized the mongo servers better, but that better monitoring is still needed. This is out of scope for the software though; really it's an IT concern.)

NSLS-II / metadatastore

ServerSelectionTimeoutError if no mongo service is running #202