Closed ericdill closed 7 years ago
How would we catch a timeout immediately?
Lol I don't mean try to catch a timeout immediately. I mean check to make sure we are connected before trying to insert
Absolutely! Unless u love getting calls at 2:45am that "bluesky crashed" :( catching stuff like this and providing concise specific messages is essential for providing an acceptable and sustainable level of support.
Would be awesome to have tests for such common subsystem problems and clear error messages in the test/fuzz/ci stuff. Ideally, hearing "bluesky crashed" would be exciting/interesting... cuz you'd only hear that when a never-seen-before problem occurred.
-matt
@tacaswell There are apparently a couple of different timeout options that are set by default when you instantiate a new MongoClient. They are:

- socketTimeoutMS: (integer or None) Controls how long (in milliseconds) the driver will wait for a response after sending an ordinary (non-monitoring) database operation before concluding that a network error has occurred. Defaults to None (no timeout).
- connectTimeoutMS: (integer or None) Controls how long (in milliseconds) the driver will wait during server monitoring when connecting a new socket to a server before concluding the server is unavailable. Defaults to 20000 (20 seconds).
- serverSelectionTimeoutMS: (integer) Controls how long (in milliseconds) the driver will wait to find an available, appropriate server to carry out a database operation; while it is waiting, multiple server monitoring operations may be carried out, each controlled by connectTimeoutMS. Defaults to 30000 (30 seconds).
- waitQueueTimeoutMS: (integer or None) How long (in milliseconds) a thread will wait for a socket from the pool if the pool has no free sockets. Defaults to None (no timeout).

That being said, mucking with these timeout values does not seem like a good idea. It is good to know that they exist so that we can modify them on the off chance that some beamline complains that their scan hangs for a while before it reports that it cannot insert any data. I am :-1: on mucking with these default values in the call to MongoClient in metadatastore.mds.MDS.
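
If a beamline ever does need a shorter wait, a per-deployment override might look roughly like the sketch below. The default values are the ones quoted from the MongoClient docstring above; `MONGO_TIMEOUT_DEFAULTS` and `timeout_kwargs` are hypothetical names made up for illustration and are not part of metadatastore or pymongo.

```python
# Defaults as quoted from the MongoClient docstring (milliseconds; None = no timeout).
MONGO_TIMEOUT_DEFAULTS = {
    "socketTimeoutMS": None,
    "connectTimeoutMS": 20000,
    "serverSelectionTimeoutMS": 30000,
    "waitQueueTimeoutMS": None,
}


def timeout_kwargs(**overrides):
    """Return the timeout options with any per-beamline overrides applied.

    Hypothetical helper: rejects option names that are not one of the
    four timeouts listed above, so a typo fails loudly.
    """
    unknown = set(overrides) - set(MONGO_TIMEOUT_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown timeout option(s): {sorted(unknown)}")
    return {**MONGO_TIMEOUT_DEFAULTS, **overrides}


# Example: give up on server selection after 5 seconds instead of 30.
# The result could then be passed along as
#     MongoClient(host, port, **opts)
opts = timeout_kwargs(serverSelectionTimeoutMS=5000)
```

Keeping the overrides in one place would leave the defaults in metadatastore.mds.MDS untouched unless a deployment explicitly asks for something else.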
The only thing that I would want to do by default in metadatastore would be to catch the TimeoutError somewhere and print out a message that metadatastore can't connect and that whoever sees that TimeoutError should check to make sure that their mongodb is running on whatever server is configured to do so. Thoughts?
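
A rough sketch of what that catch-and-reraise could look like. Everything here is an assumption for illustration: `MDSConnectionError` and `insert_with_clear_error` are hypothetical names, and the exception types to catch are passed in as a parameter. With pymongo, the exception raised when no server can be found is `pymongo.errors.ServerSelectionTimeoutError`, so that is probably what you would pass instead of the builtin `TimeoutError` used as the default here.

```python
class MDSConnectionError(RuntimeError):
    """Hypothetical error type: metadatastore could not reach mongo."""


def insert_with_clear_error(insert, doc, host, port,
                            timeout_errors=(TimeoutError,)):
    """Call insert(doc); turn a timeout into a message a human can act on.

    `insert` is any callable that performs the actual database insert.
    `timeout_errors` is the tuple of exception types treated as
    "could not connect".
    """
    try:
        return insert(doc)
    except timeout_errors as err:
        # Chain the original exception so the full traceback is still there.
        raise MDSConnectionError(
            f"metadatastore cannot connect to mongo at {host}:{port}. "
            "Check that mongodb is running on that server."
        ) from err
```

The original exception stays attached as `__cause__`, so nothing is lost for whoever has to debug it later.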
The other behavior to worry about is what happens if the user has successfully connected (so we have a 'live' connection) but between when it was connected and the next call the mongo server falls over / is killed.
> between when it was connected and the next call the mongo server falls over / is killed.
I guess there are a couple of things to consider in regards to what caused mongo to fall over.
(1) is pretty easy to prevent if we have proper monitoring tools. I guess that is the responsibility of Petkus and crew. I'm not sure how we would automatically recover from such a situation.
(2) is preventable by having a redundant mongo service running somewhere so we don't lose data, but implementing redundant mongo services the right way sounds quite challenging.
(3) We are well and truly screwed and data collection cannot continue since EPICS isn't working either.
No action for more than a year. Closing.
(I will note that we have operationalized the mongo servers better, but that better monitoring is still needed. This is out of scope for the software though; really it's an IT concern.)
@tacaswell suggests that we should catch this and reraise as something better. Could we detect this before it spins for ~60 seconds and raise immediately?
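
One possible way to fail fast instead of spinning for ~60 seconds is a plain TCP probe before the insert. This is a standard-library sketch and not what metadatastore or pymongo does; `mongo_port_is_open` is a hypothetical helper, and the host, port, and 1 second timeout are assumed values. It only tells you that something is listening on the port, not that mongo is healthy.

```python
import socket


def mongo_port_is_open(host="localhost", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out,
        # or failed name lookup) when the server cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A caller could check this first and raise a specific error right away when it returns False. It does not help with the case where the server dies after the check but before the insert.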