10gen / mongo-orchestration


Wait for daemon to fully terminate in "stop" #276

Closed patrickfreed closed 4 years ago

patrickfreed commented 4 years ago

Right now, mongo-orchestration stop on non-Windows platforms simply sends a signal to the mongo-orchestration daemon and exits immediately. In the background, the mongo child processes and the daemon itself will shut down after a brief period.

This PR updates the "stop" command to wait until the daemon has fully shut down before returning, effectively causing mongo-orchestration stop to block until both the daemon and the cluster are no longer running.

As part of this change, I added a dependency on psutil to facilitate polling whether the process is still alive or not. If adding a dependency is considered costly, that functionality could be reimplemented in mongo-orchestration. The library does seem like it could be useful in other areas of the codebase, however.
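
For illustration, a minimal sketch of the kind of polling psutil enables (the helper name, timeout, and how the daemon's PID is obtained are assumptions here, not the PR's actual code):

```python
# A minimal sketch, not the PR's implementation: poll with psutil until the
# daemon process is gone (or a timeout expires) so that
# "mongo-orchestration stop" only returns once shutdown has completed.
import time
import psutil

def wait_for_shutdown(pid, timeout=60.0, interval=0.25):
    deadline = time.monotonic() + timeout
    while psutil.pid_exists(pid):
        if time.monotonic() > deadline:
            raise RuntimeError("daemon (pid %d) did not exit in time" % pid)
        time.sleep(interval)
```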

I was having trouble running the full test suite (with or without my changes) due to test_replica_sets.ReplicaSetAuthTestCase hanging, so unfortunately I do not know if this patch passes completely. I did test it manually and it seems to work, though.

patrickfreed commented 4 years ago

Are you sure that's the case? The log and db files seem to stick around for me after I shut the cluster down. When I start up a new cluster the old ones are deleted, however.

ShaneHarvey commented 4 years ago

Yes, the order is: the daemon shuts down the cluster, deletes its log/db files, and then exits.

When I run mongo-orchestration stop, the log files are gone after the daemon exits.

patrickfreed commented 4 years ago

Hm, something must be messed up on my end. Well, given that, I don't think we can merge this patch, since it would seemingly break everyone's log storage. I think as you suggest we'll have to take an approach outside of mongo-orchestration.

I originally planned on doing this by adding a separate script to drivers-evergreen-tools that issues a `curl DELETE`; doing that wouldn't cause mo to delete the log/data files too, right?

patrickfreed commented 4 years ago

Also, what do you think about adding a `--preserveMongoData` option that would toggle that cleanup behavior?

ShaneHarvey commented 4 years ago

DELETE would also remove the log files. MO deletes the files anytime it shuts down a cluster:

$ head /var/folders/lm/b1r2f8p503xg40r6x2rqv7fr0000gp/T/mongo-n3d4Jr/mongod.log
2020-01-21T14:36:25.702-0800 I  CONTROL  [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
2020-01-21T14:36:25.703-0800 W  ASIO     [main] No TransportLayer configured during NetworkInterface startup
2020-01-21T14:36:25.708-0800 W  ASIO     [main] No TransportLayer configured during NetworkInterface startup
2020-01-21T14:36:25.709-0800 D1 NETWORK  [main] fd limit hard:524288 soft:262144 max conn: 209715
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] MongoDB starting : pid=35243 port=27017 dbpath=/var/folders/lm/b1r2f8p503xg40r6x2rqv7fr0000gp/T/mongo-n3d4Jr 64-bit host=Shanes-MacBook-Pro-2.local
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] db version v4.3.2-594-g24a8f59
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] git version: 24a8f59e21490f4dd101224ba01872d773c7f25e
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] allocator: system
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] modules: enterprise
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] build environment:
$ curl -XDELETE http://localhost:8889/servers/e82cbabf-e6e8-406a-be80-2e8ee9c0fcac
$ head /var/folders/lm/b1r2f8p503xg40r6x2rqv7fr0000gp/T/mongo-n3d4Jr/mongod.log
head: /var/folders/lm/b1r2f8p503xg40r6x2rqv7fr0000gp/T/mongo-n3d4Jr/mongod.log: No such file or directory

ShaneHarvey commented 4 years ago

I think a simpler solution might be to copy the files you're interested in before uploading them. Would that avoid the tar failures you're seeing?
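
A rough sketch of that copy-before-upload workaround (the helper name and use of a temp directory are assumptions; this is separate from what this PR changes):

```python
# Snapshot the log directory first, then tar the copies, so tar never reads
# files that mongod is still appending to.
import os
import shutil
import tempfile

def snapshot_logs(log_dir):
    # Copy into a fresh directory; the copies are static even if mongod keeps
    # writing to the originals.
    snapshot = os.path.join(tempfile.mkdtemp(prefix="mo-logs-"), "logs")
    shutil.copytree(log_dir, snapshot)
    return snapshot  # upload/tar this directory instead of log_dir
```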

> Well, given that, I don't think we can merge this patch, since it would seemingly break everyone's log storage

It wouldn't break Python's log upload task because we upload the logs before shutting down mongo-orchestration:

  - func: "upload mo artifacts"
  - func: "upload test results"
  - func: "stop mongo-orchestration"
mbroadst commented 4 years ago

@ShaneHarvey I think the whole problem is that those files might still be written to, and tar would error out because bytes were changing while it was trying to do its job. I think it's reasonable that we try to find a more comprehensive solution to this problem.

patrickfreed commented 4 years ago

Here's a patch confirming what @mbroadst said. I think most drivers run "upload-mo-artifacts" in post (which suppresses failures), so they don't see these errors. I was hoping that the logs being append-only while upload-mo-artifacts was running would prevent the issue, but that appears not to be the case.

ShaneHarvey commented 4 years ago

Yeah, I think it's fine not to clean up the files after shutting down the clusters in the cleanup_storage signal handler.

patrickfreed commented 4 years ago

So I discovered why the logs were apparently not being deleted on my machine:

In the code snippet you pasted above (cleanup_mprocess), it attempts to delete cfg["logPath"], but the log path is actually stored at cfg["logpath"], not in camel case (see here). By default, the log files are written to dbpath/mongod.log, so if you don't specify a log path it does usually get deleted. The configurations I use specify log paths, so the logs don't get deleted.
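
Roughly, the mismatch looks like this (a simplified stand-in for illustration, not the actual cleanup_mprocess code):

```python
# Simplified illustration of the bug: the cleanup step looks up "logPath",
# but the config stores a custom path under "logpath", so the lookup misses
# it and the log file survives shutdown.
import os
import tempfile

def cleanup_mprocess(cfg):
    log_path = cfg.get("logPath")  # wrong key; the data lives under "logpath"
    if log_path and os.path.exists(log_path):
        os.remove(log_path)

log_file = tempfile.NamedTemporaryFile(suffix=".log", delete=False)
cfg = {"logpath": log_file.name}  # how the path is actually stored
cleanup_mprocess(cfg)
print(os.path.exists(cfg["logpath"]))  # True: the custom log file was not deleted
```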

How about we make this behavior official and remove the attempted log path deletion from cleanup? The nice thing is that, since this doesn't change any behavior (the bug has been there for 7+ years), we know we won't break any Evergreen configs. Then I won't need to modify anything else and can just use the blocking stop changes already made thus far in combination with a custom log path.

ShaneHarvey commented 4 years ago

SGTM. Can you open a follow-up issue to not delete the default logpath (dbpath/mongod.log) on shutdown too?

ShaneHarvey commented 4 years ago

One note: I've noticed that shutting down clusters can take a long time, especially with newer server versions. If that proves to be the case in Evergreen, we may need to optimize stop to shut down the clusters forcefully with SIGTERM instead of using the shutdown command.
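
If that optimization ever becomes necessary, it could look roughly like this (a sketch that assumes the cluster members' PIDs are known; force_stop is a hypothetical helper, not an existing mongo-orchestration API):

```python
# Send SIGTERM to each member process instead of issuing the server-side
# shutdown command, then wait for the processes to exit.
import signal
import psutil

def force_stop(pids, timeout=30):
    procs = [psutil.Process(pid) for pid in pids if psutil.pid_exists(pid)]
    for proc in procs:
        proc.send_signal(signal.SIGTERM)
    # Block until every process exits or the timeout elapses.
    gone, alive = psutil.wait_procs(procs, timeout=timeout)
    return alive  # any processes that still have not terminated
```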

patrickfreed commented 4 years ago

I wasn't able to run the entire test suite locally due to similar issues found in #273, but I did run drivers-evergreen-tools and the Swift driver's Evergreen suite with this patch, and they passed. As you pointed out, the tasks that use "latest" sharded clusters can take up to 30s longer now that stop waits for shutdown, but in my experience that is usually only a small percentage of the total time a task takes to run. Do you think that's sufficient testing, or should I try some more drivers out?

ShaneHarvey commented 4 years ago

Yes I think that's fine. I opened https://github.com/10gen/mongo-orchestration/issues/278 to make mongo-orchestration stop faster.

patrickfreed commented 4 years ago

It looks like I don't have write access to this repository, so could you squash/merge it for me?