csirtgadgets / massive-octo-spice

DEPRECATED - USE v3 (bearded-avenger)
https://github.com/csirtgadgets/bearded-avenger-deploymentkit/wiki
GNU Lesser General Public License v3.0

CIF server won't start after upgrade #471

Closed coonsmatthew closed 7 years ago

coonsmatthew commented 7 years ago

Hello all,

I just upgraded our development CIF server from 2.00 RC14 to 2.00.06.

The upgrade seemed to go well; I didn't notice any errors during the upgrade process.

However, now that the upgrade has completed, 3 out of 4 CIF services will not start.

I checked the logs for the various CIF services, but I don't see anything that indicates an error. It's almost as if no logs are being generated at all.

When I run "cif --help", this is what I see:

user@server:~$ cif --help
Attempt to reload POSIX.pm aborted.
Compilation failed in require at /usr/local/lib/perl/5.18.2/DateTime.pm line 18.
BEGIN failed--compilation aborted at /usr/local/lib/perl/5.18.2/DateTime.pm line 18.
Compilation failed in require at /usr/local/bin/cif line 14.
BEGIN failed--compilation aborted at /usr/local/bin/cif line 14.

Any idea what I might have done wrong? I tried installing from both the master and develop branches, but I get the same error either way.

Elasticsearch is running, and I've tried restarting the service without a change in results.

coonsmatthew commented 7 years ago

Hmmm, so after a bit of additional investigation, it appears that DateTime.pm is trying to load the Perl POSIX module ("POSIX qw(floor)") and failing.

I've verified that the Perl version is the same on prod and dev.
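
Beyond the interpreter version, it may help to compare which module files each box actually loads. Below is a minimal sketch, written in Python only so it can be run the same way on prod and dev; it assumes `perl` is on the PATH, and the helper name `probe_perl_module` is made up for illustration. Each probe loads a module and prints the path Perl resolved it from, so a difference between the two servers (or a non-zero return code on dev) would point at the module install rather than at CIF itself.

```python
import shutil
import subprocess

# Perl one-liners: each loads a module and prints the file it was loaded from.
PROBES = {
    "POSIX": 'use POSIX qw(floor); print $INC{"POSIX.pm"};',
    "DateTime": 'use DateTime; print $INC{"DateTime.pm"}, " version ", DateTime->VERSION;',
}


def probe_perl_module(name):
    """Run one probe; return (returncode, stdout, stderr), or None if perl is missing."""
    perl = shutil.which("perl")
    if perl is None:
        return None
    # Arguments are passed as a list, so no shell quoting is involved.
    result = subprocess.run(
        [perl, "-e", PROBES[name]], capture_output=True, text=True
    )
    return result.returncode, result.stdout.strip(), result.stderr.strip()


if __name__ == "__main__":
    for name in PROBES:
        print(name, probe_perl_module(name))
```

If the POSIX probe succeeds on its own but the DateTime probe fails with the same "Attempt to reload POSIX.pm aborted" message, the DateTime install under /usr/local/lib/perl is the more likely suspect.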

Interestingly, if I run sudo cif -p, it seems that I can reach the web server, although it answers with a 503:

sudo cif -p
[2017-02-01T11:00:02,157Z][WARN]: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
</body></html>
[2017-02-01T11:00:02,158Z][FATAL]: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
</body></html>

I'm trying to determine if the issue is server or CIF related. The server was recently upgraded, and I'm unsure if the server upgrade broke CIF.

wesyoung commented 7 years ago

hrm, that's an odd one. i do remember seeing some odd things happen with the DateTime perl modules in the past (i even think we started hard coding some of the versions after the RC phase).

once you start running into system dep issues like that (esp with perl), they're really hard to debug (one of the reasons we've moved to python for v3). one of the ways we've gotten around it is to containerize these parts separately from the elasticsearch service so we can just roll new installs each time we deploy updates (we're using AWS for production).

my sense is, if you're getting POSIX type errors, some lower level stuff might be out of whack (not sure which upgrade might have caused it), and it might make sense to just back up your ES data and rebuild the server with the latest release(?) (backing up ES data is pretty easy, just tar up the directories, etc).
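
For the "tar up the directories" step, here is a rough sketch using Python's standard `tarfile` module. The function name `backup_directory` and the paths in the usage note are assumptions, not anything CIF ships; check where your Elasticsearch install keeps its data, and stop the Elasticsearch service first so the files are not changing while they are archived.

```python
import tarfile
import time
from pathlib import Path


def backup_directory(data_dir, dest_dir):
    """Write a timestamped .tar.gz of data_dir into dest_dir and return its path."""
    data_dir = Path(data_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%dT%H%M%S")
    archive = dest_dir / f"{data_dir.name}-{stamp}.tar.gz"

    # arcname keeps paths inside the archive relative to the directory name,
    # so the backup can be unpacked anywhere on the rebuilt server.
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(data_dir), arcname=data_dir.name)
    return archive
```

Usage would look something like `backup_directory("/var/lib/elasticsearch", "/root/es-backups")`, where both paths are placeholders for your own layout.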

the alternative may be to give CIFv3 a trial run on a small vm to see if the SQLite version suits your needs as we get ready to enter the beta stages. much easier to debug, and much less of a chance that these kinds of upgrades will break the system (and it's at least easier to rebuild if/when they do).

does that make sense?

coonsmatthew commented 7 years ago

That does make sense. Thanks Wes. I can rebuild and test again.

Is the idea to stay with SQLite in v3 or to eventually move to Elasticsearch? We use our CIFv2 production server heavily, and I want to be sure we won't be constrained by SQLite's limitations compared with Elasticsearch.

Thanks.

wesyoung commented 7 years ago

both options are there; we've put a ton of initial work into the SQLite version (making it fast and easy for most people to use). the elasticsearch parts are there, and they do work, but we have a ways to go as far as performance is concerned (not bad, just a ways to go).

v3 is architected a bit differently (everything is aggregated by default, so much less data, easier to deploy, blow away, re-init, etc), and i'd actually be surprised if you can knock over the sqlite version... (would be a good use case) 'cause we've added even more feeds than you'll find in v2 and are trying to knock it over ourselves.. :)

(sqlite is stupid fast when you do it right, it just gets funky when you go multi-user..).

coonsmatthew commented 7 years ago

I will check it out! This may be the wrong place for this discussion...but I'm curious how CIFv3 deals with concurrent/overlapping queries...does the python application just queue the requests and feed them one at a time to SQLite?

wesyoung commented 7 years ago

the underlying architecture is all ZeroMQ; effectively cif-router and cif-store are two separate brokers within the framework doing as you suggest. cif-router does its part by routing the requests to cif-store, where the zmq layer handles / queues them and then sends them back to the router when they're ready. the router just takes the replies (that contain an 'original clientid') and routes them back to whatever client submitted the query..

make sense? lots of zmq in the background, which is how we get away with sqlite in a semi multi-user env.
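
To make the queueing idea concrete, here is a toy sketch of the same pattern. It is not the cif-router / cif-store code and it swaps ZeroMQ for Python's standard `queue` and `threading` modules, so everything in it (function names, table schema, client ids) is hypothetical. The point is only the shape: one worker owns the SQLite connection and handles requests one at a time, and a router uses the client id carried on each reply to hand results back to whoever asked.

```python
import queue
import sqlite3
import threading


def store_worker(requests, replies):
    """Single owner of the SQLite connection; handles one request at a time."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema and rows, only so the demo has something to query.
    conn.execute("CREATE TABLE indicators (indicator TEXT, tags TEXT)")
    conn.executemany(
        "INSERT INTO indicators VALUES (?, ?)",
        [("example.com", "malware"), ("192.0.2.1", "scanner")],
    )
    while True:
        item = requests.get()
        if item is None:  # shutdown signal
            break
        client_id, sql, params = item
        rows = conn.execute(sql, params).fetchall()
        # The reply carries the original client id so it can be routed back.
        replies.put((client_id, rows))
    conn.close()


def router(replies, inboxes):
    """Deliver each reply to the inbox of the client that submitted the query."""
    while True:
        item = replies.get()
        if item is None:  # shutdown signal
            break
        client_id, rows = item
        inboxes[client_id].put(rows)


def run_demo():
    requests, replies = queue.Queue(), queue.Queue()
    inboxes = {"client-a": queue.Queue(), "client-b": queue.Queue()}
    store = threading.Thread(target=store_worker, args=(requests, replies))
    route = threading.Thread(target=router, args=(replies, inboxes))
    store.start()
    route.start()

    # Two clients submit overlapping queries; the store serializes them.
    sql = "SELECT indicator FROM indicators WHERE tags = ?"
    requests.put(("client-a", sql, ("malware",)))
    requests.put(("client-b", sql, ("scanner",)))
    results = {cid: inbox.get(timeout=5) for cid, inbox in inboxes.items()}

    # Stop the store first, then the router, so no reply is dropped.
    requests.put(None)
    store.join()
    replies.put(None)
    route.join()
    return results


if __name__ == "__main__":
    print(run_demo())
```

Because only one thread ever touches the connection, SQLite never sees concurrent access; the queues absorb the overlap between clients.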