fabric8-analytics / fabric8-analytics-common

fabric8-analytics core common development
Apache License 2.0

data-model-importer API fails to run via docker compose #19

Closed tuxdna closed 7 years ago

tuxdna commented 7 years ago

Start the services using Docker Compose as below:

$ ./docker-compose.sh up

Failure output:

data-model-importer_1   | [2017-05-29 06:59:29 +0000] [6] [INFO] Starting gunicorn 19.7.1
data-model-importer_1   | [2017-05-29 06:59:29 +0000] [6] [INFO] Listening at: http://0.0.0.0:9192 (6)
data-model-importer_1   | [2017-05-29 06:59:29 +0000] [6] [INFO] Using worker: sync
data-model-importer_1   | [2017-05-29 06:59:29 +0000] [11] [INFO] Booting worker with pid: 11
data-model-importer_1   | [2017-05-29 06:59:33 +0000] [11] [ERROR] Exception in worker process
data-model-importer_1   | Traceback (most recent call last):
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/gunicorn/arbiter.py", line 578, in spawn_worker
data-model-importer_1   |     worker.init_process()
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/gunicorn/workers/base.py", line 126, in init_process
data-model-importer_1   |     self.load_wsgi()
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/gunicorn/workers/base.py", line 135, in load_wsgi
data-model-importer_1   |     self.wsgi = self.app.wsgi()
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
data-model-importer_1   |     self.callable = self.load()
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/gunicorn/app/wsgiapp.py", line 65, in load
data-model-importer_1   |     return self.load_wsgiapp()
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/gunicorn/app/wsgiapp.py", line 52, in load_wsgiapp
data-model-importer_1   |     return util.import_app(self.app_uri)
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/gunicorn/util.py", line 352, in import_app
data-model-importer_1   |     __import__(module)
data-model-importer_1   |   File "/src/rest_api.py", line 23, in <module>
data-model-importer_1   |     if not BayesianGraph.is_index_created():
data-model-importer_1   |   File "/src/graph_manager.py", line 76, in is_index_created
data-model-importer_1   |     status, json_result = cls.execute(str_gremlin_dsl)
data-model-importer_1   |   File "/src/graph_manager.py", line 48, in execute
data-model-importer_1   |     data=json.dumps(payload))
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/requests/api.py", line 112, in post
data-model-importer_1   |     return request('post', url, data=data, json=json, **kwargs)
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/requests/api.py", line 58, in request
data-model-importer_1   |     return session.request(method=method, url=url, **kwargs)
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 518, in request
data-model-importer_1   |     resp = self.send(prep, **send_kwargs)
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 639, in send
data-model-importer_1   |     r = adapter.send(request, **kwargs)
data-model-importer_1   |   File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 502, in send
data-model-importer_1   |     raise ConnectionError(e, request=request)
data-model-importer_1   | ConnectionError: HTTPConnectionPool(host='bayesian-gremlin-http', port=8182): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x26b7510>: Failed to establish a new connection: [Errno 111] Connection refused',))

The issue is that data-model-importer depends on gremlin-http being up. By the time gremlin-http starts, the importer has already failed trying to connect.

We could use some sort of delay mechanism to hold off starting data-model-importer until gremlin-http is up (see the sketch at the end of this comment).

Related thread and SO post:
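
For illustration, such a delay could be a small wait loop at the top of the importer's entrypoint. This is only a rough sketch: the host and port are taken from the error above, and the availability of nc in the image is an assumption.

    #!/bin/sh
    # Hypothetical wait loop for the importer entrypoint (illustration only).
    HOST=bayesian-gremlin-http
    PORT=8182

    # Block until gremlin-http accepts TCP connections, or give up after ~60s.
    for i in $(seq 1 30); do
        nc -z "$HOST" "$PORT" && break
        echo "waiting for $HOST:$PORT ($i/30)..."
        sleep 2
    done

    # ...then start the API exactly as the entrypoint already does today.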

fridex commented 7 years ago

As we don't want to maintain dependencies in docker-compose or hard-coded delays, let's bring the local setup closer to the deployment, where OpenShift transparently restarts services when they fail to come up.

Just add a restart policy to the docker-compose file:

    restart: always
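
For context, the key sits directly under the affected service definition in docker-compose.yml. The service name below follows the container name seen later in this thread (common_data-model-importer_1); the rest of the entry is elided.

    data-model-importer:
      # ...existing image, ports, links, etc. stay as they are
      restart: always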

Please open a PR for this.

Thanks!

tuxdna commented 7 years ago

@fridex It seems to me that the container isn't dying, but the API inside the running container is retrying a few times:

$ ./docker-compose.sh ps
+ docker-compose -f docker-compose.yml -f docker-compose.devel.yml ps
            Name                          Command               State                                    Ports                                 
----------------------------------------------------------------------------------------------------------------------------------------------
anitya-server                  /bin/sh -c /src/runinconta ...   Up       0.0.0.0:31005->5000/tcp                                               
bayesian-gremlin-http          /bin/entrypoint-local.sh         Up       0.0.0.0:8181->8182/tcp                                                
common_cvedb-s3-dump_1         /usr/local/bin/cvedb-s3-du ...   Exit 0                                                                         
common_data-model-importer_1   /bin/entrypoint.sh               Up       0.0.0.0:9192->9192/tcp                                                
common_worker-api_1            /usr/bin/workers.sh              Up                                                                             
common_worker-ingestion_1      /usr/bin/workers.sh              Up                                                                             
coreapi-broker                 /tmp/rabbitmq/run-rabbitmq ...   Up       0.0.0.0:15672->15672/tcp, 25672/tcp, 4369/tcp, 0.0.0.0:5672->5672/tcp 
coreapi-jobs                   /usr/bin/run_jobs.sh             Up       0.0.0.0:34000->34000/tcp                                              
coreapi-pgbouncer              /bin/sh -c /usr/bin/run-pg ...   Up       0.0.0.0:5432->5432/tcp                                                
coreapi-postgres               container-entrypoint run-p ...   Up       0.0.0.0:6432->5432/tcp                                                
coreapi-s3                     /usr/bin/docker-entrypoint ...   Up       0.0.0.0:33000->33000/tcp, 9000/tcp                                    
coreapi-server                 /usr/bin/coreapi-server.sh       Up       0.0.0.0:32000->5000/tcp                                               
coreapi-worker-db-migrations   /alembic/run-db-migrations.sh    Exit 0                                                                         
dynamodb                       entrypoint -sharedDb             Up       0.0.0.0:4567->4567/tcp, 0.0.0.0:8000->8000/tcp    

Do you think restart: always will work for this?

fridex commented 7 years ago

@fridex It seems to me that the container isn't dying, but the API inside the running container is retrying a few times:

OK, so what is the issue here? If it successfully retries after some time, things should work as expected, right?

Do you think restart: always will work for this?

The restart policy is on the Docker layer. If the container does not exit, it will not affect the current behaviour at all.

tuxdna commented 7 years ago

@fridex It is the gunicorn [1] process that is restarting the data-model-importer API inside the container.

From the logs [2] I can observe that the data model importer attempts to connect to gremlin-http a few times, after which there are no more errors.

[1] https://github.com/fabric8-analytics/fabric8-analytics-data-model/blob/master/scripts/entrypoint.sh#L4
[2] https://paste.fedoraproject.org/paste/2KmUJBY1CG4~7Qc0O~g8gF5M1UNdIGYhyRLivL9gydE=
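
In effect this amounts to something that keeps re-running the gunicorn server until its workers manage to boot. A minimal sketch of such a retry wrapper, assuming it lives in the entrypoint (the actual scripts/entrypoint.sh may do this differently, and the app callable name is an assumption), could look like:

    # Keep retrying until gunicorn boots successfully; the port and module
    # names are taken from the log output earlier in this issue.
    until gunicorn --pythonpath /src -b 0.0.0.0:9192 rest_api:app; do
        echo "importer failed to start, retrying in 5s..."
        sleep 5
    done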

tuxdna commented 7 years ago

And yes, the readiness probe also succeeds after waiting for a while.

fridex commented 7 years ago

From the logs [2] I can observe that the data model importer attempts to connect to gremlin-http a few times, after which there are no more errors.

OK, so I would say we can close this as not-a-bug.

tuxdna commented 7 years ago

Yes, functionally this works. However, it would be good to avoid the errors generated during the first few attempts.

From a data integrity perspective, what happens if an analysis is invoked and the data-model-importer API is not yet up (due to the failure discussed in this issue)? I understand that all the workers will run and the data will be put into S3 (minio), after which the data is put into the graph. But this last step will fail because the data-model-importer API is not up yet. Will the ingestion into data-model-importer be attempted again from the same analysis after a while?

fridex commented 7 years ago

Yes, functionally this works. However, it would be good to avoid the errors generated during the first few attempts.

You can see such errors for other services as well.

From a data integrity perspective, what happens if an analysis is invoked and the data-model-importer API is not yet up (due to the failure discussed in this issue)? I understand that all the workers will run and the data will be put into S3 (minio), after which the data is put into the graph. But this last step will fail because the data-model-importer API is not up yet. Will the ingestion into data-model-importer be attempted again from the same analysis after a while?

No, the ingestion will not be re-scheduled.

Note that this is a setup for local development. It is your responsibility to have the system up when you want to test something. Adding delays to the data-importer service will not solve your issue: analyses can still be run and the results won't be synced through data-importer anyway. Moreover, having the error message there helps you know when the whole system is up.

Anyway, there is a plan to remove the data-importer service and instead do the syncs with a Selinon task after each analysis. There is no point in spending time on this.

tuxdna commented 7 years ago

In that case we can close this issue.