Monitor and Restore server after Redis fails

AcckiyGerman commented 6 years ago

After pushing huge amounts of data, or pushing several datasets simultaneously, the Redis server can hang, which makes all data processing impossible. Last time we only noticed this 3 hours after the incident (see https://github.com/datahq/datahub-qa/issues/79).

Acceptance criteria

[ ] ~~there is a monitoring system that checks whether the server and all related daemons are running well~~
- Redis
- web-server
- ... what else?
[ ] ~~monitoring system notifies administrator if something crashes~~
[ ] ~~(optional) monitoring system restores failed daemons or docker instances or whatever~~
[x] Factory workflow is not stuck because of Redis
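
The struck-out monitoring criteria above could be sketched roughly as follows. This is only an illustration, not what was deployed: the service names, hosts and ports in `SERVICES` are placeholders, and `tcp_probe`, `find_failed` and `check_and_notify` are hypothetical helper names. Note that a TCP connect only shows that the port is open; a hung Redis may still accept connections, so a real check would probably need an application-level probe (for Redis, a `PING`).

```python
import socket
from typing import Callable, Dict, List, Tuple

# Placeholder service list: hosts and ports are assumptions,
# not the real datahub.io deployment.
SERVICES: Dict[str, Tuple[str, int]] = {
    "redis": ("127.0.0.1", 6379),
    "web-server": ("127.0.0.1", 80),
}


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure
        return False


def find_failed(
    services: Dict[str, Tuple[str, int]],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> List[str]:
    """Return the sorted names of services whose probe fails."""
    return sorted(
        name for name, (host, port) in services.items() if not probe(host, port)
    )


def check_and_notify(
    services: Dict[str, Tuple[str, int]],
    notify: Callable[[str], None],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> List[str]:
    """Probe every service and call notify once if anything is down."""
    failed = find_failed(services, probe)
    if failed:
        notify("Services down: " + ", ".join(failed))
    return failed
```

Running `check_and_notify(SERVICES, notify=print)` from cron every minute would cover the "notifies administrator" item in its simplest form; `notify` could be swapped for anything that sends an email or chat message.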

Tasks

[ ] ~~choose the proper monitoring system~~
[ ] ~~implement monitoring~~
[x] refactor so that we don't need redis

Tests:

[ ] try to break the server with
- DDoS (many simultaneous requests to dataset pages)
- pushing HUGE data in several threads
[ ] get a proper message from monitoring
[ ] server should be restored automatically
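
For the "many simultaneous requests" test, a rough sketch could look like the following. It assumes a staging URL that you own; `fetch_status` and `hammer` are hypothetical helper names and the request counts are arbitrary. It is not a real load-testing tool, only enough to see whether the server starts returning errors under concurrent requests.

```python
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_status(url: str, timeout: float = 10.0) -> str:
    """Fetch url once; return the HTTP status code or the error class name."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return str(resp.status)
    except urllib.error.HTTPError as err:  # 4xx / 5xx responses
        return str(err.code)
    except OSError as err:  # URLError, timeouts, connection resets
        return type(err).__name__


def hammer(
    url: str,
    total: int = 100,
    workers: int = 10,
    fetch: Callable[[str], str] = fetch_status,
) -> Counter:
    """Send `total` requests to url from `workers` threads and tally outcomes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(fetch, [url] * total))
```

Calling `hammer(url)` against a dataset page on a test instance returns a tally of status codes and error names, so a run that ends with mostly errors or timeouts is the point where the monitoring message from the next checkbox should appear.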

Analysis

zelima commented 6 years ago

We got rid of the Redis service entirely and integrated the data factory into the flowmanager.

AcckiyGerman commented 6 years ago

@zelima do we have some kind of monitoring for the other parts of the system? How do we know that, say, we're out of memory?

zelima commented 6 years ago

@AcckiyGerman we used to have Datadog, but it has been terminated now that the trial period is over

zelima commented 6 years ago

@AcckiyGerman tip: tick the appropriate checkboxes in the issue description when closing, or add a comment if not ticked

AcckiyGerman commented 6 years ago

@zelima but it was you who closed the issue ;) and, by the way, the acceptance criteria are not met.

zelima commented 6 years ago

@AcckiyGerman correct - tip for myself :) Updated description appropriately

AcckiyGerman commented 6 years ago

@zelima I'd like to reopen this issue (and rename it to "Monitor and Restore datahub.io server Modules (web, pipeline, specstore, etc)").

Because, as I see it, we still don't know whether the server is running OK or not until somebody tries to load a page and fails.

AcckiyGerman commented 6 years ago

FIXED: This issue was about Redis problems. Open issue about general server problems here: https://github.com/datahq/pm/issues/122
