MONROE-PROJECT / Maintenance

MONROE Maintenance procedures, and mostly an issue tracker.

[Node 45] Tracking Issue #38

Closed jonakarl closed 7 years ago

jonakarl commented 8 years ago

Report in the comments when stability tests have been performed and what errors were discovered (attach error logs). When the node is deemed stable, this issue will be closed.

jonakarl commented 8 years ago

No metadata on the ZMQ bus (tested in a container). Tried to restart docker, but it was utterly slow and unresponsive (the containers were still running at what felt like full speed). Had no further time to debug, but after a node reboot the containers receive metadata again.

kristrev commented 8 years ago

Please confirm whether metadata is available outside the container. If it is not available outside the container, I will have a look. If it is available, the error is not in the core components and is outside my domain.

relet commented 8 years ago

Worksforme.

# ip netns exec monroe python sub.py
MONROE.META.DEVICE.MODEM.8946071512360089472.SIGNAL {"SequenceNumber":18054,"Timestamp":1475776989,"DataVersion":1,"DataId":"MONROE.META.DEVICE.MODEM","ICCID":"8946071512360089472","IMSI":"240020008838957","IMEI":"864154023642036","Operator":"3 SE","InternalInterface":"op2","IPAddress":"2.67.49.251","InternalIPAddress":"192.168.199.143","InterfaceName":"usb1","IMSIMCCMNC":24002,"NWMCCMNC":24002,"LAC":11700,"CID":121651732,"RSRP":-90,"Frequency":800,"RSSI":-72,"RSRQ":-8,"DeviceMode":5,"Band":20,"DeviceState":3}
MONROE.META.DEVICE.MODEM.89460850007006922138.SIGNAL {"SequenceNumber":18055,"Timestamp":1475776989,"DataVersion":1,"DataId":"MONROE.META.DEVICE.MODEM","ICCID":"89460850007006922138","IMSI":"240084710198867","IMEI":"356853051640128","Operator":"Telenor SE","IPAddress":"46.194.122.58","InterfaceName":"wwan0","IMSIMCCMNC":24008,"NWMCCMNC":24008,"LAC":65535,"CID":28717829,"RSRP":-105,"Frequency":2600,"RSSI":-75,"RSRQ":-10,"DeviceMode":5,"Band":7,"DeviceState":3}
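
For anyone comparing output inside and outside the container, a minimal sketch of splitting one such line into topic and payload follows. It assumes the format seen above (a topic, a single space, then a JSON object); `parse_metadata_line` is a hypothetical helper, not part of sub.py, and the sample is a shortened copy of the first line.

```python
import json


def parse_metadata_line(line):
    """Split one 'TOPIC JSON' line into (topic, payload dict).

    Assumes the topic contains no spaces, so the first space
    separates it from the JSON body.
    """
    topic, _, raw = line.partition(" ")
    return topic, json.loads(raw)


# Shortened copy of the first line printed above.
sample = (
    'MONROE.META.DEVICE.MODEM.8946071512360089472.SIGNAL '
    '{"SequenceNumber":18054,"Timestamp":1475776989,"Operator":"3 SE",'
    '"InterfaceName":"usb1","RSRP":-90,"RSSI":-72}'
)

topic, payload = parse_metadata_line(sample)
print(topic, payload["Operator"], payload["InterfaceName"], payload["RSSI"])
```

Splitting on the first space only matters here because values such as "3 SE" contain spaces themselves.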

Restarting docker is always slow and unresponsive unless you stop the containers first. Try docker stop -t 0 $(docker ps -q) before restarting the service; otherwise it will wait for your containers to read their signals.
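
A rough sketch of that order of operations, in case someone wants to script it. The `systemctl restart docker` step is an assumption about how the service is managed on the node, and both function names are made up for illustration. The example only prints the commands; nothing is executed unless you pass them to `subprocess.run` yourself.

```python
import subprocess


def running_container_ids():
    """Return the IDs printed by `docker ps -q` (requires docker on PATH)."""
    result = subprocess.run(
        ["docker", "ps", "-q"], check=True, capture_output=True, text=True
    )
    return result.stdout.split()


def stop_then_restart_commands(container_ids):
    """Build the commands to run, in order, as argument lists."""
    commands = []
    if container_ids:
        # -t 0: zero-second grace period before the containers are killed.
        commands.append(["docker", "stop", "-t", "0", *container_ids])
    # Assumption: docker runs as a systemd unit named "docker".
    commands.append(["systemctl", "restart", "docker"])
    return commands


# Dry run with a made-up container ID list; prints the commands only.
for cmd in stop_then_restart_commands(["879e552dbd1a"]):
    print(" ".join(cmd))
```

To run it for real, feed `running_container_ids()` into `stop_then_restart_commands` and call `subprocess.run(cmd, check=True)` on each entry.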

This is a development node running the tunnelbox container and probably should not be deployed anyway. In this case you're free to do whatever you want to make your development node available; however, restarting services and the node also hides the underlying issue.

jonakarl commented 8 years ago

Docker stopped working approximately yesterday (Mohammed sent me an email 10 hours ago saying that he could not log on to the tunnel container). I logged on to the node now and there are no containers running (docker ps). I could also see in marvind.log "Not deploying in maintenance mode."

As usual, ansible was munching CPU and memory, but I could also see a "copy" process for a short while. Nothing else looked suspicious, i.e. the disk is not full and no other processes are eating CPU and memory.

I will not restart the node until next week so you have time to fix what needs to be fixed.

relet commented 8 years ago

Docker failed with the following:

Oct 12 07:05:20 Monroe000db94008cc docker[1536]: time="2016-10-12T07:05:20.816295835+02:00" level=error msg="Handler for DELETE /v1.24/containers/879e552dbd1a returned error: Driver devicemapper failed to remove root filesystem 879e552dbd1a8347473c48ac704d063fe47728d4e16913bd73ca88ad6ac52d27: Device is Busy"

This is not automatically recoverable, so the node went into maintenance mode (stopping docker containers and scheduling tasks). The proper fix happens when we move to boot OS, and use the production-level devicemapper-lvm driver for docker. I will try to reboot, or delete the docker disk and reboot.
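
To check other nodes' logs for the same failure, something like the following could be used. It is only a sketch: the pattern is derived from the single log line quoted in this comment, so other docker versions or storage drivers may word the error differently, and `find_busy_removal` is a hypothetical name.

```python
import re

# Pattern taken from the one log line quoted in this thread.
BUSY_RE = re.compile(
    r"Driver (?P<driver>\S+) failed to remove root filesystem "
    r"(?P<container>[0-9a-f]+): Device is Busy"
)


def find_busy_removal(log_line):
    """Return (driver, container_id) if the line reports the error, else None."""
    match = BUSY_RE.search(log_line)
    if match is None:
        return None
    return match.group("driver"), match.group("container")


line = (
    'Oct 12 07:05:20 Monroe000db94008cc docker[1536]: '
    'time="2016-10-12T07:05:20.816295835+02:00" level=error '
    'msg="Handler for DELETE /v1.24/containers/879e552dbd1a returned error: '
    'Driver devicemapper failed to remove root filesystem '
    '879e552dbd1a8347473c48ac704d063fe47728d4e16913bd73ca88ad6ac52d27: '
    'Device is Busy"'
)

print(find_busy_removal(line))
```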

relet commented 8 years ago

This is an issue that apparently affected all nodes. I have most, if not all of them in maintenance mode.

relet commented 8 years ago

It recovers on reboot. I will see if I can find a gentler solution.