anyproto / any-sync-dockercompose

docker-compose for testing any-sync
MIT License

Millions of DNS requests looking up `mongo-1` #54

Closed: greyivy closed this issue 2 weeks ago

greyivy commented 1 month ago


Current Behavior

After the stack has been running for a while, it starts making millions of DNS requests to my main DNS server, and the logs fill up with the following. My AnyType client still appears to be fully functional.

any-sync-node-3-1         | 2024-05-15T22:50:55.564Z    WARN    node.nodespace.spacedeleter periodic call error {"error": "server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: mongo-1:27001, Type: Unknown, Last error: dial tcp: lookup mongo-1 on 127.0.0.11:53: no such host }, ] }", "errorVerbose": "server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: mongo-1:27001, Type: Unknown, Last error: dial tcp: lookup mongo-1 on 127.0.0.11:53: no such host }, ] }\n\tstorj.io/drpc/drpcwire.UnmarshalError:26\n\tstorj.io/drpc/drpcstream.(*Stream).HandlePacket:230\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:250"}
any-sync-node-2-1         | 2024-05-15T22:50:56.135Z    ERROR   node.nodespace.spacedeleter deletion process failed {"error": "server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: mongo-1:27001, Type: Unknown, Last error: dial tcp: lookup mongo-1 on 127.0.0.11:53: no such host }, ] }", "errorVerbose": "server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: mongo-1:27001, Type: Unknown, Last error: dial tcp: lookup mongo-1 on 127.0.0.11:53: no such host }, ] }\n\tstorj.io/drpc/drpcwire.UnmarshalError:26\n\tstorj.io/drpc/drpcstream.(*Stream).HandlePacket:230\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:250"}
any-sync-coordinator-1    | 2024-05-15T22:50:56.134Z    INFO    rpcLog      {"totalDur": 30.000558081, "addr": "yamux://172.19.0.8:40300", "error": "server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: mongo-1:27001, Type: Unknown, Last error: dial tcp: lookup mongo-1 on 127.0.0.11:53: no such host }, ] }", "app": "v0.3.25", "peerId": [redacted], "identity": "[redacted], "rpc": "coordinator.deletionLog", "peerVersion": "any-sync-node:v0.3.32/any-sync:v0.4.13"}
any-sync-node-2-1         | 2024-05-15T22:50:56.135Z    WARN    node.nodespace.spacedeleter periodic call error {"error": "server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: mongo-1:27001, Type: Unknown, Last error: dial tcp: lookup mongo-1 on 127.0.0.11:53: no such host }, ] }", "errorVerbose": "server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: mongo-1:27001, Type: Unknown, Last error: dial tcp: lookup mongo-1 on 127.0.0.11:53: no such host }, ] }\n\tstorj.io/drpc/drpcwire.UnmarshalError:26\n\tstorj.io/drpc/drpcstream.(*Stream).HandlePacket:230\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:250"}
any-sync-coordinator-1    | 2024-05-15T22:50:57.948Z    INFO    coordinator.acl GC: removed 0; cache size: 0
any-sync-filenode-1       | 2024-05-15T22:50:58.511Z    WARN    filenode.deletionLog    periodic call error {"error": "server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: mongo-1:27001, Type: Unknown, Last error: dial tcp: lookup mongo-1 on 127.0.0.11:53: no such host }, ] }", "errorVerbose": "server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: mongo-1:27001, Type: Unknown, Last error: dial tcp: lookup mongo-1 on 127.0.0.11:53: no such host }, ] }\n\tstorj.io/drpc/drpcwire.UnmarshalError:26\n\tstorj.io/drpc/drpcstream.(*Stream).HandlePacket:230\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:250"}
any-sync-coordinator-1    | 2024-05-15T22:50:58.510Z    INFO    rpcLog      {"totalDur": 30.000372531, "addr": "yamux://172.19.0.3:56196", "error": "server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: mongo-1:27001, Type: Unknown, Last error: dial tcp: lookup mongo-1 on 127.0.0.11:53: no such host }, ] }", "app": "v0.3.25", "peerId": [redacted], "identity": [redacted], "rpc": "coordinator.deletionLog", "peerVersion": "any-sync-filenode:v0.6.5/any-sync:v0.4.10"}

Expected Behavior

The above error does not occur, or, if it does, it does not generate millions of DNS requests.

Steps To Reproduce

docker-compose.override.yml:

services:
  mongo-1:
    volumes:
      - ./storage/mongo-1/:/data/db
      - /root/anytype-backup-mongo/:/data/backup
    healthcheck:
      test: echo "rs.initiate({_id:'${MONGO_REPLICA_SET}',members:[{_id:0,host:\"mongo-1:${MONGO_1_PORT}\"}]}).ok || rs.status().ok" | mongo --port ${MONGO_1_PORT} --quiet
      interval: 900s

Note: I changed the healthcheck interval in hopes that it would solve this issue, but it did not help.

Environment

- OS: Linux 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

Anything else?

No response

fb929 commented 1 month ago

I cannot reproduce this. Can you determine which daemons (or which containers) are generating this DNS traffic?
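
For reference, one possible way to attribute the lookups (nothing here is part of this repo; the nsenter approach is just an illustration): since each container resolves through Docker's embedded DNS at 127.0.0.11, capturing inside a suspect container's network namespace from the host shows which service is busy:

# Get the host PID of a suspect container, e.g. the consensus node:
PID=$(docker inspect -f '{{.State.Pid}}' $(docker compose ps -q any-sync-consensusnode))

# Capture its queries to the embedded resolver (127.0.0.11) using the host's tcpdump:
nsenter -t "$PID" -n tcpdump -ni lo udp port 53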

greyivy commented 1 month ago

It seems to be any-sync-consensusnode: the DNS traffic stops when I stop the consensus node, and stopping any other container doesn't seem to make a difference.

Interestingly, restarting the container alone doesn't seem to fix the problem. I have to run make restart.
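
For concreteness, the two operations compared here look roughly like this (assuming make restart recreates the whole compose project; the actual Makefile target may differ):

# Restarting only the consensus node (did not clear the DNS storm):
docker compose restart any-sync-consensusnode

# Recreating the whole stack, roughly what make restart is assumed to do:
docker compose down && docker compose up -d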

fb929 commented 1 month ago

Can you show me the logs of mongo-1 and any-sync-consensusnode at the moment of the problem?

docker-compose logs mongo-1
docker-compose logs any-sync-consensusnode

and also the output of this command:

docker compose exec any-sync-consensusnode getent hosts mongo-1

greyivy commented 3 weeks ago

Sorry for the delay on this. It turns out that the MongoDB instance is actually shutting down and not restarting. I'm not sure what's sending it the signal to shut down though. Here's the log:

{"log":"{\"t\":{\"$date\":\"2024-06-07T11:29:04.284+00:00\"},\"s\":\"I\",  \"c\":\"STORAGE\",  \"id\":22430,   \"ctx\":\"WTCheckpointThread\",\"msg\":\"WiredTiger message\",\"attr\":{\"message\":\"[1717759744:284833][1:0x7f3ae4d23700], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 19186, snapshot max: 19186 snapshot count: 0, oldest timestamp: (1717759729, 1) , meta checkpoint timestamp: (1717759734, 1) base write gen: 124169\"}}\n","stream":"stdout","time":"2024-06-07T11:29:04.285002854Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:29:56.863+00:00\"},\"s\":\"I\",  \"c\":\"NETWORK\",  \"id\":22943,   \"ctx\":\"listener\",\"msg\":\"Connection accepted\",\"attr\":{\"remote\":\"127.0.0.1:42548\",\"connectionId\":331,\"connectionCount\":13}}\n","stream":"stdout","time":"2024-06-07T11:29:56.863331392Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:29:56.863+00:00\"},\"s\":\"I\",  \"c\":\"NETWORK\",  \"id\":51800,   \"ctx\":\"conn331\",\"msg\":\"client metadata\",\"attr\":{\"remote\":\"127.0.0.1:42548\",\"client\":\"conn331\",\"doc\":{\"application\":{\"name\":\"MongoDB Shell\"},\"driver\":{\"name\":\"MongoDB Internal Client\",\"version\":\"4.4.29\"},\"os\":{\"type\":\"Linux\",\"name\":\"Ubuntu\",\"architecture\":\"x86_64\",\"version\":\"20.04\"}}}}\n","stream":"stdout","time":"2024-06-07T11:29:56.863883533Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:29:56.869+00:00\"},\"s\":\"I\",  \"c\":\"REPL\",     \"id\":21356,   \"ctx\":\"conn331\",\"msg\":\"replSetInitiate admin command received from client\"}\n","stream":"stdout","time":"2024-06-07T11:29:56.869886194Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:29:56.873+00:00\"},\"s\":\"I\",  \"c\":\"NETWORK\",  \"id\":22944,   \"ctx\":\"conn331\",\"msg\":\"Connection ended\",\"attr\":{\"remote\":\"127.0.0.1:42548\",\"connectionId\":331,\"connectionCount\":12}}\n","stream":"stdout","time":"2024-06-07T11:29:56.873223696Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:04.327+00:00\"},\"s\":\"I\",  \"c\":\"STORAGE\",  \"id\":22430,   \"ctx\":\"WTCheckpointThread\",\"msg\":\"WiredTiger message\",\"attr\":{\"message\":\"[1717759804:327809][1:0x7f3ae4d23700], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 19201, snapshot max: 19201 snapshot count: 0, oldest timestamp: (1717759789, 1) , meta checkpoint timestamp: (1717759794, 1) base write gen: 124169\"}}\n","stream":"stdout","time":"2024-06-07T11:30:04.327977934Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.633+00:00\"},\"s\":\"I\",  \"c\":\"CONTROL\",  \"id\":23377,   \"ctx\":\"SignalHandler\",\"msg\":\"Received signal\",\"attr\":{\"signal\":15,\"error\":\"Terminated\"}}\n","stream":"stdout","time":"2024-06-07T11:30:06.63750487Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.636+00:00\"},\"s\":\"I\",  \"c\":\"CONTROL\",  \"id\":23378,   \"ctx\":\"SignalHandler\",\"msg\":\"Signal was sent by kill(2)\",\"attr\":{\"pid\":0,\"uid\":0}}\n","stream":"stdout","time":"2024-06-07T11:30:06.637554041Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.640+00:00\"},\"s\":\"I\",  \"c\":\"CONTROL\",  \"id\":23381,   \"ctx\":\"SignalHandler\",\"msg\":\"will terminate after current cmd ends\"}\n","stream":"stdout","time":"2024-06-07T11:30:06.643749122Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.643+00:00\"},\"s\":\"I\",  \"c\":\"REPL\",     \"id\":4784900, \"ctx\":\"SignalHandler\",\"msg\":\"Stepping down the ReplicationCoordinator for shutdown\",\"attr\":{\"waitTimeMillis\":10000}}\n","stream":"stdout","time":"2024-06-07T11:30:06.644203248Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.644+00:00\"},\"s\":\"I\",  \"c\":\"COMMAND\",  \"id\":4784901, \"ctx\":\"SignalHandler\",\"msg\":\"Shutting down the MirrorMaestro\"}\n","stream":"stdout","time":"2024-06-07T11:30:06.645004389Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.644+00:00\"},\"s\":\"I\",  \"c\":\"REPL\",     \"id\":40441,   \"ctx\":\"SignalHandler\",\"msg\":\"Stopping TopologyVersionObserver\"}\n","stream":"stdout","time":"2024-06-07T11:30:06.645024635Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.670+00:00\"},\"s\":\"I\",  \"c\":\"REPL\",     \"id\":40447,   \"ctx\":\"TopologyVersionObserver\",\"msg\":\"Stopped TopologyVersionObserver\"}\n","stream":"stdout","time":"2024-06-07T11:30:06.671023172Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.670+00:00\"},\"s\":\"I\",  \"c\":\"SHARDING\", \"id\":4784902, \"ctx\":\"SignalHandler\",\"msg\":\"Shutting down the WaitForMajorityService\"}\n","stream":"stdout","time":"2024-06-07T11:30:06.671063454Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.684+00:00\"},\"s\":\"I\",  \"c\":\"CONTROL\",  \"id\":4784903, \"ctx\":\"SignalHandler\",\"msg\":\"Shutting down the LogicalSessionCache\"}\n","stream":"stdout","time":"2024-06-07T11:30:06.685776668Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.691+00:00\"},\"s\":\"I\",  \"c\":\"NETWORK\",  \"id\":20562,   \"ctx\":\"SignalHandler\",\"msg\":\"Shutdown: going to close listening sockets\"}\n","stream":"stdout","time":"2024-06-07T11:30:06.691264825Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.692+00:00\"},\"s\":\"I\",  \"c\":\"NETWORK\",  \"id\":23017,   \"ctx\":\"listener\",\"msg\":\"removing socket file\",\"attr\":{\"path\":\"/tmp/mongodb-27001.sock\"}}\n","stream":"stdout","time":"2024-06-07T11:30:06.692398477Z"}
{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.695+00:00\"},\"s\":\"I\",  \"c\":\"NETWORK\",  \"id\":4784905, \"ctx\":\"SignalHandler\",\"msg\":\"Shutting down the global connection pool\"}\n","stream":"stdout","time":"2024-06-07T11:30:06.695686185Z"}

...shutdown messages continue

The relevant line is:

{"log":"{\"t\":{\"$date\":\"2024-06-07T11:30:06.633+00:00\"},\"s\":\"I\",  \"c\":\"CONTROL\",  \"id\":23377,   \"ctx\":\"SignalHandler\",\"msg\":\"Received signal\",\"attr\":{\"signal\":15,\"error\":\"Terminated\"}}\n","stream":"stdout","time":"2024-06-07T11:30:06.63750487Z"}

This is a fresh Docker install and AnyType is the only thing running. None of the other containers are killed.
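
As a quick check on whether the kernel or Docker (rather than some other process) terminated it, Docker's recorded state for the stopped container might help; this is a generic suggestion, with <mongo_container_id> as a placeholder:

# Shows whether Docker recorded an OOM kill, plus the exit code and finish time:
docker inspect -f 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}} FinishedAt={{.State.FinishedAt}}' <mongo_container_id>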

fb929 commented 3 weeks ago

Could you please collect the following debug information?

  1. system info:
    cat /proc/cpuinfo
    free -m
  2. check the system logs for OOM (Out of Memory) issues:
    dmesg -T | grep -i oom
  3. show the output of the "inspect" command after the MongoDB container stops unexpectedly:
    docker inspect <mongo_container_id>

    to find the ID of the stopped container, you can use the following command:

    docker ps -a | grep mongo-1

greyivy commented 2 weeks ago

It was indeed an OOM, and entirely unrelated to this project. Thank you for looking into it.

fb929 commented 2 weeks ago

@greyivy I would be very grateful if you could share your system's specifications. We are currently working out the minimum requirements for running the cluster, and this data would be very helpful to us.

greyivy commented 1 week ago

Sure thing!

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 156
model name  : Intel(R) Celeron(R) N5095 @ 2.00GHz
stepping    : 0
microcode   : 0x24000024
cpu MHz     : 1996.800
cache size  : 4096 KB
physical id : 0
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 27
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand hypervisor lahf_lm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms rdseed smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves arat umip waitpkg gfni rdpid movdiri movdir64b md_clear arch_capabilities
vmx flags   : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml tsc_scaling
bugs        : spectre_v1 spectre_v2 spec_store_bypass swapgs srbds mmio_stale_data rfds bhi
bogomips    : 3993.60
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual

                total        used        free      shared  buff/cache   available
Mem:            1962         515         108          17        1539        1447
Swap:              0           0           0

(Note that I tried with 3 GB and 4 GB of RAM as well with no change, but I was running other memory-intensive processes on the machine at the time. I'll give it another shot in the future when I can dedicate more time to troubleshooting.)
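
In case it helps the minimum-requirements investigation, a one-shot snapshot of per-container memory usage is easy to collect; this is only a generic suggestion, not something from the repo:

# Snapshot memory (and CPU) usage for every running container in the stack:
docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}'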