Query server frequently offline

jwilling commented 7 years ago

Hey! First of all, thanks to all of the contributors for the work done on this project. It's been extremely helpful and has saved me quite a lot of time.

I'm having an issue with the status commands through mscs. My server has two worlds running on non-standard ports. They both work just fine, and can be queried using an online tool with the expected response. When I try to query using mscs status, however, I tend to get this after the server has been running for about an hour (with a couple users online):

$ mscs status
Minecraft Server Status:
  wookies: running (query server offline).
    Process ID: 17617.
    Memory used: 1338836 kB.
  wookiescreative: running version 1.12 (0 of 20 users online).
    Process ID: 17816.
    Memory used: 1260216 kB.

Any idea what's causing this?

sandain commented 7 years ago

Hi @jwilling. I'm not sure what is going on here. It sounds like you have tried everything I would recommend.

Here is how it works: The script starts a tail process (line 1595 in the queryStart method) that pipes anything that ends up in the query.in file to the Query server via socat. Responses from the Query server via socat are piped into the query.out file. Other methods then call the method querySendPacket that packs together a Query packet and writes it to the query.in file and reads and unpacks the response from the query.out file. If the tail or socat process have issues for some reason, the Query server will be listed as offline.

Perhaps you have a permission issue on one of the two in/out files? I don't know.

Roflicide commented 7 years ago

Hey @jwilling,

Sorry that you're having issues. Hopefully we can try to resolve it ASAP. A few questions...

Are you having issues using mscs status with only the wookies world? Or does it occur to other worlds as well? Perhaps try creating another world and letting it run for an hour, and then see if the issue also occurs when you try to get the status of that world; this could help tell us if it's world-specific or if there's something else going on. The second thing I would like to check is whether or not the amount of users online has something affecting it; did you find that mscs status works when, say, there's less than 3 users online and then stops working when there's more than that? Perhaps the issue centers around how many users are online.
Also, are you running MSCS in a multi-user environment / what user(s) are you using to running MSCS? (That is, if you didn't follow the default installation). It could be a permission issue as @sandain noted

Sorry that we can't give you a clear cut answer; right now we have to isolate the problem and then figure it out at this point.

jwilling commented 7 years ago

Huge thanks @sandain and @Roflicide for your assistance.

So my original use case for needing the status command to function was for a status page. I'd gather the player data from the query-json command and pipe it into a static page for checking who was online. This job ran every minute. I'm sure there are better ways to do this, but I just wanted something quick and dirty.

Since I was having issues with the reliability of the variants of status, I switched to mcstatus to get the query information I needed. Interestingly, ever since I switched to using mcstatus, the status query in mscs functions as expected.

It definitely could be something related to user permissions, although as far as I can remember I ran everything from the same user. I definitely did do a custom installation, as I wanted everything contained within my home directory, in addition to having a different username. And yes, it only occurred with one specific world.

If this ever happens again I'll try to figure out what variables might be at play. That being said, it seems like I'm the only one that had this problem and my use cases are fairly specific, so we could close this and I could reopen it if it occurs later, if you'd prefer.

conrad784 commented 7 years ago

I had this issue too, but everytime this occured I ran a second query and it came back normal so I didn't evaluate what goes wrong. This happend more often on modded servers running with Forge, vanilla-server querys where as long as I remember fine. If this occures again I will report back.

sandain commented 7 years ago

If you notice, the code in the queryStart method that I linked to above includes a 20 second delay before starting the Query handler code (that uses tail and socat). This means that for the first 20 seconds after a server starts up, the expected response is "query server offline". If you are still getting that response after 20 seconds infrequently there is probably no issue -- if the querySendPacket method fails to decode a response from the server, it won't try again unless you ask it and this happening infrequently should be no problem. If this is happening more often than infrequently, then there is an issue.

We could make the querySendPacket method more robust to hickups by adding in a short delay followed by resending the packet when a response packet from the server isn't decoded properly if everyone thinks it is necessary.

Roflicide commented 7 years ago

@conrad784 thank you for the input, you are right when it seems to be happening to forge servers.

~~@sandain I have confirmed that this seems to be a bug on forge servers. To reproduce:~~

~~1. Create a new forge server~~ ~~2. Start it~~ ~~3. Run a mscs status command~~

~~The first time you run the status, it will list the query server as offline. The second time you run it, the command works and it appears to correctly query it.~~

~~This issue seems to occur every time after stopping/starting the forge world, and it seems to fix itself after the first query.~~

EDIT: after further investigation, @sandain seems to be right. This issue only occurs within the first 20 seconds of server startup, as far as i can tell.

sandain commented 7 years ago

I'm going to go ahead and mark this bug as invalid and close it. If anyone is able to come up with a condition where the "query server offline" message is received after the initial 20 second startup time, please reopen this bug.

Aersaud commented 6 years ago

Having this issue as well, even after the first 20 seconds. What do you need from me to figure this out?

Roflicide commented 6 years ago

@Aersaud

Are you running a vanilla server or with mods? What version of minecraft? Does it happen to all of your worlds? Did you follow the default installation process? Is there any output in the console?

Aersaud commented 6 years ago

Running with mods, forge and sponge. Recommended version for both of those. Also a few other mods, if you like I can name them. However, when testing this I ran the server modless, a side from forge and sponge. We only have 1 world so I cannot check any others unfortunately. Followed the default install, including server backups. When this happens the server works fine and there are no errors in either the linux console or the game console. It also ends up resolving and fixing itself after some time (I've seen about 5-10 minutes) sometimes it never resolves and I have to keep starting and stopping the server for it to eventually work. I am checking it using the mscs status [world] command.

sandain commented 6 years ago

Please post a copy of your latest.log, and the query.in and query.out files while you are experiencing this issue. I'll see if I can track it down.

Aersaud commented 6 years ago

Ok so while my server was running the query server went offline. I cp'ing the files now to SFTP them over here and post. @sandain latest.log queries.zip

Both query in and out are in the queries.zip Virus Total in case any one is worried: https://www.virustotal.com/#/file/49408761d51af5493245a7b132b8fce0adb0eb8b8c4f77b54fba533dd4e09586/detection

sandain commented 6 years ago

Well that didn't really help. The latest.log shows that the query server is running and I see no error messages that would note the server is having issues with query.

[10:16:35] [Query Listener #1/INFO]: Query running on 0.0.0.0:25565

I am curious what this error message means:

java.io.EOFException: null
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268) ~[?:1.8.0_162]
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258) ~[?:1.8.0_162]
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164) ~[?:1.8.0_162]
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79) ~[?:1.8.0_162]
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91) ~[?:1.8.0_162]
    at net.minecraft.nbt.CompressedStreamTools.func_74796_a(CompressedStreamTools.java:26) ~[gi.class:?]
    at net.minecraft.world.storage.SaveFormatOld.loadAndFix(SaveFormatOld.java:117) [bfc.class:?]
    at net.minecraft.world.storage.SaveHandler.redirect$onLoadWorldInfo$bbi000(SaveHandler.java:1068) [bfb.class:?]
    at net.minecraft.world.storage.SaveHandler.func_75757_d(SaveHandler.java:122) [bfb.class:?]
    at org.spongepowered.common.world.WorldManager.loadAllWorlds(WorldManager.java:706) [WorldManager.class:1.12.2-2611-8.0.0-BETA-3018]
    at net.minecraft.server.MinecraftServer.func_71247_a(MinecraftServer.java:3530) [MinecraftServer.class:?]
    at net.minecraft.server.dedicated.DedicatedServer.func_71197_b(DedicatedServer.java:270) [nz.class:?]
    at net.minecraft.server.MinecraftServer.run(MinecraftServer.java:484) [MinecraftServer.class:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]

It doesn't look like it should break query, but maybe you should look into what is causing that.

The query.in file looks correct, but the query.out file is empty meaning commands were being sent but no response was received for some reason. Check which processes are currently running when you get this problem (run ps -A). You should have a couple tail processes and a socat process running -- tail is used to watch for changes to the query.in file and pipe it to socat, and socat sends the query to the server and pipes the response to the query.out file.

Aersaud commented 6 years ago

mcadmin@tkmminecraft:~$ ps -A
  PID TTY          TIME CMD
    1 ?        00:00:03 systemd
    2 ?        00:00:00 kthreadd
    3 ?        00:00:00 ksoftirqd/0
    5 ?        00:00:00 kworker/0:0H
    7 ?        00:00:21 rcu_sched
    8 ?        00:00:00 rcu_bh
    9 ?        00:00:00 migration/0
   10 ?        00:00:01 watchdog/0
   11 ?        00:00:01 watchdog/1
   12 ?        00:00:00 migration/1
   13 ?        00:00:01 ksoftirqd/1
   15 ?        00:00:00 kworker/1:0H
   16 ?        00:00:01 watchdog/2
   17 ?        00:00:00 migration/2
   18 ?        00:00:00 ksoftirqd/2
   20 ?        00:00:00 kworker/2:0H
   21 ?        00:00:01 watchdog/3
   22 ?        00:00:00 migration/3
   23 ?        00:00:18 ksoftirqd/3
   25 ?        00:00:00 kworker/3:0H
   26 ?        00:00:00 kdevtmpfs
   27 ?        00:00:00 netns
   28 ?        00:00:00 perf
   29 ?        00:00:00 khungtaskd
   30 ?        00:00:00 writeback
   31 ?        00:00:00 ksmd
   32 ?        00:00:03 khugepaged
   33 ?        00:00:00 crypto
   34 ?        00:00:00 kintegrityd
   35 ?        00:00:00 bioset
   36 ?        00:00:00 kblockd
   37 ?        00:00:00 ata_sff
   38 ?        00:00:00 md
   39 ?        00:00:00 devfreq_wq
   44 ?        00:00:00 kswapd0
   45 ?        00:00:00 vmstat
   46 ?        00:00:00 fsnotify_mark
   47 ?        00:00:00 ecryptfs-kthrea
   63 ?        00:00:00 kthrotld
   64 ?        00:00:00 acpi_thermal_pm
   65 ?        00:00:00 bioset
   66 ?        00:00:00 bioset
   67 ?        00:00:00 bioset
   68 ?        00:00:00 bioset
   69 ?        00:00:00 bioset
   70 ?        00:00:00 bioset
   71 ?        00:00:00 bioset
   72 ?        00:00:00 bioset
   73 ?        00:00:00 scsi_eh_0
   74 ?        00:00:00 scsi_tmf_0
   75 ?        00:00:00 scsi_eh_1
   76 ?        00:00:00 scsi_tmf_1
   82 ?        00:00:00 ipv6_addrconf
   95 ?        00:00:00 deferwq
   96 ?        00:00:00 charger_manager
   97 ?        00:00:00 bioset
  151 ?        00:00:00 kpsmoused
  153 ?        00:00:00 scsi_eh_2
  155 ?        00:00:00 scsi_tmf_2
  192 ?        00:00:00 kworker/0:1H
  194 ?        00:00:00 bioset
  269 ?        00:00:00 raid5wq
  293 ?        00:00:00 kdmflush
  295 ?        00:00:00 bioset
  304 ?        00:00:00 kdmflush
  305 ?        00:00:00 bioset
  317 ?        00:00:00 bioset
  342 ?        00:00:07 jbd2/dm-0-8
  343 ?        00:00:00 ext4-rsv-conver
  400 ?        00:00:00 kworker/2:1H
  408 ?        00:00:00 iscsi_eh
  413 ?        00:00:00 systemd-journal
  420 ?        00:00:00 ib_addr
  422 ?        00:00:00 kauditd
  433 ?        00:00:00 ib_mcast
  435 ?        00:00:00 ib_nl_sa_wq
  437 ?        00:00:00 ib_cm
  441 ?        00:00:00 iw_cm_wq
  442 ?        00:00:00 rdma_cm
  445 ?        00:00:00 lvmetad
  526 ?        00:00:03 kworker/1:1H
  580 ?        00:00:00 iprt-VBoxWQueue
  730 ?        00:00:00 kworker/3:1H
  788 ?        00:00:00 ttm_swap
  796 ?        00:00:00 ext4-rsv-conver
  936 ?        00:00:00 rsyslogd
  938 ?        00:00:00 lxcfs
  940 ?        00:00:00 atd
  942 ?        00:00:00 cron
  946 ?        00:00:00 acpid
  954 ?        00:00:00 dbus-daemon
  975 ?        00:00:00 systemd-logind
  979 ?        00:00:10 snapd
  981 ?        00:00:03 accounts-daemon
  992 ?        00:00:00 mdadm
 1000 ?        00:00:00 polkitd
 1079 ?        00:00:00 dhclient
 1139 ?        00:00:00 sshd
 1163 ?        00:00:07 iscsid
 1164 ?        00:00:32 iscsid
 1217 ?        00:00:07 irqbalance
 1227 tty1     00:00:00 agetty
 1528 ?        00:00:00 systemd
 1531 ?        00:00:00 (sd-pam)
 5579 ?        00:00:21 kworker/0:2
 9818 ?        00:00:00 sh
 9820 ?        00:00:02 tail
 9821 ?        00:00:00 sh
 9822 ?        00:33:56 java
10130 ?        00:00:01 kworker/3:0
10709 ?        00:00:11 kworker/0:0
11687 ?        00:00:00 kworker/2:1
13008 ?        00:00:00 kworker/u8:1
13012 ?        00:00:00 kworker/u8:0
13020 ?        00:00:00 kworker/u8:2
13064 ?        00:00:00 kworker/3:1
15560 ?        00:00:00 kworker/1:2
15897 ?        00:00:00 kworker/2:2
15898 ?        00:00:00 kworker/2:3
15906 ?        00:00:00 sshd
15937 ?        00:00:00 sshd
15938 pts/0    00:00:00 bash
17062 ?        00:00:00 systemd-timesyn
17064 ?        00:00:00 kworker/0:1
19820 ?        00:00:00 systemd-udevd
20357 ?        00:00:00 xfsalloc
20358 ?        00:00:00 xfs_mru_cache
20363 ?        00:00:00 jfsIO
20364 ?        00:00:00 jfsCommit
20365 ?        00:00:00 jfsCommit
20366 ?        00:00:00 jfsCommit
20367 ?        00:00:00 jfsCommit
20368 ?        00:00:00 jfsSync
20895 ?        00:00:04 kworker/1:1
21819 ?        00:00:00 kworker/2:0
25742 ?        00:00:02 kworker/1:0
28013 pts/0    00:00:00 ps

I see a tail process, but not seeing socat.

It's happening again by the way.

mcadmin@tkmminecraft:~$ mscs status TKM-MadCraft
Minecraft Server Status:
  TKM-MadCraft: running (query server offline).
    Process ID: 9822.
    Memory used: 1389316 kB.

Permissions:

mcadmin@tkmminecraft:/opt/mscs/worlds/TKM-MadCraft$ ls -la
-rw-r--r--  1 minecraft minecraft       44 Apr 14 13:18 query.in
-rw-r--r--  1 minecraft minecraft        0 Apr 14 13:18 query.out

@sandain

Aersaud commented 6 years ago

Also noticing this:

[02:19:23] [Server thread/ERROR]: Exception reading ./TKM-MadCraft/DIM1/level.dat
java.io.EOFException: null
        at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268) ~[?:1.8.0_162]
        at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258) ~[?:1.8.0_162]
        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164) ~[?:1.8.0_162]
        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79) ~[?:1.8.0_162]
        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91) ~[?:1.8.0_162]
        at net.minecraft.nbt.CompressedStreamTools.func_74796_a(CompressedStreamTools.java:26) ~[gi.class:?]
        at net.minecraft.world.storage.SaveFormatOld.loadAndFix(SaveFormatOld.java:117) [bfc.class:?]
        at net.minecraft.world.storage.SaveHandler.redirect$onLoadWorldInfo$bbi000(SaveHandler.java:1068) [bfb.class:?]
        at net.minecraft.world.storage.SaveHandler.func_75757_d(SaveHandler.java:122) [bfb.class:?]
        at org.spongepowered.common.world.WorldManager.loadAllWorlds(WorldManager.java:706) [WorldManager.class:1.12.2-2611-7.1.0-BETA-3019]
        at net.minecraft.server.MinecraftServer.func_71247_a(MinecraftServer.java:3530) [MinecraftServer.class:?]
        at net.minecraft.server.dedicated.DedicatedServer.func_71197_b(DedicatedServer.java:270) [nz.class:?]
        at net.minecraft.server.MinecraftServer.run(MinecraftServer.java:484) [MinecraftServer.class:?]

-rw-r--r--  1 minecraft minecraft    0 Apr 11 02:35 level.dat

sandain commented 6 years ago

You only have one tail not two, and no socat. That is definitely part of the problem. The question is why that is the case.

I noticed that error also. You might want to look into what is causing that, it is not the script, but might be behind the issue with query.

Aersaud commented 6 years ago

I found the error from the Sponge github. Apparently its a corrupted level.dat - I wonder if this may be causing it. @sandain

UPDATE!

Removed the corrupt level.dat cp'd level.dat_old to level.dat, started server. No more exception reading error and server is currently showing query working.

Will let it run for a while and check back on it.

Aersaud commented 6 years ago

Ok just another update. Left it running all night and it did the same thing. Query server went offline randomly.

Roflicide commented 6 years ago

Hi again @Aersaud,

I'm sorry you're having an issue. I suspect what is going on is that socat's default timeout (0.5 s) is too low for certain server environments; however, this is just a guess. Let's see what exactly is going on. If you could, please do the following:

Stop the world.
Type which msctl to get the location of the script file. In my case it was /usr/local/bin/msctl

Edit line 1602 of that file, and replace $SOCAT - with $SOCAT -d -d -d -lf \"$WORLD_DIR/socat.log\" -

So now lines 1599-1603 should look like this:

nohup sh -c "
sleep 20;
tail -f --pid=$SERVER_PID \"$WORLD_DIR/query.in\" |
$SOCAT -d -d -d -lf \"$WORLD_DIR/socat.log\" - UDP:$SERVER_IP:$QUERY_PORT > \"$WORLD_DIR/query.out\"
" > /dev/null 2>&1 &

Start the world. About a minute after starting it, there should be a file created called socat.log in the world directory (/opt/mscs/worlds/<yourworld>). This will log all of the information on what socat is doing to that file.

Once you've followed the above steps, please execute the status queries as you normally would. Once the issue comes, paste the contents of the socat.log file here so we can see why exactly socat is terminating.

Thanks!

sandain commented 6 years ago

Thanks for the help @Roflicide, I wasn't sure how to debug this issue further.

Aersaud commented 6 years ago

Just applied your change to the msctl file. Same location as yours. Server is starting now will monitor and post when it happens. @sandain @Roflicide

Aersaud commented 6 years ago

This is from the latest time it went down.


2018/04/21 23:22:42 socat[4349] I This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit. $
2018/04/21 23:22:42 socat[4349] I This product includes software written by Tim Hudson (tjh@cryptsoft.com)
2018/04/21 23:22:42 socat[4349] N reading from and writing to stdio
2018/04/21 23:22:42 socat[4349] N opening connection to AF=2 127.0.0.1:25566
2018/04/21 23:22:42 socat[4349] I starting connect loop
2018/04/21 23:22:42 socat[4349] I socket(2, 2, 17) -> 6
2018/04/21 23:22:42 socat[4349] N successfully connected from local address AF=2 127.0.0.1:57655
2018/04/21 23:22:42 socat[4349] I resolved and opened all sock addresses
2018/04/21 23:22:42 socat[4349] N starting data transfer loop with FDs [0,1] and [6,6]
2018/04/21 23:23:12 socat[4349] I transferred 11 bytes from 0 to 6
2018/04/21 23:23:12 socat[4349] E read(6, 0x84c100, 8192): Connection refused
2018/04/21 23:23:12 socat[4349] N exit(1)
2018/04/21 23:23:12 socat[4349] I shutdown(6, 2)

mcadmin@tkmminecraft:/opt/mscs/worlds$ mscs status
Minecraft Server Status:
  TKM-MadCraft: running (query server offline).
    Process ID: 2421.
    Memory used: 1584004 kB.

Roflicide commented 6 years ago

For some reason it looks like it's opening a connection to port 25566 even though your query server seems to be running on port 25565 based on the latest.log you uploaded earlier. This would explain why it's failing to read a connection.

Can you paste the contents of your server.properties file here? Specifically, I want to see what the values are (or if the values are even there!) for the properties query.port, query-enabled, and server-port.

Aersaud commented 6 years ago

Yes sorry forgot that I changed a few things. It is now running on port 25566. Here is the updated server.properties.


#Minecraft server properties
#Sat Apr 21 23:23:09 EDT 2018
spawn-protection=16
max-tick-time=60000
query.port=25566
generator-settings=
force-gamemode=false
allow-nether=true
gamemode=0
broadcast-console-to-ops=true
enable-query=true
player-idle-timeout=0
difficulty=2
spawn-monsters=true
op-permission-level=4
pvp=true
snooper-enabled=true
level-type=DEFAULT
hardcore=false
enable-command-block=false
max-players=25
network-compression-threshold=256
resource-pack-sha1=
max-world-size=29999984
rcon.port=25575
server-port=25566
server-ip=
spawn-npcs=true
allow-flight=false
level-name=TKM-MadCraft
view-distance=10
resource-pack=
spawn-animals=true
white-list=false
rcon.password=CHANGEDFORPRIVACY
generate-structures=true
max-build-height=256
online-mode=false
level-seed=
prevent-proxy-connections=false
use-native-transport=true
motd="Welcome to TKM-MadCraft\!"
enable-rcon=true

@Roflicide

This was set before the error.

Roflicide commented 6 years ago

@Aersaud So before the error the query.port and server.port were both set to 25566? Or were they set to something else?

sandain commented 6 years ago

Let's try another modification to the script in the same place. Add the -s option to socat to make it 'sloppy with errors'.

This should make lines 1599-1603 look like:

nohup sh -c "
  sleep 20;
  tail -f --pid=$SERVER_PID \"$WORLD_DIR/query.in\" |
  $SOCAT -d -d -d -lf \"$WORLD_DIR/socat.log\" -s - UDP:$SERVER_IP:$QUERY_PORT > \"$WORLD_DIR/query.out\"
" > /dev/null 2>&1 &

A more complicated option would be to explore the -W option. This would require wrapping the code to start the query handler in a while loop that checks for the existence of the lockfile declared with the -W option and the current state of the server. If the server is running, but the lockfile is missing, the query handler would be restarted. This option requires some coding, so lets give the simple 'sloppy' option above a try first.

Aersaud commented 6 years ago

Just updated my msctl file using nano. It is now displaying

nohup sh -c "
    sleep 20;
    tail -f --pid=$SERVER_PID \"$WORLD_DIR/query.in\" |
    $SOCAT -d -d -d -lf \"$WORLD_DIR/socat.log\" -s - UDP:$SERVER_IP:$QUERY_PORT > \"$WORLD_DIR/query.out\"
  " > /dev/null 2>&1 &

Whats next?

sandain commented 6 years ago

Restart your server with that change in place and let it run until query ends up failing (or not). Post the socat.log if it fails.

Aersaud commented 6 years ago

2018/04/22 21:46:48 socat[15897] N opening connection to AF=2 127.0.0.1:25566
2018/04/22 21:46:48 socat[15897] I starting connect loop
2018/04/22 21:46:48 socat[15897] I socket(2, 2, 17) -> 6
2018/04/22 21:46:48 socat[15897] N successfully connected from local address AF=2 127.0.0.1:58939
2018/04/22 21:46:48 socat[15897] I resolved and opened all sock addresses
2018/04/22 21:46:48 socat[15897] N starting data transfer loop with FDs [0,1] and [6,6]
2018/04/22 21:47:05 socat[15897] I transferred 11 bytes from 0 to 6
2018/04/22 21:47:05 socat[15897] E read(6, 0x1db8100, 8192): Connection refused
2018/04/22 21:47:05 socat[15897] N socket 2 to socket 1 is in error
2018/04/22 21:47:05 socat[15897] N socket 2 (fd 6) is at EOF
2018/04/22 21:47:05 socat[15897] I poll timed out (no data within 0.500000 seconds)
2018/04/22 21:47:05 socat[15897] I shutdown(6, 2)
2018/04/22 21:47:05 socat[15897] N exiting with status 0

@sandain

Latest from the socat.log when I ran mscs status

sandain commented 6 years ago

We might need to explore the -W option that I spoke of to restart the Query handler when it goes down. This is assuming that the Query server starts working again after refusing the connection that caused the Query handler to go down. If the server is going down and not coming back up, there is nothing we can do.

Please check if the Query server comes back up after going down with the following tool: https://dinnerbone.com/minecraft/tools/status/

Aersaud commented 6 years ago

Well it came back up on it's own without me doing anything currently......I will check that website out to query it when it tells me it is down.

Here is the last bit of it going down and up..

I check the status using mscs status every so often to make sure.

2018/04/22 22:54:36 socat[20356] N opening connection to AF=2 127.0.0.1:25566
2018/04/22 22:54:36 socat[20356] I starting connect loop
2018/04/22 22:54:36 socat[20356] I socket(2, 2, 17) -> 6
2018/04/22 22:54:36 socat[20356] N successfully connected from local address AF=2 127.0.0.1:60974
2018/04/22 22:54:36 socat[20356] I resolved and opened all sock addresses
2018/04/22 22:54:36 socat[20356] N starting data transfer loop with FDs [0,1] and [6,6]
2018/04/22 22:55:41 socat[20356] N socket 1 (fd 0) is at EOF
2018/04/22 22:55:41 socat[20356] I shutdown(6, 1)
2018/04/22 22:55:42 socat[20356] I poll timed out (no data within 0.500000 seconds)
2018/04/22 22:55:42 socat[20356] I shutdown(6, 2)
2018/04/22 22:55:42 socat[20356] N exiting with status 0
2018/04/22 22:57:40 socat[22210] I socat by Gerhard Rieger - see www.dest-unreach.org
2018/04/22 22:57:40 socat[22210] I This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit. (http://www.openssl.o$
2018/04/22 22:57:40 socat[22210] I This product includes software written by Tim Hudson (tjh@cryptsoft.com)
2018/04/22 22:57:40 socat[22210] N reading from and writing to stdio
2018/04/22 22:57:40 socat[22210] N opening connection to AF=2 127.0.0.1:25566
2018/04/22 22:57:40 socat[22210] I starting connect loop
2018/04/22 22:57:40 socat[22210] I socket(2, 2, 17) -> 6
2018/04/22 22:57:40 socat[22210] N successfully connected from local address AF=2 127.0.0.1:48128
2018/04/22 22:57:40 socat[22210] I resolved and opened all sock addresses
2018/04/22 22:57:40 socat[22210] N starting data transfer loop with FDs [0,1] and [6,6]
2018/04/23 14:59:08 socat[22210] I transferred 11 bytes from 0 to 6
2018/04/23 14:59:08 socat[22210] I transferred 13 bytes from 6 to 1
2018/04/23 14:59:09 socat[22210] I transferred 15 bytes from 0 to 6
2018/04/23 14:59:09 socat[22210] I transferred 195 bytes from 6 to 1
2018/04/23 16:12:06 socat[22210] I transferred 11 bytes from 0 to 6
2018/04/23 16:12:06 socat[22210] I transferred 13 bytes from 6 to 1
2018/04/23 16:12:07 socat[22210] I transferred 15 bytes from 0 to 6
2018/04/23 16:12:07 socat[22210] I transferred 202 bytes from 6 to 1
2018/04/23 16:27:01 socat[22210] I transferred 11 bytes from 0 to 6
2018/04/23 16:27:01 socat[22210] I transferred 14 bytes from 6 to 1
2018/04/23 16:27:02 socat[22210] I transferred 15 bytes from 0 to 6
2018/04/23 16:27:02 socat[22210] I transferred 202 bytes from 6 to 1
2018/04/23 16:27:10 socat[22210] I transferred 11 bytes from 0 to 6
2018/04/23 16:27:10 socat[22210] I transferred 14 bytes from 6 to 1
2018/04/23 16:27:11 socat[22210] I transferred 15 bytes from 0 to 6
2018/04/23 16:27:11 socat[22210] I transferred 202 bytes from 6 to 1
2018/04/23 16:28:59 socat[22210] I transferred 11 bytes from 0 to 6
2018/04/23 16:28:59 socat[22210] I transferred 13 bytes from 6 to 1
2018/04/23 16:29:00 socat[22210] I transferred 15 bytes from 0 to 6
2018/04/23 16:29:00 socat[22210] I transferred 202 bytes from 6 to 1
2018/04/23 16:29:51 socat[22210] I transferred 11 bytes from 0 to 6
2018/04/23 16:29:51 socat[22210] I transferred 14 bytes from 6 to 1
2018/04/23 16:29:52 socat[22210] I transferred 15 bytes from 0 to 6
2018/04/23 16:29:52 socat[22210] I transferred 202 bytes from 6 to 1

I have a feeling the 2018/04/22 22:55:42 socat[20356] I poll timed out (no data within 0.500000 seconds) kind of points to our issue as you described earlier where it is too short of a time frame.

@sandain

Roflicide commented 5 years ago

I figured out how to replicate this issue on my machine--

Start three worlds (either all three at the same time or very quickly within one another). Then run mscs status as quickly as possible. Wait a few minutes (for the query servers to start up) and run mscs status again.

Based on my testing, with three worlds started all at the same time using mscs start, the first world will correctly be queried, but the last two will display running (query server offline), even after waiting a significant amount of time for the servers to start up before querying again (like more than 5 minutes). This issue seems to arise whenever I query multiple worlds using mscs status that have been started relatively recently (like within 20 seconds of the server being started up).

I checked the query.in files, and they are all sending the same challenge packet--its just that the latter two aren't getting a response, whereas the first one is getting a response.

When i tested this with two worlds (mscs start then mscs status right after, then waiting a few minutes before running mscs status again) one world showed the query data whereas the second one didnt (it said running (query server offline)).

When i tested this with one world I had no issues.

I don't think its necessarily a problem with the querying portion, because (as said above) it looks like all the initial queries are being built the same. i think it has something to do with this bit of code thats handling the negotiation between the query.in and query.out files:

  # Start a tail process to watch for changes to the query.in file to pipe
  # to the Minecraft query server via socat.  The response from the query
  # server is piped into the query.out file.  Give the server a moment to
  # start before initializing the Query handler.
  nohup sh -c "
    sleep 20;
    tail -f --pid=$SERVER_PID \"$WORLD_DIR/query.in\" |
    $SOCAT - UDP:$SERVER_IP:$QUERY_PORT > \"$WORLD_DIR/query.out\"
  " >/dev/null 2>&1 &
  # Verify the connection to the query server worked.
  if [ $? -ne 0 ]; then
    printf "Error connecting to the query server.\n"
  fi

but i'm not entirely sure, it's just a guess.

@sandain let me know if you can replicate this using the steps i outlined above

sandain commented 5 years ago

I just tested using your method, and it works as expected for me. The first query fails because of the built-in delay (see the sleep 20; line in your code snippet above), but every subsequent query after that timer kicks off works as expected.

sandain commented 4 years ago

I think with PR #242 by @Roflicide, we can finally put this issue to rest. Feel free to reopen this issue if you are still affected.

MinecraftServerControl / mscs

Query server frequently offline #169