allejo opened this issue 3 years ago
Have you identified whether the slowness is in reading the files or in sorting? If it is while reading, we could try to read each file only once, both to match the magic and to read the time. Currently we do an fread, rewind, fseek, and fread; I think we could merge those into a single fread.
If it is while sorting, the best a full sort can do is n*log(n), but since we only ever show at most 100 results, we could do that selection in roughly 100*n, which for very large n can be lower.
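For the sorting half, something like std::partial_sort can keep just the newest 100 entries without ordering the whole list. A rough sketch (ReplayInfo and keepNewest are hypothetical names, not from the bzfs code):

#include <algorithm>
#include <string>
#include <vector>
#include <ctime>

// Hypothetical record for one replay file; the real code keeps similar data.
struct ReplayInfo
{
    std::string fileName;
    time_t      fileTime;
};

// Move the newest `limit` replays to the front, ordered newest first,
// without fully sorting the rest of the list.
void keepNewest(std::vector<ReplayInfo>& replays, size_t limit = 100)
{
    const size_t count = std::min(limit, replays.size());
    std::partial_sort(replays.begin(), replays.begin() + count, replays.end(),
                      [](const ReplayInfo& a, const ReplayInfo& b)
                      { return a.fileTime > b.fileTime; });
    replays.resize(count);
}

That needs roughly n*log(100) comparisons, so it grows linearly with the number of replays.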
Basic optimization rules apply. For any operation that is non-performant, you can do it less often, do it faster, or avoid doing it altogether.
If scanning 13,000 files is too costly (one might imagine so), you can trade time for memory and store the results, either in memory or in a database of sorts. It is the nature of this problem that the information in each file doesn't change, so really only new files need to be scanned. This is a form of "do it less often" optimization.
Let's suppose you had a cached replay list, replay_list.txt. While not super speedy, I can imagine that iterating over the directory (BTW, C++17 std::filesystem helps here) and only keeping those files newer than the last modification time of the cache would be more performant.
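A minimal sketch of that filter, assuming the cache records the time it was last rebuilt (newFilesSince is a made-up helper name):

#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Return only the entries modified after the cache was last updated, so the
// expensive header reads run just for new recordings.
std::vector<fs::path> newFilesSince(const fs::path& recordDir,
                                    fs::file_time_type cacheTime)
{
    std::vector<fs::path> fresh;
    for (const auto& entry : fs::directory_iterator(recordDir))
    {
        if (entry.is_regular_file() && entry.last_write_time() > cacheTime)
            fresh.push_back(entry.path());
    }
    return fresh;
}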
The challenge is one of freshness: how quickly after a new replay is created do you want it to show up in response to /replay list? That's how often you need to re-evaluate this cache, or, more precisely, how often you need to check for new files.
FWIW, you could use std::async to kick off this indexing task and to update it periodically at whatever interval you need.
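A rough sketch of that, where ReplayCache and buildReplayCache are placeholders for whatever indexing routine bzfs would actually use:

#include <chrono>
#include <future>
#include <iostream>
#include <map>
#include <string>

// Hypothetical cache type: replay file name -> duration in seconds.
using ReplayCache = std::map<std::string, float>;

// Stand-in for the expensive directory scan discussed above.
ReplayCache buildReplayCache(const std::string& recordDir)
{
    ReplayCache cache;
    // ... open each file under recordDir, read its header, fill the cache ...
    return cache;
}

int main()
{
    // Kick off the scan without blocking the main server loop.
    auto pending = std::async(std::launch::async, buildReplayCache,
                              std::string("recordings/"));

    // Elsewhere (e.g. once per game loop), poll for completion before
    // answering /replay list from the cache.
    if (pending.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
    {
        ReplayCache cache = pending.get();
        std::cout << "Indexed " << cache.size() << " replays" << std::endl;
    }
    return 0;
}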
How long does it take bzfs to scan through those 13,000 replay files?
On server startup, it could look up the files and cache the results. Another possible remedy is to simply store the results in a simple database and have a plug-in look up the results, since the API allows overriding default slash commands.
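If the database route were taken, a minimal sketch with SQLite could look like this (the replay_cache.sqlite file name, the replays table, and its columns are all made up for illustration; link with -lsqlite3):

#include <sqlite3.h>
#include <iostream>

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("replay_cache.sqlite", &db) != SQLITE_OK)
        return 1;

    // One row per replay, filled in whenever new files are indexed.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS replays ("
        "  filename TEXT PRIMARY KEY,"
        "  duration REAL,"
        "  filetime INTEGER)",
        nullptr, nullptr, nullptr);

    // Listing the newest 100 replays becomes a single query instead of
    // thousands of open/read calls.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT filename, duration FROM replays ORDER BY filetime DESC LIMIT 100",
        -1, &stmt, nullptr);
    while (sqlite3_step(stmt) == SQLITE_ROW)
    {
        std::cout << sqlite3_column_text(stmt, 0) << " ("
                  << sqlite3_column_double(stmt, 1) << "s)" << std::endl;
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}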
Doing some very basic/stupid profiling of where the slowdown is, I've added some calls to bz_getCurrentTime and I'm getting the following information:
sendFileList before getFileList(): 16.0285
sendFileList after getFileList(): 115.735
sendFileList after sorting: 115.735
sendFileList after sendMessage loop: 115.735
Removing the call to loadFileTime does not appear to make a difference. In retrospect, that makes sense since the file is already open at that point.
Sorting is pretty much instant, so it appears to me that the biggest slowdown is opening the files.
13,000 files will stress the OS I/O cache. 26,000 will be even worse... Starting an asynchronous cache of the required information at server start is a pretty easy fix.
A suggestion I found is to use openat instead of open, and pread instead of fseek and fread.
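As a sketch of what that would look like (the duration offset here is a made-up constant, since I haven't checked the actual header layout):

#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

// Hypothetical offset of the duration field inside the replay header.
static const off_t kDurationOffset = 8;

// Read the magic and the duration with two positioned reads and no seeks,
// resolving the name relative to the already-open directory descriptor.
bool readHeaderFields(int dirFd, const char* name, uint32_t& magic, uint32_t& duration)
{
    int fd = openat(dirFd, name, O_RDONLY);
    if (fd < 0)
        return false;

    bool ok = pread(fd, &magic, sizeof(magic), 0) == (ssize_t) sizeof(magic)
           && pread(fd, &duration, sizeof(duration), kDurationOffset) == (ssize_t) sizeof(duration);
    close(fd);
    return ok;
}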
How big of an issue would it be if you could not do /replay load #101, i.e. select an index beyond the limit of 100?
The approach I'm taking is twofold; part of it is caching getFileList so as to avoid iterating over everything when doing a pattern search.
Also, a small side question: if you do a /replay list <query> and then follow that up with a /replay play #n, there is no way that index number you're passing would actually select the n-th replay of that search query, right?
> It is the nature of this problem that the information of each file doesn't change, so really only new files need to be scanned.
Something to keep in mind is that most recordings are actively streamed to the disk. In this mode, the replay header is initially written with a recording duration of 0, and when the recording is stopped this duration is updated. So if we're able to read the recording while it's being actively written (I've not tested that), you would not want to cache information about recordings that have a duration of 0.
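So any cache update would probably want a guard along these lines (ReplayCache is the same hypothetical file-name-to-duration map sketched earlier):

#include <map>
#include <string>

using ReplayCache = std::map<std::string, float>;

// Only remember completed recordings; a duration of 0 likely means the file
// is still being streamed to disk, so re-read its header on the next pass.
void cacheIfComplete(ReplayCache& cache, const std::string& fileName, float duration)
{
    if (duration > 0.0f)
        cache[fileName] = duration;
    else
        cache.erase(fileName);
}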
I've been looking into this further the last few days. When we're listing replays, we loop through all of them, open the file, and run two fread calls. The first fread is to read the magic string to identify if it is a BZFlag replay file. Then we seek ahead a bit and read the replay duration, which we show in the list output. So far it looks like just the act of running a fread for the magic string on thousands of files takes a good chunk of time.
I made two example directories of replays: one with 3000 files of 16MB each (some tournament replays I had), and one with 15000 replay files I created with a basic program that just writes out the replay header and a world database dumped from an empty world file (502 bytes per replay).
I compiled this minimal program:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <dirent.h>
#include <string>
#include <iostream>

typedef uint32_t u32;

int main() {
    std::string RecordDir = "recordings/";
    DIR* dir = opendir(RecordDir.c_str());
    if (dir == NULL)
        return 1;

    struct dirent *de;
    unsigned long filesRead = 0;
    while ((de = readdir(dir)) != NULL)
    {
        std::string name = RecordDir;
        name += de->d_name;

        FILE* file = fopen(name.c_str(), "rb");
        if (file == NULL)
            continue;

        // Read only the 4-byte magic, as the replay listing code does first
        u32 magic;
        char buffer[sizeof(magic)];
        if (fread(buffer, sizeof(magic), 1, file) != 1)
        {
            fclose(file);
            continue;
        }
        filesRead++;
        fclose(file);
    }
    closedir(dir);

    std::cout << "Total files read: " << filesRead << std::endl;
    return 0;
}
I ran the test on a Debian system with an Intel Core 2 Duo E8400 3GHz, 6GB RAM, and an 80GB SATA hard drive. With 3000 files the program took 40 seconds, and with 15000 files it took 94 seconds. Before each run I executed (as root) sync; echo 3 > /proc/sys/vm/drop_caches to flush the caches.
Things to think about:
- This same scan runs for both /replay list and /record list, so maybe we should disallow that to prevent live game servers from freezing in the case of lots of replays.
- How this interacts with sorting by time (/replay list -t).

I have not tested it, and cannot, but maybe this is faster:
#include <iostream>
#include <string>
#include <stdint.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

typedef uint32_t u32;

int main() {
    std::string RecordDir = "recordings/";
    DIR* dir = opendir(RecordDir.c_str());
    if (dir == NULL)
        return 1;

    // Open the directory once so openat() resolves entries relative to it
    int dfd = open(RecordDir.c_str(), O_RDONLY);

    struct dirent *de;
    unsigned long filesRead = 0;
    while ((de = readdir(dir)) != NULL)
    {
        int file = openat(dfd, de->d_name, O_RDONLY);
        if (file < 0)
            continue;

        u32 magic;
        char buffer[sizeof(magic)];
        if (read(file, buffer, sizeof(magic)) != (ssize_t) sizeof(magic))
        {
            close(file);
            continue;
        }
        filesRead++;
        close(file);
    }
    close(dfd);
    closedir(dir);

    std::cout << "Total files read: " << filesRead << std::endl;
    return 0;
}
I just tested that variant and it takes basically the same amount of time.
In case this is helpful to anyone, this is the output from two consecutive runs of my original test program, with the first time being with caches cleared.
blast007@debian:~$ /usr/bin/time ./filetest
Total files read: 3001
0.02user 0.13system 0:41.01elapsed 0%CPU (0avgtext+0avgdata 3488maxresident)k
102104inputs+0outputs (25major+137minor)pagefaults 0swaps
blast007@debian:~$ /usr/bin/time ./filetest
Total files read: 3001
0.01user 0.03system 0:00.04elapsed 93%CPU (0avgtext+0avgdata 3468maxresident)k
0inputs+0outputs (0major+148minor)pagefaults 0swaps
I'll also attach my 15000 generated replay files. They're enough for listing replays, but will error out if you try to actually load one since it's missing data after the world database. They even have randomized replay durations, so sorting by duration is an option for testing performance.
Do we have a target for how fast we should be able to read the headers of 15000 files? How fast is fast enough?
Using modern C++:
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <chrono>

void test(std::filesystem::path const& path)
{
    for (auto const& dir_ent : std::filesystem::directory_iterator(path))
    {
        std::ifstream f(dir_ent.path(), std::ios_base::in | std::ios::binary);
        if (!f.good())
        {
            std::cout << "problem opening " << dir_ent.path() << std::endl;
            return;
        }

        uint32_t magic;
        auto expected = sizeof(magic);
        f.read(reinterpret_cast<char*>(&magic), expected);
        auto nRead = f.gcount();
        if (nRead != static_cast<std::streamsize>(expected))
        {
            std::cout << "problem reading " << dir_ent.path()
                      << ", read " << nRead << " of " << expected << " bytes."
                      << std::endl;
        }
    }
}

int main()
{
    auto path = std::filesystem::path("recordings");

    auto start = std::chrono::steady_clock::now();
    test(path);
    auto stop = std::chrono::steady_clock::now();

    std::cout << "Elapsed: "
              << std::chrono::duration_cast<std::chrono::duration<double>>(stop - start).count()
              << " seconds"
              << std::endl;
    return 0;
}
On an iMac (3.3GHz Intel Core i5, 32GB 2667MHz DDR4, 2TB flash storage), it ran through your files in about 2 seconds:
jwmelto@Goliath bzflag % ./a.out
Elapsed: 2.23998 seconds
jwmelto@Goliath bzflag % ./a.out
Elapsed: 0.265397 seconds
Does that help?
Basically the same on my test system (with the 3001 x 16MB replays):
blast007@debian:~$ /usr/bin/time ./filetest3
Elapsed: 40.2098 seconds
0.04user 0.14system 0:40.37elapsed 0%CPU (0avgtext+0avgdata 3644maxresident)k
102456inputs+0outputs (26major+140minor)pagefaults 0swaps
For the 15000 files, here's what it's like on that Core 2 Duo system with the hard drive:
blast007@debian:~$ /usr/bin/time ./filetest3
Elapsed: 93.2741 seconds
0.13user 0.66system 1:33.44elapsed 0%CPU (0avgtext+0avgdata 3504maxresident)k
133448inputs+0outputs (26major+138minor)pagefaults 0swaps
And on a $6/month Digital Ocean droplet (1 shared CPU core, 25GB SSD, 1GB RAM):
blast007@xs:~$ /usr/bin/time ./filetest3
Elapsed: 4.91171 seconds
0.18user 0.69system 0:04.92elapsed 17%CPU (0avgtext+0avgdata 3564maxresident)k
132480inputs+0outputs (9major+145minor)pagefaults 0swaps
Looks like your files are on a network drive? That's going to be painfully slow.
No, there are no network drives involved in any of this.
So I'm kinda stumped on how to proceed with fixing this situation. On my Leagues United replay server, I have over 13,000 replays, and running /replay list is extremely inefficient and leads to server freezes.

https://github.com/BZFlag-Dev/bzflag/blob/3be72c3aeba3f5f5e118a8279fb885333dc2b5b7/src/bzfs/RecordReplay.cxx#L865-L942
What I've gathered from reading through this code is that opening 13,000 files and extracting the match date isn't the most efficient solution. Even though sendFileList will only output 100 results to the user as private messages, the server is still processing all 13,000 replays.

My thought for an immediate fix (2.4) was to change this functionality to use the filesystem "last modified" date for sorting by time, even though that can sometimes be inaccurate. (This is not the cause of the slowdown, so doing this would prove useless.)

For 2.6, I would think having a replay server do some sort of caching on boot to optimize these listings would be ideal.