eyeonus / Trade-Dangerous


Server hosting upgrade #148

Open Tromador opened 2 months ago

Tromador commented 2 months ago

My colleagues at AOC have set me up with a minimal Centos 9 install, so I can start migrating services including TD to it.

Opening this as a kinda placeholder, to give updates, beg for testing and handle any requests for hosting changes, given we're starting with a clean slate.

@eyeonus I know I've mentioned this before and your eyes glazed over, but if you can set up a DDNS record for your machine, I can have the firewall opened up so you can SSH and/or SFTP directly to the machine. I would also be prepared to extend this to @kfsone if necessary (either on fixed IP address, or DDNS basis).

Dynamic DNS is a way to have a permanent DNS record for your home network, even if your ISP changes your IP address regularly. Usually you'll need to configure your router to send the IP address information to your DDNS provider. There's lots more information about DDNS available online, including from the place where I get free DDNS hosting.
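If your router can't do the update itself, most providers also accept a simple authenticated HTTP call to record your current address. A rough sketch only - the endpoint, hostname and token below are placeholders, not any real provider's API, so check your provider's documentation:

import urllib.request

# hypothetical dyndns2-style update call; substitute your provider's documented endpoint
UPDATE_URL = "https://ddns.example.net/nic/update?hostname=myhost.example.net"
req = urllib.request.Request(UPDATE_URL, headers={"Authorization": "Bearer MY_TOKEN"})
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())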

Tromador commented 2 months ago

As hoped, native support from the OS. Saves me a lot of cocking about.

[root@localhost ~]# python3.12
Python 3.12.1 (main, Feb 19 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>>
kfsone commented 2 months ago

I already have one: gate.kfs.org (I worked at Demon '93-00, so post, gate, etc. are in my blood. Also, I can't type "demo" to save my life; the muscle memory to hit 'n' is still too strong.)

STRONGLY recommend you get a virtual env set up so the system python doesn't get squirrely

mkdir -p ~/venvs
python3.12 -m venv ~/venvs/venv3.12
. ~/venvs/venv3.12/bin/activate

maybe add some aliases to your startup shell rc

And don't be doing it as root, lol.

Tromador commented 2 months ago

And don't be doing it as root, lol.

Hey - I hadn't created any user accounts at that point, there was only Zuul root.

Also, I got into the bad habit of doing stuff as root on SunOS 3.5 back in '89 and it's ingrained and unlikely to stop any time soon. I don't use sudo nearly as much as I ought to.

That said, I currently run TD server as an unprivileged user and will be doing so again.

System python (out of the box) is 3.9, but 3.12 is a package from the CentOS repo, so it's also a system python. I will add an appropriate alias for the TD user when I get around to creating it. I've got to set up apache before I can do that, or it's not going to serve much of anything. I'll look at venv (it's a python thing, so naturally I know {next to} nothing about it).

ObAnecdote: Guess who typed kill 1 instead of kill %1 at about 2am on a university server I was running for a project and had to persuade unconvinced security to let me into a building so I could get to the console and restart the machine.

Is gate.kfs.org a fixed IP or a DDNS entry?

Tromador commented 2 months ago

Note to self: Listener needs distutils, but this is now rolled into setuptools in Python 3.12, you forgetful bastard: pip install setuptools
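A quick sanity check after installing it (Python 3.12 dropped distutils from the stdlib; with setuptools present the import is satisfied by its bundled shim) - run with the same interpreter the listener uses:

import distutils
print(distutils.__file__)   # should resolve to a copy under site-packages, not a stdlib path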

kfsone commented 2 months ago

Also, I got into the bad habit of doing stuff as root on SunOS 3.5 back in '89 and it's ingrained and unlikely to stop any time soon. I don't use sudo nearly as much as I ought to.

Kids! sudo -i is your friend :)

I've got to set up apache before I can do that, or it's not going to serve much of anything. I'll look at venv (it's a python thing, so naturally I know {next to} nothing about it).

quick recap, no prejudice:

Running

python3.12 -m venv /tmp/venv will create /tmp/venv with a self-contained python installation in it. If you invoke that specific python:

/tmp/venv/bin/python

it will constrain itself to only the packages installed in/by that environment. Likewise if you invoke its pip:

/tmp/venv/bin/pip install traderusty
... installed /tmp/venv/lib/site-packages/traderusty ...
python -c 'import traderusty; print(traderusty.parse_supply_level("100L"))'
error: not found
/tmp/venv/bin/python -c 'import traderusty; print(traderusty.parse_supply_level("100L"))'
(100, 3)

but more usefully it can leverage environment variables so that you don't have to keep remembering which python you're using:

.  /tmp/venv/bin/activate    # < unix, for most shells, 2 spaces just for emphasis
.  /tmp/venv/Scripts/activate.ps1  # < windows, because someone in 2004 thought script files in bin would confuse windows users, I guess that person never used dos (.cmd, .bat. etc)

this will set environment variables and typically morph your prompt to something like:

(venv) $
(venv) $ type python
/tmp/venv/bin/python
(venv) $ type pip
/tmp/venv/bin/pip
(venv) $ deactivate
$ type python
/usr/bin/python
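From inside Python itself you can also tell whether a venv is active - a small sketch using the stdlib only:

import sys

# sys.prefix points at the venv while sys.base_prefix still points at the underlying install
if sys.prefix != sys.base_prefix:
    print("running inside a virtualenv:", sys.prefix)
else:
    print("running the base interpreter:", sys.prefix)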

There's a ton of stuff that leverages these sandboxes: IDEs like VS Code, PyCharm, Eclipse, Qt Creator; tools like tox.

There are a couple super-useful python tools that leverage it: poetry (which assumes a certain directory layout and uses it to make it trivial to get larger python apps up and running) and pipx (which lets you run python packages without installing them)

ObAnecdote: Guess who typed kill 1 instead of kill %1 at about 2am on a university server I was running for a project and had to persuade unconvinced security to let me into a building so I could get to the console and restart the machine.

When I was working at Demon, I got to watch Ronald Khoo in a hurry do:

ssh ermin xterm
ermin$ su -c "ifconfig eth0 down"

(ermin being one of the two hot-swapping centers of the world for the network, the other being? hint: center of world = cow)

Is gate.kfs.org a fixed IP or a DDNS entry?

gate.kfs.org is a CNAME to a DDNS entry. kfs = KingFisher Software; it was kfs-uk because Demon didn't allow 3-letter nodenames in June of '92, became kfs1 in WarBirds, then kfsone when they allowed 6-letter names, and because my first gaming job was with former WarBirds people it stuck.

eyeonus commented 2 months ago

@eyeonus I know I've mentioned this before and your eyes glazed over, but if you can set up a DDNS record for your machine, I can have the firewall opened up so you can SSH and/or SFTP directly to the machine. I would also be prepared to extend this to @kfsone if necessary (either on fixed IP address, or DDNS basis).

I don't have access to the router for my LAN at this point. If and when that changes I'll look into getting a DDNS set up.

Tromador commented 2 months ago

@kfsone

Is all that really necessary? Run as an unprivileged user and alias python='python3.12' should do the trick, no?

@eyeonus

Sorry then, $OLDWORK won't let me open up secure shell access to the world at large.

Tromador commented 2 months ago

I don't have access to the router for my LAN at this point. If and when that changes I'll look into getting a DDNS set up.

I hope I remember correctly that you said you use Ubuntu; these instructions might help. That said, we've managed this long without your having ssh access - it would probably be handy, but if it looks like a struggle, I'm sure we'll continue on just fine.

eyeonus commented 2 months ago

Nope. Arch. But! https://wiki.archlinux.org/title/Dynamic_DNS

kfsone commented 2 months ago

Is all that really necessary? Run as an unprivileged user and alias python='python3.12' should do the trick, no?

"all" boils down to creating the virtualenv dir, and activating it in the .profile or .bashrc file.

Commands like pip will happily uninstall packages or install upgrades in a different part of the python modules path than they were originally in.

virtualenv is the secret sauce to python not intermittently needing obliterating and reinstalling.

venv_location=/opt/venv.312   # or path of your choosing
sudo install -m 0755 -o tradeuser -d $venv_location
echo " . \"${venv_location}/bin/activate\"" >>~tradeuser/.bashrc  # or .zshrc, or .profile per your preference

next time you open a login-shell as tradeuser, it will be using its own python environment safely, and recreating it is a simple matter of deleting the venv dir and repeating those commands.

Tromador commented 2 months ago

Commands like pip will happily uninstall packages or install upgrades in a different part of the python modules path than they were originally in.

All I'm seeing here is how python is a fundamentally broken piece of software that can't keep its versioning straight and breaks things, so rather than fix it, they invented this whole business I now am expected to learn.

I mean, your cut and paste drop in to .bashrc is fine, but does that mean it will take a whole copy of python, /usr/lib/python3.n and whatever else, then do I install modules there or into the system? Do I do this for every python app leading to multiple copies of python strewn across the machine? Is that their grand solution? What a mess!

virtualenv is the secret sauce to python creating a horrible disorganised mess as a workaround to python creating a horrible disorganised mess.

This is exactly the kind of crap I am upgrading the host OS to avoid, with a user having its own environment, bin, lib, share, etc because the OS won't run modern software. Now you are telling me I still need to make a special environment because python sucks by design!

eyeonus commented 2 months ago

You don't have to use virtual environments. IFF you want to keep X from interfering with Y, you can put one or the other into a venv, like, for example, if your system (X) needs to use python2, but your program uses python3 (Y).

If you're only planning on using this server for the purposes of hosting the TD server, then I don't personally think you need to bother with it.

Tromador commented 2 months ago

You don't have to use virtual environments. IFF you want to keep X from interfering with Y, you can put one or the other into a venv, like, for example, if your system (X) needs to use python2, but your program uses python3 (Y).

If you're only planning on using this server for the purposes of hosting the TD server, then I don't personally think you need to bother with it.

They aren't supposed to interfere at all, that's why I'm kinda horrified. All the binaries have the version (e.g. python3.12, pip3.12) and the support files are also in version specific directories (e.g. /usr/lib/python3.12). The scenario Oliver is painting appears to be one where python will mess itself up regardless.

eyeonus commented 2 months ago

It might, if you run things using python {command} rather than, say, python3.12 {command}.

kfsone commented 2 months ago

In principle, they shouldn't interfere. In practice they won't interfere unless there is 3rd party influence such as ... using an operating system package manager to install parts of python.

It's really just a microcosm of the issue that all package managers end up with: running brew and port and then updating a package from the apple store? You could use yum, snap, and apt all on the same machine without problems until one of them needs a different GLIBC.

I'm jaded because I've been hot-supporting my employer's python toolchain and over the last year it's been kicking my ass as a result of different cadences of python adoption/keep-up between various different vectors.

It's taken for granted that MacOS ships with python, but actually https://developer.apple.com/documentation/macos-release-notes/macos-catalina-10_15-release-notes#Scripting-Language-Runtimes

Scripting language runtimes such as Python, Ruby, and Perl are included in macOS for compatibility with legacy software. Future versions of macOS won’t include scripting language runtimes by default, and might require you to install additional packages. If your software depends on scripting languages, it’s recommended that you bundle the runtime within the app. (49764202)

You'll also find various inflection points with python on modern machines these days where python-facing things will say "oh, I can't do that, because such-and-such is managing the packages".

It isn't really python's fault - they have a well spelled out way to organize things - and people have trampled it.

And - for what little it's worth - virtualenv existed long, long before this was really a serious issue; it was created mostly for CI and development purposes.

Another way to think of it is a "green install" of python that knows to stay inside its box.

It will try to use symlinks, but it can only do that so much.

kfsone commented 2 months ago

I was going to demonstrate the os-package vs python-package conflict with centos and yum, but TIL: centos has been discontinued? https://www.redhat.com/en/topics/linux/centos-linux-eol

kfsone commented 2 months ago

Eg1.

Imagine Wheel 2.5 moves exceptions into "wheel.exceptions" and errors into "wheel.errors", and you have pip-installed wheel 2.5.5. During a security update, openssh-server switches from libxz to a python-backed temporary workaround, which requires wheel but for compatibility reasons pins wheel 2.1.3. yum/apt know nothing about pip packaging; all they know is that python3-wheel is not installed. So now you have wheel/exceptions.py and wheel/errors.py but wheel.py doesn't reference them.

from wheel.exceptions import PackagingError  # <- resolves against the wheel 2.5.5 layout
from wheel import package                    # <- gets the wheel 2.1.3 implementation of package

try:
  package("tradedangerous")
except PackagingError:  # <- suddenly stops working, because 2.1.3 raises wheel.PackagingError, not wheel.exceptions.PackagingError
  ...
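One way to spot that kind of split install is to check where the interpreter is actually loading each piece from. A defensive sketch, not anything TD ships - the helper name is made up:

import importlib

def report_origin(modname: str) -> None:
    # print the file a module was really imported from
    mod = importlib.import_module(modname)
    print(modname, "->", getattr(mod, "__file__", "<namespace or builtin>"))

# e.g. report_origin("wheel"); report_origin("wheel.exceptions")
# if the two paths sit under different site-packages trees, you have the mixed 2.5.5 / 2.1.3 mess above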

Eg2. Realizing the problem, you attempt to yum/apt remove python3-wheel

yum/apt again know nothing about Python packaging so they remove the bits of wheel they put there, but the directory is not empty so it is left there. wheel.py goes away but exceptions.py remains, and weird behaviors ensue - the package appears to be both there and not there.

Eg3. A security update forced a new version of Python3.12. A few weeks later, PyPI modules have a related change. The packages you've pip installed are unaffected. A little later still, the downstream debian/yum python3-x wrappings for them DO get updated. One of these packages specifically depended on python's zlib package, and the change removes the zlib dependency.

During this update, your python-level reliance on the zlib package counts for nothing and stuff breaks.

Other places I've seen things get fouled up:

What makes it start to get really murky is when you learn that a few years ago pip added support for user-installed packages - the equivalent of /bin, /usr/bin, /usr/local/bin, /opt/bin etc. "I can avoid that by not using it" - except there are 3rd party things that specifically use --user without asking; it was considered a best practice for a while, until people realized that now you can have wheel 2.5 installed in user packages and wheel 2.1.3 installed in site-packages, and oddly "from wheel import exceptions" works even though "wheel" itself is the 2.1.3 version...
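You can see both layers on any given interpreter with the stdlib site module - a quick sketch:

import site, sys

print("user site-packages:", site.getusersitepackages())
print("system site-packages:", site.getsitepackages())
print("user site enabled:", site.ENABLE_USER_SITE)
# a package present in both places is resolved by whichever directory comes first on sys.path
print([p for p in sys.path if "packages" in p])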

It's also no different from the problems you run into when you have multiple compilers for a given language installed. A recurring pita as a linux dev these days: if you install gcc before clang then, unless you're very specific with env vars and command-line parameters, you will end up unable to compile really random things because clang is using some of gcc's headers...

I wouldn't (don't) allow external users into a machine I don't want to have to maintain without 'jailing' them into a virtualenv with:

$pyinterpreter -m venv $venvpath
echo >>$theiruser/$shellrcfile ". \"${venvpath}/bin/activate\""

for all I know it hasn't actually saved me any headaches, but I do know that I had a lot of such headaches before I started doing it :)

Tromador commented 2 months ago

I was going to demonstrate the os-package vs python-package conflict with centos and yum, but TIL: centos has been discontinued? https://www.redhat.com/en/topics/linux/centos-linux-eol

I'll read the longer post in a bit, but CentOS has been EOL'd and CentOS Stream is the replacement. The new host is rocking CentOS Stream 9. Considering they kept the same version numbering, I dunno why they felt the need to change the name and make a fuss about it, but whatever.

Other than the basic installation of 3.12, I'll be installing modules (any and all modules) with python3.12 -m pip install, so python will be handling that. Anything the system might need will default to grabbing (and I guess maybe breaking) python3.9, which is the default - I won't be touching or using that for TD, so I think we'll be fine. Otherwise I have to learn and understand venv properly and completely, because I'll have to support it.

External users won't have root access, so they wouldn't be able to add modules etc anyway, unless they build their own python in their $HOME, which is what I've had to do on the older host and what I'm trying to avoid in general.

Tromador commented 2 months ago

@kfsone Can you give me an alternate contact method which is safe for me to send you a password for login to Quoth?

kfsone commented 2 months ago

If you view trade.py it's in there - I'd rather not also add it in a ticket comment, I don't want copilot getting all excited.

kfsone commented 1 month ago

Trom/Eye, catch me up on what the goal here is - is a different layout of the data a possibility?

I found one that can reduce a 2.5GB listings.csv to 1GB - so at the very least that's ~2GB less to write, transfer, store, and read. It also gzip compresses (with default compression) down to 102MB where the listings.csv compresses to 250MB

I made an experimental, jsonl-based format but the tricks might be worth using in the csv.

In [6]: %%timeit fh = open("eddb/listings.csv", "r")
   ...: for _ in csv.reader(fh):
   ...:   pass
   ...:
27.3 s ± 380 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %%timeit r = orjsonl.stream("Listing.tdml")
   ...: for _ in r:
   ...:   pass
   ...:
11.3 s ± 81.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In the second, we also had the values parsed into actual numbers for us as we went.

The tricks I used might be applicable to the csv format to reduce that:

  1. only print the station id when it changes,
  2. print only the max timestamp for each station,
  3. print the timestamp with the station line,
  4. subtract an "epoch" value from timestamps to make them 2-3 bytes shorter,
  5. take advantage of the list being in-order and print the difference between station ids - this saves 5-7 bytes per station,
  6. print a list of item_ids at the start of the file, and then each station-listing uses the 1-3 digit index rather than the id, saving 5-7 bytes per item
  7. print 'level + 1', so that "unknown" becomes 0 rather than -1, this reduces roughly 1 byte for 30% of the lines in the file,
  8. only print units and level when price is not zero, (in the csv: 0,,)

It looks like this:

{"version":1,"epoch":1700000000,"items_ids":[128049152,1,1,1,1,1,1,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,2,2,1,1,1,3,1,1,1,2,4,2,1,2,1,1,1,3,1,1,2,1,1,1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,2,2,1,2,1,1,1,2,421,1,1,1,14356,602718,1,5,2,1,1,1,1,1,1,1,258,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,583,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,32,257,1,529,1,1,1,1,1,2566,1,324,1,677,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,22,1,1,1,1,131,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,115,1,269,74,1,34,1,1,257,776,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,8168,1,1,1,1,1,1,1,1,1,1,1,45866,4262,1,1,1,1,1,363,4736,1,1,3463,7676,44685,1,13,1,1,1,31338,64031,14153,11009,8856,1,1,1,1,1,1,1,257,1,1,1,1,1,1,1538,1,1,1,1,1,1,1,1,1,34345,2570,21810,19515,12859,3825,1,2828,308,1,2,1,1,2,1,1,1,1,1,1,1,8050,1,1,1],"doc":"https://github.com/eyeonus/Trade-Dangerous"}
{"s":128000000,"t":15464317,"n":132}
[0,0,[81368,885,0]]
[1,0,[54072,12732,0]]
...
[20,[1130,7255,0],[1082,1,0]]
[21,[703,14201,0],[657,1,0]]
...
[382,0,[139596,81,0]]
[387,0,[149240,81,0]]
{"s":512,"t":15369994,"n":56}
[1,[47930,1,0],[47389,1,0]]
[2,[44908,5,0],[44396,1,0]]

The receiver will have to patch the item-id list - I compacted that by printing the change in id number.

{"version":1,"epoch":1700000000,"items_ids":[128049152,1,1,...],"doc":"https://..."}

The first line summarizes the format: "epoch" is the number I subtracted from each timestamp, and item_ids is an incremental list, so the second id isn't 1, it's 128049152+1, etc. doc lets us include a link to an explanation.

after that the remaining lines are either a dict for a new station, or a list.

{"s":128000000,"t":15464317,"n":132}

s is the station id relative to the previous one - so this is 0 + 128000000. t is the timestamp minus epoch, and n is the number of lines to follow for this station.

[0,0,[81368,885,0]]

item_index, supply, demand

Here the item index is 0, so it is 128049152 from the first line; the second 0 indicates there's no supply, while the demand is [price, units, level].

there are 131 more lines for this station and then

{"s":512,"t":15369994,"n":56}

s is the station_id offset relative to the last station, i.e. this is 128000000 + 512.
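To make the delta scheme concrete, here's a rough reader sketch for the layout described above (names and structure are assumed from this comment, not taken from the gist):

import json

def decode(lines):
    # walk the jsonl stream, undoing the delta/epoch tricks described above
    item_ids, station_id, epoch = [], 0, 0
    for raw in lines:
        row = json.loads(raw)
        if isinstance(row, dict) and "items_ids" in row:
            epoch = row["epoch"]
            running = 0
            for delta in row["items_ids"]:      # incremental ids -> absolute ids
                running += delta
                item_ids.append(running)
        elif isinstance(row, dict):
            station_id += row["s"]              # station id stored as offset from previous station
            yield ("station", station_id, row["t"] + epoch, row["n"])
        else:
            idx, supply, demand = row           # 0 on either side means "nothing there"
            yield ("listing", item_ids[idx], supply, demand)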

Code: https://gist.github.com/kfsone/99ed69bf64103de8577210b5220c7b74

kfsone commented 1 month ago

Second option: A binary format.

I don't yet know whether this is something I'm going to suggest for TD to replace the .prices and .db files with, or a transfer format.

A large part of the performance overheads at the moment is getting the data off disk and into a useful format, and then searching it.

With SQLite we're paying for swings and roundabouts, and then using the space to play football.

I think the next phase needs to be moving to our own binary format(s).

I've written another experiment that reads in a clean listings.csv and then builds a binary-formatted file wastefully - that is, for every station, it allocates enough room to store 512 items or 8kb of listing data for each station. (If we have a station data-modified time instead of per-item, I can reduce that to 2kb per station).

Each station then has two 64-byte "availability mask" which are stored in a separate, contiguous set of blocks so that you can rapidly scan for stations selling specific items.

This turns a 2.5GB listings csv into a 2.94GB .data file.
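The mask idea in miniature - an assumed layout for illustration (one bit per item index, packed into 64 bytes = 512 bits), not the actual draft code:

def make_mask(item_indexes):
    # set one bit per item index the station lists
    bits = 0
    for idx in item_indexes:
        bits |= 1 << idx
    return bits.to_bytes(64, "little")          # 512 bits

def has_item(mask: bytes, idx: int) -> bool:
    # test a single item's bit without unpacking the whole station record
    return bool(int.from_bytes(mask, "little") >> idx & 1)

# e.g. has_item(make_mask([0, 3, 17]), 17) -> True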

What I could perhaps do is put all the files together in a subdirectory so that if we need to regenerate/update it, we can do it in a second folder and then swap folders atomically.

This is very much a draft version, and honestly doing bit-twiddling in python feels ... icky; I'd be more inclined to do it in c/c++/go/rust/zig.

But even so, the draft python version reads the 2.5gb listings.csv file in 105 seconds, and writes out the data file (headers, items, stations, supply, and demand) in 30s, with lots of room for optimization - it could quite reasonably be parallelized so that the writes are happening while listings.csv is being read.

code: https://gist.github.com/kfsone/68ae786cd3fe1e4fca36bfc222934900

eyeonus commented 1 month ago

I'm fine with any changes to the data format, both for import purposes and storing in the database, as long as whatever is used has the appropriate information, such as using the 'official' FDev IDs and names for everything, takes account of movable stations, and doesn't make it horribly difficult to squash the bugs we will inevitably encounter.

As far as the timestamp thing is concerned, the reason stations have a timestamp and the items have their own timestamp as well is because a station's market is not always updated whenever a station is.

I'm not entirely sure, and the EDCD schemas don't really help give me a better clue, but it looks like a station gets updated whenever a CMDR docks, but the market is only updated when a CMDR opens the market after docking. I could be wrong about this, but I do know it is possible to have a different timestamp on the items versus the station itself, I've seen quite a few instances of this occurring.

It might be better to use the market timestamp, I believe the station timestamp >= the market timestamp, but I'd have to find a few instances of that being different in order to verify my assumption

eyeonus commented 1 month ago

Regarding specifically the import data format, I believe @Tromador would concur, but if not I'm certain he will say so.

Tromador commented 1 month ago

My goal is to get the hosting up to date and thus get rid of all the kludges I have had to implement to get an application which relies on modern software running on a host with a twenty year old operating system.

I've been doing a couple of bits with the DNS (or rather, sending instructions to my friendly hostmaster) which hopefully will be done by close of business tomorrow and I should be able to start migrating services then, including TD.

From my point of view it's then done. Anything with data formats you guys want to change is fine with me and probably will be easier and more supportable on the current CentOS than what we have now. I have mariadb installed already or can set up postgres. I've also worked with stuff like mongodb in the past if you want to move away from sql completely, makes little difference to me. If you want to go to a proprietary binary format, that's fine too.

Just so long as it's supportable. I don't think eyeonus realises just how much I firewalled a lot of stuff away from him when we were having a lot of problems, but there was a period when I was getting emails or private messages every other day that listener had died or some such. Wasn't any point bothering eyeonus with issues he knew about, so I just did what was necessary to provide service - doing the standard unsung hero sysadmin tasks that are traditionally unappreciated lol.

So I have to be able to support whatever you want to implement. Beyond that, the world is the mollusc of your choice.

These are long term goals though, no? I want TD on the new host asap. I'll then want some testing help, then once we are happy it's a DNS change and we are done.

kfsone commented 1 month ago

@eyeonus separating the "station intrinsic" from the "market update" date into two fields would still be a huge gain. I suspect - but defer - that it's very unlikely that you'll get out-of-sequence updates to an item over time. For the per-item listing to be valuable you'd need:

T+0: stn 52, items 1, 3, 17, 31 updated
T+100: stn 52, items 1, 2, 3, 17, 31 updated
T+300: stn 52, items 1, 3, 17, 31 updated
T+320: stn 52, item 2 updated with T+280 as timestamp

and to cope with that we have to do a per-line-item date transformation and comparison which is painful when multiplied out by the number of rows we have :)

kfsone commented 1 month ago

(Incidentally, I noticed in the current listings.csv there are records from 2020, and also some records that have a suspicious price or units value of 2-to-the-power-of-31 minus one, which looks like bad data to me. The next largest value was an order of magnitude smaller...)

eyeonus commented 1 month ago

I do know all the items always have the same timestamp, since they all get updated by the same commodities schema message, so I don't think we'd need to have a full timestamp for every item, just the first. I don't know how easy that would be to implement in comparison nor how much it would save.

Regarding old/bad data, I don't know how much we could do about that, except maybe disqualify it if it seems hokey when it's encountered in the source data?

I am surprised that there exist stations that haven't been visited even once in several years, but then again there are a lot of stations.

Tromador commented 1 month ago

I am surprised that there exist stations that haven't been visited even once in several years, but then again there are a lot of stations.

Remember that I was asking about old stations the other day? I did visit a couple with the oldest data, the ones as you say not visited for years, and it quickly became obvious why. From a trade viewpoint, they just suck. Unless there is some good non-trade reason to go there, it's hard to imagine any trade algorithm recommending a Commander visit them. It may be that in some cases old garbage data is the cause, but mostly they really have little to offer players. Often they don't have much in the way of other services, may not have large pads, are in out-of-the-way places, or a combination of these. The only reasons I can think of to go there are as I did, out of random curiosity, or possibly an altruistic desire to upload new and clean data. My own trip didn't last long; visiting these places was awkward and boring.

eyeonus commented 1 month ago

Hmm, I see no reason not to just remove them from our data, then. Since the listener already has an automatic maxage set, they won't be re-added from the Spansh data just from being too old.

Maybe we should add a purger to the listener to remove old data automatically?

Tromador commented 1 month ago

I've always been loath to remove functionality that someone may want on occasion for some odd reason. As with many things that TD does, I couldn't find any alternative application which can give such reports on unvisited stations.

So - if we keep it in listener then I guess anyone who does want ancient data can import it to their local db with spansh plugin and that will be fine. The option will be there for anyone who really wants it.

kfsone commented 1 month ago

No need to remove it - remember, guys, I'm not playing ED at the moment so my questions are often seeded by ignorance not some unspoken ultra knowledge :)

Tromador commented 1 month ago

It's not removing anything if there's another way to do it. Keeping a leaner standard dataset whilst leaving a method to get the larger dataset if desired seems like a good idea to me. It's similar to the skipvend option in eddblink. If a player has a use for the extended data, they can get it, whilst for normal use we have a more efficient solution.

rmb4253 commented 1 month ago

The server seems to be no longer updating.

Last updates shown are 21st and 22nd May.

Tromador commented 1 month ago

The server seems to be no longer updating.

Last updates shown are 21st and 22nd May.

Nothing to do with the upgrade, we haven't migrated yet. Opened a new issue.

eyeonus commented 1 month ago

It's not removing anything if there's another way to do it. Keeping a leaner standard dataset whilst leaving a method to get the larger dataset if desired seems like a good idea to me. It's similar to the skipvend option in eddblink. If a player has a use for the extended data, they can get it, whilst for normal use we have a more efficient solution.

So, should I implement automatic purging?

Tromador commented 1 month ago

It's not removing anything if there's another way to do it. Keeping a leaner standard dataset whilst leaving a method to get the larger dataset if desired seems like a good idea to me. It's similar to the skipvend option in eddblink. If a player has a use for the extended data, they can get it, whilst for normal use we have a more efficient solution.

So, should I implement automatic purging?

Sure for the server, depending on processing time for purge vs time saved elsewhere. Obviously if purge adds a huge time burden then it's not so helpful.

eyeonus commented 1 month ago

How old would you say is a good age to purge? 1 month? 1 year? I'm guessing somewhere in between....

I think the best place to put it would be in the update checker, immediately before calling the export.

Tromador commented 1 month ago

No, I think a month is fine. Anything older than that has increasing chances of being increasingly inaccurate and a month's worth of data is still what I like to call "really a lot". Certainly plenty to make money from trading, which ultimately is the point of the application. If people want the older data for any reason, they can import spansh once in a while. Once they've used such data to visit an "old data" station, then (presuming they are using edmc or equivalent) that goes into current data in any event. Honestly we'd probably get away with a fortnight and it would be fit for purpose, I think a month is more than plenty.
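For what it's worth, the purge itself should be cheap. A hedged sketch of the sort of thing, assuming the server-side SQLite database has a StationItem table with a modified timestamp column (table and column names are assumed here - check the actual schema):

import sqlite3

def purge_older_than(db_path: str, days: int = 30) -> int:
    # delete listings whose last update is older than the cutoff; returns rows removed
    conn = sqlite3.connect(db_path)
    with conn:   # commits on success
        cur = conn.execute(
            "DELETE FROM StationItem WHERE modified < datetime('now', ?)",
            (f"-{days} days",),
        )
        removed = cur.rowcount
    conn.close()
    return removed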


eyeonus commented 1 month ago

Alright, give me a bit to implement, I'll message you on discord once I've pushed it.

ultimatespirit commented 3 weeks ago

I found one that can reduce a 2.5GB listings.csv to 1GB - so at the very least that's ~2GB less to write, transfer, store, and read. It also gzip compresses (with default compression) down to 102MB where the listings.csv compresses to 250MB

Use zstd at least. I'm not sure how time critical / once-per-day / once-per-hour etc. this listings file is, but if time is less limiting, xz would be even better. Using an old 2.4GB listings.csv I had in my local stores:

$ time tar caf listings.csv.tar.zst listings.csv 

real    0m6.525s
user    0m7.169s
sys 0m2.013s

$ time tar caf listings.csv.tar.xz listings.csv 

real    3m55.975s
user    27m49.788s
sys 0m4.154s

$ time tar caf listings.csv.tar.gz listings.csv 

real    0m56.816s
user    0m56.530s
sys 0m1.818s

$ du -h *
2.4G    listings.csv
340M    listings.csv.tar.gz
275M    listings.csv.tar.zst
120M    listings.csv.tar.xz

All using default settings and on an NVME m2 drive. zstd compresses better than gz does and uses way less time, xz uses way more time but compresses to almost the level your new format compresses to with gz. Just don't use gz unless you're trying to support truly ancient systems still... and even then the application can include its own decompression libraries if it must.
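If it ends up being driven from the Python side, the stdlib already covers two of those - a small sketch (zstd itself needs the third-party zstandard package, so only gzip and xz/lzma are shown; the helper here is illustrative):

import gzip, lzma, pathlib, time

def compress(path: str, opener, suffix: str) -> None:
    # stream-compress a file in 1 MiB chunks and report time taken and output size
    src = pathlib.Path(path)
    dst = pathlib.Path(path + suffix)
    start = time.time()
    with src.open("rb") as fin, opener(dst, "wb") as fout:
        while chunk := fin.read(1 << 20):
            fout.write(chunk)
    print(suffix, f"{time.time() - start:.1f}s", dst.stat().st_size, "bytes")

# compress("listings.csv", gzip.open, ".gz")
# compress("listings.csv", lzma.open, ".xz")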

It looks like this:

{"version":1,"epoch":1700000000,"items_ids":[128049152,1,1,1,1,1,1,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,2,2,1,1,1,3,1,1,1,2,4,2,1,2,1,1,1,3,1,1,2,1,1,1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,2,2,1,2,1,1,1,2,421,1,1,1,14356,602718,1,5,2,1,1,1,1,1,1,1,258,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,583,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,32,257,1,529,1,1,1,1,1,2566,1,324,1,677,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,22,1,1,1,1,131,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,115,1,269,74,1,34,1,1,257,776,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,8168,1,1,1,1,1,1,1,1,1,1,1,45866,4262,1,1,1,1,1,363,4736,1,1,3463,7676,44685,1,13,1,1,1,31338,64031,14153,11009,8856,1,1,1,1,1,1,1,257,1,1,1,1,1,1,1538,1,1,1,1,1,1,1,1,1,34345,2570,21810,19515,12859,3825,1,2828,308,1,2,1,1,2,1,1,1,1,1,1,1,8050,1,1,1],"doc":"https://github.com/eyeonus/Trade-Dangerous"}
{"s":128000000,"t":15464317,"n":132}
[0,0,[81368,885,0]]
[1,0,[54072,12732,0]]
...
[20,[1130,7255,0],[1082,1,0]]
[21,[703,14201,0],[657,1,0]]
...
[382,0,[139596,81,0]]
[387,0,[149240,81,0]]
{"s":512,"t":15369994,"n":56}
[1,[47930,1,0],[47389,1,0]]
[2,[44908,5,0],[44396,1,0]]

after that the remaining lines are either a dict for a new station, or a list.

{"s":128000000,"t":15464317,"n":132}

s is the station id relative to the previous one - so this is 0 + 128000000. t is the timestamp minus epoch, n is the number of lines to follow for this station,

[0,0,[81368,885,0]]

item_index, supply, demand

Here the item index is 0, so it is 128049152 from the first line; the second 0 indicates there's no supply, while the demand is [price, units, level].

there are 131 more lines for this station and then

As you note in your next comment, you really should just use a binary format at this point. However, if printable characters really are strongly needed, here are some more tricks, since frankly the file's at this point already not particularly human readable anyway.

Oh, also,

I'm not entirely sure, and the EDCD schemas don't really help give me a better clue, but it looks like a station gets updated whenever a CMDR docks, but the market is only updated when a CMDR opens the market after docking. I could be wrong about this, but I do know it is possible to have a different timestamp on the items versus the station itself, I've seen quite a few instances of this occurring.

That's correct, upon docking with a station the station's basic information gets updated (provided the commander has relevant telemetry tools), but only upon opening the outfitting page, the shipyard page, or the commodities page do the related journal files get updated, and consequently only then does new data get sent to EDDN. This also means it holds true that a station's update time must be less than or equal to its commodities' update times, for those prices to be "current" (i.e. someone didn't just dock then leave).

Tromador commented 3 weeks ago

You guys might want to move your data format & compression discussions out of this ticket. Once I'm happy the existing application is working on the new host I do plan to close this ticket, so off topic stuff might be lost. Unless it's directly related to setup on the new host, can it be discussed elsewhere please.

A note about hardware though, seeing as NVMe M.2 drives were mentioned - the TD Linux host machine is a vhost running (I believe) on a Windows server with Hyper-V. I have no clue about the precise hardware spec of the Windows host, but bearing in mind how things were when I was a director, I imagine the plan is still making the best use of older hardware because it fulfills the requirement and is much less expensive. I would not be surprised to find a large array of spinning rust, possibly in a separate rackmount enclosure from the CPUs (likely lots of Xeons). When I did work there, we had half a dozen racks full of stuff in an air-conditioned server room with a pair of redundant leased-line Internet connections coming from two different directions (was fun getting permission to dig up the car park). So be thinking in terms of data centre kit, not what you might have (if you are anything like me) in your home gaming PC.

ultimatespirit commented 3 weeks ago

You guys might want to move your data format & compression discussions out of this ticket.

Yea, that's a good idea. Though don't worry about closing the issue, github issues even when closed retain history (unless someone goes out of their way to delete it).

Did want to clarify though, for the hardware thing: I figured the TD server was going to be more storage-speed gated than CPU gated, based on what you've mentioned in the past. That's actually why I mentioned NVMe M.2s - my intention was to basically say "these are purely CPU numbers, storage-related speeds won't make it any faster at least", i.e. the performance could only go down from my results above because of storage. In reality I'd expect that server to compress a bit better, actually, due to likely having better and/or more CPUs than my desktop computer has, granted that really only matters if you wanted to use xz; zstd already basically just takes as long as it takes to write to disk anyway.

Tromador commented 3 weeks ago

Did want to clarify though, for the hardware thing, I figured the TD server was going to be more storage speed gated than CPU gated based on what you've mentioned around in the past.

Actually honestly not sure about that. Because TD runs on a single core, a lot of the possibilities of threading on Xeon CPUs aren't taken advantage of. We briefly (by we I mean I bitched and eyeonus actually did the work) flirted with the idea of proper multiprocessing rather than the current threading (which in python means single core) but it was determined to be "hard and maybe to be addressed later" on account of (iirc) some global variables which need to be available to all threads at all times.
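(To illustrate the distinction for anyone following along - a generic sketch, not TD code: CPU-bound work in Python threads still shares one core because of the GIL, whereas a process pool spreads across cores but everything handed to it has to be picklable, which is where those shared globals hurt.)

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def busy(n: int) -> int:
    # purely CPU-bound work
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    work = [2_000_000] * 8
    with ThreadPoolExecutor() as pool:       # effectively one core: threads serialize on the GIL
        list(pool.map(busy, work))
    with ProcessPoolExecutor() as pool:      # uses multiple cores, but args/results must pickle
        list(pool.map(busy, work))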

So in practice, my home PC runs TD faster, because the individual cores on my I9 are normally faster. Since going to python 3.12 it does appear to offload a small part of the load onto a second processor, which I guess is python being clever in some way, but still the speed of a single Xeon core on the server isn't particularly special. If it could be rewritten to take advantage of proper SMP, then I imagine the server would be blindingly fast.

That said, there are a couple of tasks (spansh import for example) which do seem to happen much faster on the server and I think this must be due to the speed of the storage. I would actually be astonished if storage is really a bottleneck; I don't know if they are using spinning rust or solid state atm (I guess I could ask... ok, email sent) but either way it will be many discrete drives all serving data simultaneously, so it's almost certainly the speed of the bus, not the speed of the individual drives, that bottlenecks the storage, and they won't have skimped on that (or at least we didn't when I was a director and the other two directors haven't changed).

Progress report: I have the web server working, finally. For some reason, there always seems to be some idiot niggle with getting it going - the border firewall, internal firewall, the stupid context, something. And generally not the same niggle as the last time you did this with an earlier version of OS/Apache. I've asked for a temp hostname to be set up and once my hostmaster sorts that for me I should be able to give a test url for you lot to abuse.

Tromador commented 3 weeks ago

Storage solution is an HP 2050 SAN full of spinning rust in various RAID arrays via SAS. This is attached to a high-availability cluster with failover, all on 8Gb fibre. So as suspected, slightly older, but proper data centre kit.

eyeonus commented 3 weeks ago

And here I am without even a NAS to call my own ;)

Tromador commented 3 weeks ago

Anyone who is willing to test the new server, it's up and running. You'll need to change the URL in the EDDB plugin code from elite.tromador.com to test.tromador.com where appropriate.

eyeonus commented 3 weeks ago

You can also pass TD_SERVER="https://test.tromador.com/files/" as an environment variable if you don't want to tinker in the code.
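For a scripted run, something along these lines would work - a sketch, with the exact trade.py invocation depending on how you normally run the import:

import os, subprocess

env = dict(os.environ, TD_SERVER="https://test.tromador.com/files/")
# hypothetical invocation; adjust the command line to match your usual eddblink import
subprocess.run(["python", "trade.py", "import", "--plug=eddblink"], env=env, check=True)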

kfsone commented 3 weeks ago

Hey, last couple of weeks have been hectic, I should get time to contribute again this weekend.

Tromador commented 2 weeks ago

Has anyone been able to test this for me? (other than @eyeonus).

One more successful test person would be a good confidence boost, and then I can send it live.