eyeonus / Trade-Dangerous

Mozilla Public License 2.0

Database / needed file size optimisations #181

Open ultimatespirit opened 3 weeks ago

ultimatespirit commented 3 weeks ago

As requested in the server hosting upgrade issue, here's a new issue to discuss the data storage problems / optimisations that can be done.

I already mentioned a few things here, albeit based entirely on what I was reading in that issue rather than on actually looking at the database or related files. I'm going to list here every area that could be changed to reduce the on-disk and server bandwidth requirements of trade-dangerous. Many of these are likely already known, but I figure it's good to list them explicitly in one place (and I don't know which are or aren't already known).

Note to start with: I'm going into this with the following basic assumptions. I know they don't fully hold for the current form of the database, at least to some extent, due to historical considerations (everything used to be human-input data):

With those out of the way, let's start with the SQL database itself, taking the format of listing how it currently is, and how it can be changed.

N.B. I am not an SQL expert. I'm using this SQL datatype reference sheet for what the column sizes in a typical strictly typed SQL database would likely be. This project appears to use sqlite3, so I'm also using this resource on SQLite types, with the caveat that I'm assuming the sane option of strict typing is in use, since there's no way dynamic typing is helpful here. Oh, also, I'm looking exclusively at what my own local database contains from a working copy of trade dangerous.

SQL Database

First off, the database has 14 tables:

I'm going to ignore "Added" and "sqlite_sequence", as I have absolutely no idea what those tables mean. It seems "Added" is some way of marking from what era of the game some information is, but I'm not sure it's relevant anymore. I'm also going to ignore both of the FDev tables; they appear to track data from a master FDev list, or are otherwise not exactly under the program's control.

Category

This is an integer ID field and a VARCHAR(40) name. It appears to represent the per-item categories from the in game UI. This is of questionable value to the program and likely could be removed.

Item

Contains:

If the Category table gets removed, then category_id can be removed. Similarly, ui_order is of questionable value. Otherwise, there isn't much to do here.

I'll elide any INTEGER tags for id variables going forward.

RareItem

More or less as good as it can get, except...

Ternaries

I'm not quite sure what "suppressed" is, but it and "illegal" represent our first ternary datatype analogue. According to the SQL Server reference, TEXT, being variable length, requires 4 + number-of-characters bytes. SQLite says nothing specific about storage requirements, only that it stores text in the encoding of the database. I'm assuming that's UTF-8, and since it's variable length it needs at minimum a null byte or similar terminator, so let's say 2 bytes per TEXT(1) minimum.

I don't know how null values are handled by sqlite with regards to storage size, so I will ignore that representing a ternary as "1, 0, or NULL" could potentially save a byte on the NULL case (by representing NULL through storing nothing at all). Instead, we have two options:

The biggest possible savings would, however, come under the assumption that it's worthwhile to combine values into one field. That's likely much harder to efficiently search / filter for (unless it's something SQL engines handle really well?), so it's understandable if they need to remain distinct. Still, at least 1 byte per ternary can likely be saved; a rough sketch of the packing idea follows.
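For what it's worth, here's a minimal sketch (not TD code; the mapping and field count are made up) of what packing ternary fields into a single integer column could look like on the Python side, at 2 bits per flag:

```python
# Hypothetical sketch: pack several ternary flags (unknown / no / yes) into one
# small integer instead of separate TEXT(1) columns. Not the actual TD schema.
TERNARY_ENCODE = {None: 0, 0: 1, 1: 2}            # 2 bits per flag
TERNARY_DECODE = {v: k for k, v in TERNARY_ENCODE.items()}

def pack_ternaries(*flags):
    """Pack up to 4 ternary flags into a single byte (2 bits each)."""
    packed = 0
    for i, flag in enumerate(flags):
        packed |= TERNARY_ENCODE[flag] << (2 * i)
    return packed

def unpack_ternaries(packed, count):
    """Reverse of pack_ternaries."""
    return [TERNARY_DECODE[(packed >> (2 * i)) & 0b11] for i in range(count)]

# e.g. suppressed=unknown, illegal=yes -> one byte instead of two TEXT(1) fields
packed = pack_ternaries(None, 1)
assert unpack_ternaries(packed, 2) == [None, 1]
```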

Ship

ShipVendor

This and UpgradeVendor appear to be empty. This information actually is available to automated systems; it's just not written to a journal file but to transient files instead. My understanding is that the main data network systems after EDDB closed (under the assumption EDDB had this info) saw fit to just stop attempting to keep the outfitting / ship information.

I'm actually interested in the historical trends for this information though, so for what it's worth I may create an EDMC plugin that could extract this information and feed it to a database.

Otherwise I guess they could be removed, along with their corresponding ID tables.

Station

This has 9 ternaries (max_pad_size is actually a quaternary but that still fits in 2 bits or one byte), so at least 9 bytes could be saved from the ternary considerations above, more if they can be combined into packed integers. For the purposes of this program itself, I'm not sure rearm or refuel are that helpful, but they are generally useful information to keep so likely worth keeping anyway.

The modified field's date, however, is interesting. A station's modified time is of little use: all of its parameters are either constants (barring the edge case of a station being marked destroyed / abandoned / under attack), or are modified out of sync with the station itself, like its commodity market vs. outfitting vs. shipyard. This likely could be removed, with some caveats covered in later sections.

Also, the edge case of an abandoned / under-attack / reduced-services station is an admittedly interesting / useful one, but for a trading optimiser it's perhaps fine to just ignore?

System

Added likely could be removed, and modified is of questionable use, as it depends on the latest internal update time of a station, or on outright new data being provided. It's likely not that useful to store just for the edge case of someone wanting to check whether a completely newly discovered / expanded system was updated recently, based on the existence of stations.

StationItem

This one's the big ticket item. A representation of every commodity known to the database at every station. Lots and lots of entries.

It's highly doubtful those level variables are useful; remove them for 2 bytes of savings (assuming they're one byte each). More importantly, under the completeness assumption we do not need modified, and can instead store that information on the associated station (using new date fields for last commodity market update, last shipyard update, last outfitting update, etc.). That should save 4-8 bytes for every single item; in my current database that's 44,169,051 items, so around 300 megabytes (if there's no internal compression involved that cuts it down).
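To illustrate what I mean, a rough, hypothetical sketch of that schema change (the new column name is made up, and DROP COLUMN needs SQLite 3.35+; older versions would need a table rebuild):

```python
# Hypothetical sketch only: move the per-item timestamp onto the station.
# The column name market_modified is invented; the real TD schema may differ.
import sqlite3

conn = sqlite3.connect("TradeDangerous.db")
with conn:
    # One "last market update" timestamp per station...
    conn.execute("ALTER TABLE Station ADD COLUMN market_modified DATETIME")
    # ...instead of one modified value on every StationItem row.
    # (DROP COLUMN requires SQLite >= 3.35; otherwise CREATE + INSERT...SELECT.)
    conn.execute("ALTER TABLE StationItem DROP COLUMN modified")
```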

And that's it for the SQL database tables. Well, with one caveat: it was mentioned that, for efficiency, StationItem should likely be split into two tables, one for selling and one for buying, so that the views StationBuying and StationSelling don't need to exist. I'm not sure on the specifics there, or whether that's outdated information now (it was mentioned about a month ago in an issue), but it doesn't majorly affect the storage requirements provided the date field gets moved onto the station it's associated with. That said, many operations that require sale info would likely also benefit from buy info (making a bi-directional trade route, for example) and vice versa, so the buying/selling-specific views may simply not be needed anyway; just process the StationItem data in Python after retrieval.

The CSV format

I'm not really sure what the distinction is between listings.csv and listings-live.csv. I'm assuming these files can be changed in their format, or at least what gets served by the eddblink server can be changed. Both have the same format: id,station_id,commodity_id,supply,supply_bracket,buy_price,sell_price,demand,demand_bracket,collected_at

id

The id field appears not to be related to any of the SQL parameters above; in fact, it appears to literally just be the line number. It absolutely can be removed, saving roughly O(n log n) bytes, where n is the number of lines in the file. At around 4-5 million lines, that's something like 20-25 MB.

station_id and commodity_id

The data appears to be sorted by station_id and then commodity_id (if it isn't, do that). Then either store the station_id implicitly, by writing it only in the first entry for a station so that an empty field on later lines means "same id", or encode it as a diff from the previous entry. For commodity_id, an empty value could mean "previous value + 1", or again just diff-encode it from the previous line. A reasonable baseline number that all commodity IDs start from could be chosen too, to make the first entry's diff less special; same for station_id. This should save a decent amount of on-disk space (a sketch of the idea follows).
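A quick illustration of that delta idea, purely as a sketch (this is not the current listings format, and the sample row data is made up):

```python
# Sketch only: delta-encode (station_id, commodity_id) assuming rows are sorted
# by station then commodity. An empty station field means "same station as the
# previous row"; in that case commodity_id is stored as a diff from the
# previous row's commodity_id.
def encode_rows(rows):
    prev_station, prev_commodity = None, 0
    for station_id, commodity_id, rest in rows:
        if station_id == prev_station:
            yield f",{commodity_id - prev_commodity},{rest}"
        else:
            yield f"{station_id},{commodity_id},{rest}"
        prev_station, prev_commodity = station_id, commodity_id

def decode_rows(lines):
    station, commodity = None, 0
    for line in lines:
        st, cid, rest = line.split(",", 2)
        if st:
            station, commodity = int(st), int(cid)
        else:
            commodity += int(cid)
        yield station, commodity, rest

# Round-trip check with two invented rows for the same station.
rows = [(128049152, 1, "500,2,100,0,0,0"), (128049152, 3, "0,0,0,200,2,50")]
assert list(decode_rows(encode_rows(rows))) == rows
```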

supply, demand, and {buy,sell}_price

Again, nothing much to really do here.

{supply,demand}_bracket

This is presumably the {supply,demand}_level indicator. It should just be removed. Alternatively, make its range 0 to 4 instead of -1 to 3, as it appears the vast majority of items have -1 demand, and that's two bytes for no good reason. Granted, compression algorithms will likely handle either case adequately, but it's still not great.

collected_at

This appears to be the UNIX epoch date of the last update. As the data gets compressed before sending, the duplication of these entries is likely handled fairly well by the compression. However, at the cost of perhaps extra work to process the file, the datetime could be stored only in the very first item per station, or in a separate file altogether associating station ids with update times, which would save probably around 40 MB on disk uncompressed.

Alternatively to much of this, create a binary data format that packs the data as tightly as it can. If applicable, use variable-length integers for supply and demand (you lose 1 bit per byte in exchange for lower values needing fewer bytes; you could instead lose only 1 bit per two bytes by using a base word size of two bytes). Store each station as one large variable-length struct with its length stored explicitly, which removes the date duplication implicitly. IDs could still be diffed from a baseline or stored implicitly in some way; a rough varint sketch follows.
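For concreteness, here's the standard 7-bits-per-byte varint scheme (the same idea protobuf uses) in a few lines of Python, just as a sketch of what "variable-length integers" means here:

```python
# 7 payload bits per byte, high bit set when more bytes follow. Small
# supply/demand values then take 1-2 bytes instead of a fixed 4.
def encode_varint(value):
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | (0x80 if value else 0))
        if not value:
            return bytes(out)

def decode_varint(data, offset=0):
    result, shift = 0, 0
    while True:
        byte = data[offset]
        result |= (byte & 0x7F) << shift
        offset += 1
        if not byte & 0x80:
            return result, offset
        shift += 7

assert decode_varint(encode_varint(300))[0] == 300   # 300 fits in 2 bytes
```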

Server Bandwidth / in flight data

Running the eddblink import causes my system to download around 200 MB or so of data every update, though perhaps if I ran it every day it'd be less (the one time I ran it within 24 hours of the last invocation it was much less greedy). This is mostly from downloading the gzipped listings.csv.

There are a few things that would help here:

As the size seems to be consistent every time I get listings.csv, I'm guessing the whole file is being sent every time regardless of whether the client has downloaded it before, aside from within the same day, where I think listings-live.csv comes into play as a rolling summary of the day's info. This is pretty inefficient, as I'd guess that, given the game's population, nowhere near as many locations change their data in a day as are contained in the full listings.csv file.

If that assumption holds true, in exchange for slightly more storage usage on the server, it would likely make sense to store what are essentially diffs from day to day, as follows:

The update process on the client side would then be:

Essentially, this creates an end-of-day transaction log for the data, under the assumption that the set of things that gets updated is smaller than the full listings.csv file. If it's roughly equivalent, then pre-made diffs will be less helpful bandwidth-wise unless they're diffs from each past day to the current day, and even then barely. (A rough sketch of the client-side flow is below.)
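Something like this on the client side, purely as a hand-wavy sketch; the URL layout, diff filenames, fallback threshold, and helper functions are all made up:

```python
# Hand-wavy sketch of a diff-based client update; nothing here reflects the
# real eddblink plugin. Server layout and helpers are hypothetical.
import datetime
import urllib.request

BASE = "https://server.example/listings"      # hypothetical server layout

def apply_full(stream):
    """Placeholder: replace the local listings data wholesale."""

def apply_diff(stream):
    """Placeholder: upsert only the rows contained in one daily diff."""

def update(last_sync: datetime.date, today: datetime.date) -> None:
    days_behind = (today - last_sync).days
    if days_behind > 7:
        # Too far behind: the diffs would add up to more than the full dump.
        apply_full(urllib.request.urlopen(f"{BASE}/listings.csv.gz"))
        return
    day = last_sync
    while day < today:
        # Replay each end-of-day transaction log in order.
        day += datetime.timedelta(days=1)
        apply_diff(urllib.request.urlopen(f"{BASE}/diffs/{day.isoformat()}.csv.gz"))
```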

Conclusion

Whew, this ended up being a real long one. Hope it can be helpful in discussion.

eyeonus commented 3 weeks ago

I'm going to just give an explanation of the things mentioned, as there are a few things you state you don't understand; I still have to finish reading your post.

SQL Database

I'm going to ignore "Added" and "sqlite_sequence", as I have absolutely no idea what those tables mean.

Added is used to mark when a system was added to the game. I don't believe it's needed; it remains due to historical inertia.

sqlite_sequence only contains one value, name:"Added", seq:36, which is the number of rows in Added. It would probably be removed if Added was. It is not in the SQL or mentioned in TD, so it's probably an internal table auto-generated by SQLite3 when the DB is created.

ShipVendor

Ship, ShipVendor, Upgrade, and UpgradeVendor are populated by the spansh plugin with TD >= 11.5.0

Once Tromador updates the server, this will also be true with the eddblink plugin.

Ship.cost is no longer NOT NULL, and the fields in Upgrade have changed, due to what data is available via Spansh. (NOTE: Since galaxy_stations.json is huge and has no whitespace, running the spansh plugin with -w will cause it to output the "Shinrarta Dezhra" system to ./tmp/shin_dez.json with an indent of 4, since "Jameson Memorial" has all the ships and modules.)

StationItem

* `from_live`

This is used to mark if the data is from Spansh's nightly dump or from the live feed from EDDN.

The server runs tradedangerous-listener, a program that uses TD to provide the files eddblink imports.

The server imports the Spansh data when that gets updated, and then performs its daily dump to listings.csv, giving every entry from_live = 0.

Any entry updated by a message from EDDN is marked for inclusion for the differential dump to listings-live.csv with from_live = 1. The differential dumps are performed every 30 minutes by default.

Making it an integer was a bad idea, blame my then SQL-ignorant self.

The CSV format

I'm not really sure what the distinction is between listings.csv and listings-live.csv.

Explained above. listings.csv is the full dump provided by the server, listings-live.csv is the differential dump.

Any format changes would have to be reflected in both the listener and the eddblink plugin. There are no other considerations I'm aware of that prevent or discourage making changes to the format.

Server Bandwidth / in flight data

* Consider if you even need to be sending the full listing so often.

That is a bug I thought I'd fixed. It should only download the file if the copy on the server is newer than the local copy, by checking the modified time of both. The plugin is supposed to set the modified time to be equal to that of the one on the server once it's finished downloading, to fix the problem of OSes setting the modified time of the local copy to the local time, resulting in anyone in a negative time zone (e.g., UTC-4) always having an "older" version.

https://github.com/eyeonus/Trade-Dangerous/blob/e059051ca9ef57eff2d9441134a84b67918a00a9/tradedangerous/plugins/eddblink_plug.py#L157C1-L158C60

        # Change the timestamps on the file so they match the website
        os.utime(localPath, (dump_mod_time, dump_mod_time))
ultimatespirit commented 3 weeks ago

Ship, ShipVendor, Upgrade, and UpgradeVendor are populated by the spansh plugin with TD >= 11.5.0 ... the fields in Upgrade are changed, due to what data is available via Spansh.

Great to hear. Personally, I'd suggest renaming Upgrade to Modules at some point, since it's going to continue to be supported and that's what they're actually called in game. But that's understandable historical inertia.

This is used to mark if the data is from Spansh's nightly dump or from the live feed from EDDN.

As I'm not too familiar with the data contained in Spansh's nightly dumps, what is the difference between the two? My understanding was that Spansh is ultimately aggregating the data from EDDN / EDDN is the primary source of data in the community at this time. So the Spansh nightly dump of things updated in the last 24 hours should be nearly equivalent to the server's listener (barring network blips on the server side, or Spansh's side for that matter), no?

As for marking data changes within the last window between dumps, there's no real reason to store that within the main database, especially if it's primarily for the server. Just create an extra database copy of just the new items, or a listing of the updated IDs that the server can load up (in a separate application / process) to dump out the relevant items. Or just create that listing as a streaming file (either don't prune duplicates, or do so periodically).

Now, it could be worthwhile to make any client capable of doing what the server does directly from TD (i.e. wrap the listener into the repo), in which case having one table listing what needs to be dumped next could be useful even off the "server". But then, rather than paying the cost on every StationItem, make it a separate table that only stores the StationItem ids that have changed, and clear that table when a dump is finalised. Or, again, make it a fully separate listing: especially with TD being a Python app, which is relatively unparallelisable, it's very likely there'd be no real performance cost from having a separate process as a dedicated "take the latest change, consolidate those changes into a unique listing, and write out the dump file on demand" type service, maybe caching the item IDs immediately so it's not all held in memory.

The server runs tradedangerous-listener, a program that uses TD to provide the files eddblink imports. The server imports the Spansh data when that gets updated, and then performs its daily dump to listings.csv, giving every entry from_live = 0.

I'm familiar with tradedangerous-listener, in the sense that I've run it myself too (I stopped doing so when I realised it seemed to forcibly re-download the initial day's data every hour, rather than only listening to EDDN, duplicating a lot of work). I never looked into how it generated listings.csv though. So, listings.csv is solely the items that got updated in the last 24 hours, every day? I'm honestly astonished that 250 MB of data is generated for this game every day, even if there's a lot of redundancy that can cut it down. I knew the Spansh nightly dumps were around a gig usually, but I thought that was mainly from uninhabited system updates / general system updates rather than commodity market data... Pretty cool.

That is a bug I thought I'd fixed. It should only download the file if the copy on the server is newer than the local copy, by checking the modified time for both.

Oh, you did. I was referring to whether the full amount needed to be sent every day in the first place. It sounds like it truly is a daily dump from what you were saying above, so that was a mistaken impression on my part.

That being said...

The plugin is supposed to set the modified time to be equal to that of the one on the server once it's finished being downloaded to fix the problem of OSes setting the modified time of the local copy to the local time,

That... is not supposed to happen on pretty much any OS, or rather file system driver, that I know of? Modified times are usually stored internally in UTC; it's only the display that's timezone-aware. In Python, the raw mtime field of the stat structure should be UTC, as should os.path.getmtime(). It seems odd that you needed to change the modification times.
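As a quick illustration of what I mean (assuming a local listings.csv exists to stat):

```python
# st_mtime / os.path.getmtime() return seconds since the Unix epoch, which is
# time-zone independent; only the rendering below differs by zone.
import os
import time

mtime = os.path.getmtime("listings.csv")
print(time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime(mtime)))
print(time.strftime("%Y-%m-%d %H:%M:%S local", time.localtime(mtime)))
```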

Anyway, thanks for working on all these things. I don't mean to come off as just complaining a lot without doing anything. I'm just not familiar enough with the codebase to make or analyse these modifications beyond suggestions of datatypes at the moment; maybe I'll get the time to read the code more. I was meaning to refactor the trade and buy commands at some point anyway, in response to my other issue opened about trade...

kfsone commented 3 weeks ago

We should be able to dispose of "Added", now that we have schema changing capabilities.

On how to structure tables - replacing the 8 flags (has market, has black market, etc) with two 8-bit flag masks might be a good idea, but temper against how expensive the bit-twiddling will be from within python.

Also, there's a balance to temper how you organize the data: my original intention was that SQLite would just be an ACID means of reading/writing to disk with transactions, but all queries would go thru the TradeDB object.

Now, however, there are lots of queries that instead lean into SQLite and depend on indexes etc.

I put together a Jupyter Notebook / Colab project examining the various effects of having a single flag (e.g. "has market") stored as a character or an int, nullable or not, with or without the rowid column (if we're going to store unique IDs, we don't need a row id):

https://gist.github.com/kfsone/5a44d68809d369423b9e93e001dbff91

aka (still running right now, and on a free-tier machine so the numbers are a bit higher)

https://colab.research.google.com/drive/13BbvYKpds6BFfac6C4eJNjGTIhl_R5J1?usp=sharing

Tromador commented 3 weeks ago

@ultimatespirit I feel I must point out that you are requesting/suggesting a huge amount of work. TD is old and much of the code comes from a time when the demands on the application were different, what could be done with Python was different and even best programming practices were different. You're not wrong, per se, just understand that you are talking about major rewrites to the heart of the code. For some years it's just been @eyeonus holding it together with duct tape and string, so if some of the decisions seem strange it's because it's what needed to be done to keep things running at all in the changing landscape with the limited time he had, with whatever support I could give in terms of hosting a server. A major rewrite might well be a very good thing, just bear in mind that @eyeonus is also trying to study for a degree in computer science at the same time and can only do so much.

ultimatespirit commented 3 weeks ago

On how to structure tables - replacing the 8 flags (has market, has black market, etc) with two 8-bit flag masks might be a good idea, but temper against how expensive the bit-twiddling will be from within python.

From what documentation I've seen, SQL is supposed to support bitwise operations, so at least a query like "find things that have/don't have X" should be achievable on the SQL side with something like ...WHERE flagMask & bitMask != 0. If you have a dedicated database in a strong relational language, and your task mainly requires querying that database for specific parameters (where said parameters don't really factor into the rest of your algorithm, e.g. padSize shouldn't matter after you've queried for things with the appropriate pad size, since by construction you know the data is correct), it makes sense to just keep that in SQL rather than trying to re-implement it in Python. At least, I can't think of any reason any of those bit fields would matter for the actual trade-route optimisation problem after you get your initial dataset. A toy example of such a query is below.
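For example, something along these lines (toy schema and flag names, not TD's actual tables), where the filtering on the packed flag column stays entirely inside SQLite:

```python
# Toy example: query a packed flag column with SQLite's bitwise AND so the
# unpacking never touches Python. Table and column names are illustrative only.
import sqlite3

HAS_MARKET = 1 << 0
HAS_BLACKMARKET = 1 << 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Station (station_id INTEGER PRIMARY KEY, flags INTEGER)")
conn.executemany(
    "INSERT INTO Station VALUES (?, ?)",
    [(1, HAS_MARKET), (2, HAS_BLACKMARKET), (3, HAS_MARKET | HAS_BLACKMARKET)],
)

rows = conn.execute(
    "SELECT station_id FROM Station WHERE flags & ? != 0", (HAS_MARKET,)
).fetchall()
print(rows)   # stations 1 and 3
```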

Of course, if it ends up being that that sort of information is required / the database is just meant to be a datastore, may make sense to stop paying the cost of SQL and go with a NoSQL key-value store solution that would read/write faster and involve less complicated search queries (at the cost of likely more storage space for the database).

And @Tromador, I think you may be misreading my tone? My intention with this issue specifically is essentially to document every place where this project could potentially squeeze out space / bandwidth savings, since I saw it being discussed in other issues and I've read through the data formats being used here. Treat this as documentation; it's a GitHub issue solely because that's the best way to document things on GitHub. If it's deemed unnecessary or out of scope, it's free to be ignored.

None of it is a condemnation of past decisions; it's not like any codebase is ever perfect from the get-go. Nor is there any expectation of continued support. That being said, as you said, TD is old, and unfortunately the cumulative burden of that past inertia / accumulated issues is more than starting to show. As it currently stands there's very little reason to use TD beyond as a curiosity, due to the noted performance issues. And to me at least that's really unfortunate, as this project seems really cool. So if I can document (or fix) performance gaps, I figured why not, especially when it seemed to suddenly get such a flurry of attention recently.

eyeonus commented 3 weeks ago

If you understand the difference between a full backup and a differential backup, then you understand the difference between listings.csv (full) and listings-live.csv (differential).

Every day, Spansh dumps all the collected data for every system with at least one station in it to galaxy_stations.json, which uncompressed is ~9GB. The source Spansh uses to collect the data it stores is undoubtedly EDDN.

In the period between each dump, as EDDN broadcasts new market messages, the server updates the DB from those messages. The data from these messages will be in the next Spansh dump, but in the meantime it's what goes into listings-live.csv.

When I originally wrote the listener, I did it with the intention of people running it themselves, as until Tromador offered, there was no option for having a TD server post-Maddavo. This is why the from_live is in the DB. In fact, the next thing on my TODO list is to revamp the client-side code of the listener with the recent changes to TD in mind.

kfsone commented 3 weeks ago

From what documentation I've seen, SQL is supposed to support bitwise operations so at least a query like "find things that have/don't have X" should be achievable on the SQL side with something like ...WHERE flagMask & bitMask != 0. If you have a dedicated database in a strong relational language,

The twist here is whether you want to index it or not, and if you do, then you have to pay the indexing overhead. For a long running process, you typically want to offload that to another thread/process, so it doesn't block user operations; that's where sqlite+python was a poor long-term choice, and one I clung to until the moment I discovered that Python threading was worse than a single thread for cpu-bound work (*if you are reading this after Python 3.13, go back to your future where Aubrey Plaza is president)

ultimatespirit commented 1 week ago

Every day, Spansh dumps all the collected data for every system with at least one station in it to galaxy_stations.json, which uncompressed is ~9GB.

Didn't have time earlier to double-check this; I see that the spansh_plug.py plugin only downloads galaxy_stations.json. Same for the tradedangerous_listener.py code. Spansh actually makes available an even smaller, more targeted set of data [1] that chronicles solely the systems updated within the last 24 hours. Considering the eddblink backing server runs the spansh plugin every day (at least, that's what it sounds like it does), it would likely be an immediate win to include some logic to download this targeted file, filter out the systems that have stations, and update the DB to the latest Spansh data that way, rather than downloading the 1.5 GB all-stations dump every time. A rough sketch of the filtering step is below.
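Something like this, purely as a sketch; I'm assuming the dump is a JSON array of system objects with a "stations" list, which should be verified against the real file, and for real use a streaming parser (e.g. ijson) would avoid holding the whole decompressed dump in memory:

```python
# Rough sketch only: pull the 24-hour dump and keep systems that have stations.
# The "stations" key is an assumption about the dump layout; check it first.
import gzip
import json
import urllib.request

URL = "https://downloads.spansh.co.uk/galaxy_1day.json.gz"

with urllib.request.urlopen(URL) as resp, gzip.open(resp, "rt", encoding="utf-8") as fh:
    systems = json.load(fh)

with_stations = [s for s in systems if s.get("stations")]
print(f"{len(with_stations)} of {len(systems)} updated systems have stations")
```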

I think Spansh has offered in the past to include other types of data dumps too, so it could be possible to reach out and get a dump created that's solely the systems with stations updated within the last 24 hours. On the eddblink side these daily subsets could be converted into listings.csv patches and stored, cutting down on having to resend the full listings.csv every day to users who update daily (which is one of the things I detailed above for reducing duplicated work).

(Of course, since spansh's data is definitely EDDN, if the EDDN listener is running on the server constantly anyway there shouldn't be any missing data necessitating the server download spansh to begin with, it'd just be a replication of what it was listening to anyway. Nice redundancy though I guess to double check no messages were missed.)

[1] https://downloads.spansh.co.uk/galaxy_1day.json.gz (don't actually click that, it'll go straight to download of a 600mb file or so)