eyeonus / Trade-Dangerous

Mozilla Public License 2.0

EDDB will soon cease operations #110

Open bgol opened 1 year ago

bgol commented 1 year ago

In case you didn't notice: https://forums.frontier.co.uk/threads/eddb-a-site-about-systems-stations-commodities-and-trade-routes-in-elite-dangerous.97059/page-37#post-10114765

kfsone commented 2 months ago

Just want to thank you guys for keeping this alive for the players; hats off again to eyeonus.

eyeonus commented 2 months ago

I couldn't have done it without the help and support of everyone else involved, most especially Tromador, who hosts the server for downloading updates.

Speaking of which, I think we've got the issues sorted out with having the server use the new spansh plugin, so we should be seeing the server back up and running soon.

The spansh plugin doesn't work on a brand-new install because, among other things, it errors when trying to run load_known_space() with no Station table to pull from. That still needs to be addressed, but in the meantime, running trade import -P eddblink -O clean will build the initial files needed for the spansh plugin to work properly.

Not a big deal for clients, as once the server is up and running, they can keep everything updated using the eddblink plugin just like before, but it's important for the server, since eddblink pulls from the server to do the initial build.

Current TODO list:

  1. update documentation for the spansh plugin to include information about the use of the maxage option added in v10.15.0
  2. update spansh plugin to detect when it is being run on a new install and build the DB rather than just borking
  3. work on making the GUI feature-complete as compared to TradeDangerous Helper
Tromador commented 2 months ago

Just want to thank you guys for keeping this alive for the players; hats off again to eyeonus.

Nice to hear from you Oliver. It's been a long time. Hope you are well.

Now if you would just like to refactor the DB import for us, so it's less memory intensive on multi GB .prices files. 😏 That would be super.

eyeonus commented 2 months ago

Specifically cache.processPricesFile()

kfsone commented 2 months ago

Have you done any profiling for where the bottlenecks are? A quick glance at processPricesFile and I can see it won't be an easy quick-look.

Thoughts:

For a multi-GB prices file, the little stuff moves up in decimal places and bites you. For instance, take the number of lines in the file and multiply it by 30ns -- the overhead of calling a method in Python is about that, assuming no branch misses.

That's roughly 30ms of extra runtime per function in the codepath for every million lines, on top of whatever actual workload the code has.

You're doing a text.find and a split/join to deal with multiple spaces, which makes the strip redundant.

I suspect the find / join / split is probably expensive, and I'm not sure why you need to do it - just use \s+ in the regex patterns? But if there's a reason that I missed in my few-seconds glance, consider this instead:

import re

space_cleanup = re.compile(r'\s{2,}').sub
for line in priceFile:
    # replace:
    #  text = text.strip()
    #  if not text: continue
    #  replace whitespace with single spaces
    #  if ...

    # with (note: the bound .sub takes the replacement first, then the string)
    text = space_cleanup(' ', line).strip()
    if not text:
        continue

This is less about the comparative performance gain and more about memory pressure; the current approach splits the text into lots of small strings and then builds a new string by adding them all back, whereas the regex approach combines the search with the mutation and does it efficiently in C inside the re library.

I also think it might be worth looking at one of the alternative ways of accessing SQLite - I've had my ass handed to me by python<->sqlite a lot lately.

Back when I wrote the ancestor of this:

            addItem((
                stationID, itemID, modified,
                demandCr, demandUnits, demandLevel,
                supplyCr, supplyUnits, supplyLevel,
            ))

that was possibly the best way to do it, but it might be worth using dataclasses with slots.
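
For illustration, a minimal sketch of the slotted-dataclass idea - the field names are just illustrative, not TD's actual schema, and slots=True needs Python 3.10+:

from dataclasses import dataclass

@dataclass(slots=True)  # slots=True (Python 3.10+) avoids a per-instance __dict__, cutting memory per record
class PriceEntry:
    station_id: int
    item_id: int
    modified: str
    demand_cr: int
    demand_units: int
    demand_level: int
    supply_cr: int
    supply_units: int
    supply_level: int

# addItem(PriceEntry(stationID, itemID, modified, ...)) instead of passing a bare tuple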

Also, in the version of Python I was originally using, the optimal way to join two strings was to use join. This is no longer true, and using f-strings may significantly help in a few places in the cache.py code.

- facility = "/".join((systemName, stationName))
+ facility = f"{systemName}/{stationName}"

Here's a colab notebook I made poking around some thoughts, but I'll try and take a bit more of a look in the next few days.

https://colab.research.google.com/drive/12qQJSVwv3_basV8b5bdfbMzQRMzgpLQk?usp=sharing

kfsone commented 2 months ago

For profiling: consider pyinstrument, or the MSVS Python performance analyzer, which is an excellent profiler. I tend to use pipx to run pyinstrument rather than installing it.

https://github.com/pypa/pipx https://github.com/joerick/pyinstrument
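
Alternatively, if wrapping the code directly is easier than the command-line route, pyinstrument's Profiler API looks roughly like this - run_the_import is a stand-in for whatever call you actually want to measure:

from pyinstrument import Profiler

profiler = Profiler()
profiler.start()
run_the_import()  # stand-in for the real work, e.g. the cache import call
profiler.stop()
print(profiler.output_text(unicode=True, color=True))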

eyeonus commented 2 months ago

It is definitely a RAM bottleneck. The server has 10GB RAM, and the prices file generated by spansh without using the maxage option comes out to nearly 6GB. Since that entire file is loaded into memory by the prices processor, and the entire list of items to insert or replace generated by processPrices is also loaded into memory, the combined weight of those two giant lists exceeds the RAM of the server. It's not a problem I encountered on my 64GB RAM box.

spansh commented 2 months ago

I was somewhat surprised at the speed, because it looked like you were processing less than 2 stations per second. If it's a memory issue I would try a different way. I would never recommend processing the entire file from memory; I personally use RapidJSON and stream the file, decompressing and parsing it on demand.

I'm aware that not every parser is capable of doing that so I created the files in such a way that it is easy to simulate such a mechanism.

Using the below pipeline you can break the file up into one JSON document per system per line, essentially simulating line-delimited JSON.

gzip -dc galaxy_stations.json.gz|sed '1d;$d;s/,$//'|<your script here>

That way you only need to keep one system in memory at a time.
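
For illustration, a minimal Python consumer for that pipeline, assuming it is fed on stdin and handle_system is a hypothetical stand-in for whatever per-system processing you want to do:

import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    system = json.loads(line)  # exactly one system document per line
    handle_system(system)      # hypothetical handler; only this one system is held in memory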

kfsone commented 2 months ago

It is definitely a RAM bottleneck. The server has 10GB RAM, and the prices file generated by spansh without using the maxage option comes out to nearly 6GB. Since that entire file is loaded into memory by the prices processor, and the entire list of items to insert or replace generated by processPrices is also loaded into memory, the combined weight of those two giant lists exceeds the RAM of the server. It's not a problem I encountered on my 64GB RAM box.

If you can run the process thru pyinstrument to get data on that it would help -- it might be useful to refactor the import into a generator pattern feeding the sql operations.

I'm a little confused - spansh is talking about json rather than the old .prices format I made for the transition from the excel spreadsheet TD started as to code :)

And - to catch me up - is this a dedicated import process rather than part of another operation? Is it something specific to the server or applicable to general users?

eyeonus commented 2 months ago

If you can run the process thru pyinstrument to get data on that it would help -- it might be useful to refactor the import into a generator pattern feeding the sql operations.

I'm a little confused - spansh is talking about json rather than the old .prices format I made for the transition from the excel spreadsheet TD started as to code :)

And - to catch me up - is this a dedicated import process rather than part of another operation? Is it something specific to the server or applicable to general users?

We have a new plugin which downloads galaxy_stations.json from https://spansh.co.uk/dumps, converts that into a .prices file, and then imports the resulting file using cache.importDataFromFile().

I think @spansh was misunderstanding where the slowdown is coming from, as @lanzz did write the spansh plugin to stream the json when making the .prices file.

It took my machine with 64GB RAM 30 minutes to import it into the DB.

Tromador's server, with 10GB RAM, took 15 hours to do the same thing, maxing out RAM and swap utilization the entire time. I ran a VM with 10GB RAM available to it and had the same issue.

I'll do the pyinstrument thing in the VM for you.

kfsone commented 2 months ago

Is the conversion-to-prices an integration or compatibility thing?

When I was first writing the store code, python's json implementation would take a pair of donkey balls, connect them up to a life-support system, clean them, groom them, talk to them, feed them caviar and champagne for a few months, make them feel a deep connection to its soul, then delicately sauteing them in a blend of 21 spices in a hand-crafted clay pot for a week before finally sucking them so hard a black hole would blush. "It sucked" just doesn't cover it :)

They replaced it with a fairly decent C-based implementation. There are other, more performant implementations for Python, but they're not ideal for in-the-wild use due to potential data abuses (you really don't want to use them when importing foreign data on your server).

kfsone commented 2 months ago

Another issue I think I see: processPrices was intended to be a generator method, so there was no list building, but it seems to be a list builder now. That's swings and roundabouts. With Python 3.9+ a generator might be slightly slower for smaller datasets, but building one giant list means Python repeatedly has to allocate a larger backing array and move all the existing elements into it as the list grows, which gets slower and slower, and you'll start ending up with massive memory wastage because they've made Python less aggressive about garbage collection.

As an experiment try doing this:

import gc
import time

gc_start = time.time()
# space this out across time, so it'll only do this for stations every 30 seconds or so.
if int(gc_start) % 30 == 0:
    count = gc.collect()
    gc_time = time.time() - gc_start
    print(f"gc collected {count} in {gc_time:.2f}s")

and then try a big import and see if you get any ludicrous gc blips. That would suggest we need to find ways to reduce object turnover.

eyeonus commented 2 months ago

Is the conversion-to-prices an integration or compatibility thing?

@lanzz wrote the plugin, so I don't know. My guess is it was done because the default import uses .prices files.

kfsone commented 2 months ago

@eyeonus I also suspect there's some hurt coming from SQLite. Take a look at https://rogerbinns.github.io/apsw/pysqlite.html - in particular, especially given the origins of the particular sqlite code 😳, it might be interesting to use https://rogerbinns.github.io/apsw/bestpractice.html#module-apsw.bestpractice

When I was experimenting with manipulating ~2GB of asset data for my current job and tried SQLite itself, I found apsw shaved between 5 and 50% off the performance cost of sql operations without me doing anything.

Which was nice.
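
If I'm reading the apsw docs right, adopting the bestpractice module is only a couple of lines - a sketch, not tested against TD, and the database path here is illustrative:

import apsw
import apsw.bestpractice

# Per the apsw docs, this applies the recommended settings (WAL mode, foreign key
# enforcement, etc.) to connections opened from here on.
apsw.bestpractice.apply(apsw.bestpractice.recommended)

connection = apsw.Connection("data/TradeDangerous.db")  # path illustrative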

kfsone commented 2 months ago

Is the conversion-to-prices an integration or compatibility thing?

@lanzz wrote the plugin, so I don't know. My guess is it was done because the default import uses .prices files.

I haven't looked at td in so long I didn't even know it had plugins!

If I'm reading the code correctly, the spansh plugin actually imports the data into the sqlite as it goes, but then also writes a .prices file too; I think that was part of the janky state that arose from the excel-to-sqlite transition, where I wanted to be able to recreate the sqlite database from source at any time in case something went pear-shaped, so you kinda always had to regenerate the .prices file after updating the database?

Is that correct?

eyeonus commented 2 months ago

Sort of. If it finds a station, system, or commodity in the json that isn't in the DB, it'll insert that into the Station, System, or Item table as appropriate and commit on the fly, but the actual listings themselves only get put into the .prices file.

kfsone commented 2 months ago

Would it be preferable to have the plugin do the whole job? I guess we'll still have to generate the prices file at the end, but we do that anyway:

read json -> write prices -> read prices -> update db -> write prices
vs
read json -> db -> write prices

also, I figured I'd spin it up and try running it locally and wondered what it was doing, so I may have shaved a few minutes off the initial import process, but I don't know if I'm maybe making it worse for lower-memory systems.

kfsone commented 2 months ago

The spansh plugin is spending a lot of time just creating the prices file, and taking seconds plural to write some systems. I ran it with an if systems > 2000: sys.exit(0) and ... it took 2000 seconds.

Oh: this seems to be a huge cpu drain - it is in the "ensure_commodity" code, and it updates ALL items. It looks like the order should be calculated once somewhere and then set on the item as it is added.

If you comment this out, it generates the .prices file about 200-2000x faster on my machine, but maybe this is just because I'm still doing my initial import.

        # Need to update ui_order
        temp = self.execute("""SELECT
                        name, category_id, fdev_id
                        FROM Item
                        ORDER BY category_id, name
                        """)
        cat_id = 0
        ui_order = 1
        self.tdenv.DEBUG0("Adding ui_order data to items.")
        for line in temp:
            if line[1] != cat_id:
                ui_order = 1
                cat_id = line[1]
            else:
                ui_order += 1
            self.execute(
                """
                UPDATE Item SET ui_order = ? WHERE fdev_id = ?
                """, ui_order, line[2],
            )

Also note that 'self.execute' is using a short-lived 'cursor', so it is committed once the cursor goes away.
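
One possible shape for the "calculate the order once" idea: a single window-function pass instead of an UPDATE per item. This is only a sketch - it assumes SQLite 3.25+ for ROW_NUMBER() and keeps the existing self.execute semantics:

# Recompute ui_order for every item in one statement rather than one UPDATE per row.
self.execute("""
    WITH ordered AS (
        SELECT fdev_id,
               ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY name) AS new_order
          FROM Item
    )
    UPDATE Item
       SET ui_order = (SELECT new_order FROM ordered WHERE ordered.fdev_id = Item.fdev_id)
""")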

Tromador commented 2 months ago

The server chews through the json->prices in just a few minutes. Sure quicker would be nicer, but it's the subsequent import of the prices which takes 15 hours. That's the real problem, 10 minutes of converting the json is trivial in comparison.

eyeonus commented 2 months ago

Oh: this seems to be a huge cpu drain - it is in the "ensure_commodity" code, and it updates ALL items. It looks like the order should be calculated once somewhere and then set on the item as it is added.

That only gets executed when inserting a new commodity to the Item table, so it shouldn't happen very often. In fact, if you build an initial DB using trade import -P eddblink -O clean,solo, it should only happen at most once, because there's only one commodity, a rare item from a megaship, that isn't already in the DB at that point.

kfsone commented 2 months ago

Doh, I wasn't specific: the ensure_commodity issue is the spansh plugin, not eddblink. I sped up the eddblink import because running the spansh plugin without eddblink first didn't seem to work.

I've pushed my progress so far to https://github.com/kfsone/Trade-Dangerous on the dev/kfsone-optimizations branch.

kfsone commented 2 months ago

The server chews through the json->prices in just a few minutes. Sure quicker would be nicer, but it's the subsequent import of the prices which takes 15 hours. That's the real problem, 10 minutes of converting the json is trivial in comparison.

There's several things going on there, though:

  1. simdjson is fast, but it has to allocate a new parser for each line iteration because the objects it generates linger, which builds up a massive number of objects and creates memory pressure; and Python 3.8-ish got optimized for short-lived processes and to be lazy about garbage collection, so simdjson is creating more performance overhead than it is saving.
  2. the question I'm doing a piss-poor job of asking is: why are we writing the prices file first, other than that .prices is how td usually imports data?

generating the prices file is also adding to memory/gc pressure, and parsing .prices is frighteningly more expensive than parsing json.

Like: I think the solve here is to break import into an abstract "tell me what the values are" with a default .prices front-end. Then, spansh would simply be an alternate front-end to that which doesn't have to go through the .prices file.
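
Something like this is the shape I mean - all of the names below are hypothetical, not TD's current API, and the column list in the insert is illustrative:

from typing import Iterable, Iterator, NamedTuple

class PriceRecord(NamedTuple):
    station_id: int
    item_id: int
    modified: float
    demand_cr: int
    demand_units: int
    demand_level: int
    supply_cr: int
    supply_units: int
    supply_level: int

def prices_frontend(path: str) -> Iterator[PriceRecord]:
    """Default front-end: parse a .prices file and yield records as it goes."""
    raise NotImplementedError

def spansh_frontend(path: str) -> Iterator[PriceRecord]:
    """Alternate front-end: stream the spansh json and yield the same records,
    skipping the .prices round-trip entirely."""
    raise NotImplementedError

def import_records(db, records: Iterable[PriceRecord]) -> None:
    """Single back-end: whichever front-end is in use feeds the same bulk insert."""
    db.executemany(
        "INSERT OR REPLACE INTO StationItem VALUES (?,?,?,?,?,?,?,?,?)",  # columns illustrative
        records,
    )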

Looking at initial pyinstrument data, the vast majority of the CPU use during the early parts of the spansh import is being spent in datetime.strptime.

Might be worth seeing if that can be replaced with https://github.com/m1so/fastdatetime or if it isn't possible to avoid the human-readable forms.

It might be worth asking the question: how much priority should human-readability in the backing format get at this point?

I chose this format as a way to make it easy to manually input data.

It might be worth adapting it so that it uses float utc time.time() values rather than going through datetime when you are not expecting a human to manipulate the data. When someone wants to edit a station or system, then do the extra work, but otherwise avoid datetime at all costs :)
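
An illustrative comparison (not TD code) of what that buys you - parsing the human-readable stamp versus carrying a float epoch value through the pipeline:

import time
from datetime import datetime, timezone

stamp = "2024-04-20 04:20:20"

# Current-style path: strptime is surprisingly expensive in a hot loop.
modified = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc).timestamp()

# Cheaper path when no human needs to read the intermediate data: keep the float.
modified = time.time()

# Only format back to text at the edge, when a human actually views or edits it.
human = datetime.fromtimestamp(modified, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")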

kfsone commented 2 months ago

If you run the spansh import from my fork/branch, it tells you how long systems and stations are taking to parse, and that time starts (on my machine) at about 22ms per system, 4ms/station. But it climbs quickly, while writing the prices file. My branch uses ijson instead of simdjson, which is technically slower except that it ends up using less memory and causing less aggressive gc.

But somehow, the process of writing the prices file is still ticking away at nearly 75ms per system by the time I get to 10,000 systems written. It shouldn't take anywhere near that long to be doing this.
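
For reference, the general shape of the ijson streaming (not the branch's actual code): the dump is a single top-level JSON array, hence ijson's 'item' prefix, and handle_system is a hypothetical per-system handler.

import gzip
import ijson

with gzip.open("galaxy_stations.json.gz", "rb") as fh:
    for system in ijson.items(fh, "item"):  # yields one parsed system dict at a time
        handle_system(system)               # hypothetical handler; earlier systems can be collected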

kfsone commented 2 months ago

[image attachment]

eyeonus commented 2 months ago

2. the question I'm doing a piss-poor job of asking is: why are we writing the prices file first, other than that .prices is how td usually imports data?

I don't know why it was written to create a prices file, as I didn't write it.

The reason it still does that is because I'm busy fixing other stuff.

kfsone commented 2 months ago

Just playing catchup - didn't know if there were more complex dependencies or if you and spansh had just had to work around my early-days penchant for the human-edit file.

eyeonus commented 2 months ago

I understand. I'm just saying that if you want to know why the spansh plugin writes a prices file, the only person that can answer it is @lanzz, who wrote the plugin.

kfsone commented 2 months ago

Roger that - just trying to get my bearings. Looking around the code, I'm thinking the following, slap me if I'm being stupid (so, have your hand ready, I'm often stupid):

The plugin system was supposed to be a way for exchanges like maddav's and spansh's to "plug in" - literally to feed the db backend directly.

But there's a lot of massage and curation that needs to happen that I didn't set up the tradedb part to do reliably - so, just writing a .prices file ends up being a very-strongly-defined interface and an easy way for someone writing a plugin to get the job done.

It sucks as a solution because the json loads like lightning, and with ijson it introduces fewer bottlenecks and less ram use. But in the case of spansh's plugin, he's done the most obvious and logical thing - there's no criticism of anyone except me here - he's loading the data in, reading the DB to find his bearings, and then translating it into .prices format, from which we then do a second layer of translation which is freakishly expensive.

Seriously - the .prices format is only like that because it looked almost exactly like the market screen back then.

"@ SOL/IKDEA" <- using '@' (at) in the literal sense "+ Food" <- each category had a +/- button to expand/collapse, and if you edited them in sublime text or notepad++ it would let you collapse the indented child section just like the UI :) " Nothing to see here 0 100 0? 0? 2024-04-20 04:20:20"

I mean .. wth, Oliver, there are spaces in the name. That's gonna make parsing expensive. And human-readable dates are nice and human readable, but they're an abomination for machine readability :)

So -- I think I need to add a better pure data API over TradeDB, make it easy and obvious to use, and convert the .prices format into the default, primary user of that, while adding a second, json-or-json-like version, or a yaml form (yaml would be a terrible choice for large dumps because of the parsing overhead, but great for smaller / human-facing dumps).

It needs to be surfaced as a few aspects:

  • backend data access (get * from stations)
  • schema'd data access (worries about caching, adding a system when you add an item for a station in it),
  • data curator/normalizer (handle things like mk -> Mk. ),
  • translator plugin

No data ever gets in without going through a translator plugin - it just can't; but the translation might be incredibly cheap the closer it is to our schema.

If you export your local db, reset, and clean-reimport locally, it can skip the curation for speed; curation should be pluggable and mixable, so that you don't - for instance - have to apply all the defunct maddav filters when importing from spansh, and you don't have to apply eddb fixes when reimporting your own local td prices dump, etc.

There's a chunk of that work I don't feel like reinventing wheels for, so I may take a look at having gemini/devin/chatgpt tear me up an SQLAlchemy with Alembic implementation of the current schema. From that, I can apply some learnings from working with python and larger data sets to optimize the schema in ways that will benefit run-of-the-mill trade calculation startup times etc.

It will also make it possible for power users to switch backend. SQLite stays default supported backend, but with the ability to use PostgreSQL or MySQL if you're running td on a server...

That's not the motivation, though; the motivation is that working thru SQLAlchemy (or pony or peewee, possibly with apsw) should greatly simplify the database-facing code. Instead of writing raw SQL queries, you write fairly naive sql-like python code that the ORM will translate into way more efficient sql operations based on the back end.

We're not dealing with dynamic runtime queries - we're dealing with fairly simple and consistent patterns for shaping a data store we use for bulk-loads rather than actual SQL queries. So the usual overheads/concerns I'd have about using an ORM don't apply - this is what they're actually well suited to.

If any of this sounds wrong, lmk - I'm the old fart here, you've been fighting in the trenches and dealing with things so don't assume I know anything....

Tromador commented 2 months ago

As a random note, massive thanks Oliver for coming back and revisiting this old project. Eyeonus has been doing super work, but "I don't know why Oliver did that, you'd need to ask him" is a phrase that comes up moderately often.

You were talking about server SQL solutions. I am more than happy to provide mariadb (as the open mysql fork is now known) or postgres or even some other non-SQL daemon-powered db solution, but my understanding is you would have to go through the code and convert the sql calls (or have some kind of ifdef for each implementation) to comply with the idiosyncrasies of each particular flavour of db solution. You kinda make it sound like it could be very much simpler. Have I misunderstood somewhere?


eyeonus commented 2 months ago

That all sounds super awesome. Trom's been complaining about how long it takes to regenerate TradeDangerous.prices whenever the cache is updated, so I'm sure he'd agree with me that an updated .prices format would be more than welcome.

On my end, there are also a few changes I want to make to the DB, such as removing System.name and Station.name from RareItem, especially now that there's a blasted megaship with a rare item, and instead using the station_id of the station.

I also want to fully move to using the fdev_id for everything that has one, which is basically everything? At a guess, I think the only tables that don't have things with a fdev_id are "Added" and "Category".

I"m currently working on a problem I found where the readPrices errors with a "STAR/System" not found message after spansh does its thing, which is weird because spansh should've added all the stations it puts in the prices file it makes into the DB, so tracking down what's happening is proving to be a headache.

eyeonus commented 2 months ago

Quite a few changes to TD, one of which is not doing the json -> prices -> DB thing in spansh, which helped with the RAM issue tremendously, but there is noticeable slowdown after a few thousand systems, so switching to ijson is probably a good idea.

kfsone commented 2 months ago

Hah, sometimes there's a good answer to "why did oliver do that", but it's usually one of two reasons:

If the reason is the first, and I didn't comment it, then see the second matter ;)

@Tromador yes, you're right, it would - but that's why the solution is to move to an ORM like SQLAlchemy, which abstracts that away. I would not have believed that it could deliver viable performance in Python, but at this point I have an experiential basis to believe it would actually work well for how TD uses the database.

And honestly, SQLite is a terrible storage engine for what we need.

@eyeonus I f'ing hate trying to modify database schemas, especially when your primary users aren't going to be server administrators with CI and support teams but people who didn't have python installed before you foisted it on them. (And I've learned some things about that at Super Evil - specifically, how to give our non-technical users like artists a binary .app or .exe that packages the code + a specific python + libs + mods, so that they can no longer screw me over by being on a mac with a broken system python, an xcode python or three installed, that one time they used ports which installed python 3.14 which isn't out yet, and different brew dependencies that have absolutely soiled everything beyond repair with conflicting python-version requirements.)

This is why I'm thinking about switch-to-ORM. Specifically, SQLAlchemy.

Specifically:

  1. Implement the CURRENT schema with its warts, zits, and "jesus oliver did you have a stroke?" flaws,
  2. Enable Alembic schema management,
  3. Deploy and ensure it's copasetic,
  4. Fix the warts and zits and Alembic will bring the users up to date,
  5. Not be trapped with the original SQLite schema because the thought of changing it can scare flies from s**t.

https://docs.sqlalchemy.org/en/20/orm/quickstart.html

# pseudopy - an illustrative sketch, not real model definitions
from sqlalchemy import select  # methods like select etc are from here
from sqlalchemy.orm import Session

with Session(engine) as dbs:
    sol = System(name="Sol", coords=(0, 0, 0))
    for name, ls in (("Mercury", 193.5), ("Luna", 510), ("Phobos", 722.4), ("Titan", 4810.11)):
        sol.stations.append(Station(name=name, ls_from_star=ls))
    dbs.add(sol)  # <- notice what I'm *not* adding by hand? the stations ride along via the relationship
    dbs.commit()

session = Session(engine)

query = (
    select(Station)
    .join(Station.system)
    .where(System.name == "Sol")
)
for station in session.scalars(query):
    print(f"{station.system.name} has {station.name}")

sol = session.scalars(select(System).where(System.name == "Sol")).one()
phobos = next(s for s in sol.stations if s.name == "Phobos")
sol.stations.remove(phobos)  # crashed into mars; the ORM issues the DELETE for us
session.flush()

print(f"Oops, Sol now has {len(sol.stations)} stations")

It's really that last line, above all else, that's where we want to be - having the database reason about data in a similar way to how we expect to reason about it at the tradedb level, in terms of in-memory-object-like references. That means we can quit sweating SQLite minutiae during imports etc and see what the code is doing, rather than trying to read around how it's trying to manipulate the database into doing what it needs so it can do what it thinks it wants to do.

kfsone commented 2 months ago

The other option would be to ditch an SQL Database entirely and use something more like a key-value store for the storage, and then entirely require everything to go through tradedb in-memory.

I don't see (perhaps for not looking hard enough) any indication that anyone is really actually using the SQLite database. I added it because I wanted to focus on building TD rather than researching storage options, and it was nice to have a pre-existing command line to view the data when TD itself didn't yet have a way to query a particular axis.

Rather, I think that people have resorted to using the sqlite3 api because they already know it, and anything you can do thru the tradedb interface you can also do by replicating the sql the tradedb api would eventually do, and if you know sqlite3 why bother using a different api?

Unfortunately, we're not really using sqlite, we're just using ".db storage": so - as almost every plugin has discovered to its author's annoyance - our data doesn't quite work like that (it's not ok to auto-commit every insert, update, or select; it's not great to be blindly doing an update on every field and forcing the engine to rewrite entire records instead of updating them, etc, etc - all of which are low-level implementation details that tradedangerous touches on due to its weird state of being a non-using user of sqlite).

kfsone commented 2 months ago

I want to second Trom's applause to eyeonus: a- I'm not here to stake any kind of claim, unless it's idiot-in-corner-with-dunce-hat-for-why-did-you-do-that.

I have a tendency to speak/type in absolutes; it's not because I think I know everything, but it probably reads that way; It's not because I think I know better, again probably reads... It's because at 52 I find that if I stop to question everything I'm uncertain about, I seem to do it in a way that makes other people think I'm questioning them rather than my knowing, and also I never get anything done :)

You've done and are doing a rockstar job, eyeonus :)

aadler commented 2 months ago

Hear Hear!

Tromador commented 2 months ago

At 55 can I claim to be the oldest fart? Also I claim the title of biggest glutton for punishment, given I remember fixing some of Oliver's mistakes when TD was shiny and new, and have been with the project ever since.


eyeonus commented 2 months ago

With TD 10.16.9, the spansh plugin is feature complete compared to the old json -> prices -> DB method, but goes much, much faster and uses nowhere near the RAM. The server is up and puttering along without complaints again, and the eddblink plugin works for clients just like in the EDDB days.

It took a lot of effort to get to this point, and I appreciate the support from everyone.

I'm apparently the young pup at 43.

We still need to figure out how to populate the other tables. At this point, we have Added, Category, and RareItem templated, and we have Item, Station, StationItem, and System built using the spansh data. The rest are empty until we have some means of generating them using the galaxy_stations.json dumps via the spansh plugin and/or the EDDN messages via the listener, preferably both.

I'm not sure which would be easier to implement, but I'm taking a much-needed, bug-fix-only break for at least a week, so if anybody wants to take a crack at it, I will kiss your babies.

eyeonus commented 2 months ago

BTW, what are FDevShipyard and FDevOutfitting?

what's the difference between StationItem and StationSelling/StationBuying?

The latter are views on the former.
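
(For anyone else reading along: a view is just a stored SELECT over the base table, so the relationship is roughly the following - the column names and filter here are illustrative rather than the exact TD schema.)

import sqlite3

db = sqlite3.connect("data/TradeDangerous.db")  # path illustrative
db.execute("""
    CREATE VIEW IF NOT EXISTS StationSelling AS
        SELECT station_id, item_id, supply_price, supply_units, supply_level, modified
          FROM StationItem
         WHERE supply_price > 0
""")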

kfsone commented 2 months ago

BTW, what are FDevShipyard and FDevOutfitting?

what's the difference between StationItem and StationSelling/StationBuying?

The latter are views on the former.

Could. Not. See. "VIEW", and my machine went absolutely insane when I tried doing a "COUNT()" on StationBuying. Had to physically power it off. Lol.

eyeonus commented 2 months ago

I had a similar issue of not seeing something obvious in retrospect earlier today, so I totally understand.

bgol commented 2 months ago

BTW, what are FDevShipyard and FDevOutfitting?

As the comment in the SQL says, they are for mapping purposes between TD and EDDN. Because in the early days everybody was using their own naming, as there was no access to the Frontier API or a journal. https://github.com/eyeonus/Trade-Dangerous/blob/release/v1/tradedangerous/templates/TradeDangerous.sql#L259-L274

The edcd plugin can also be used to keep them in sync.

eyeonus commented 2 months ago

I did not even see that. Thanks!

kfsone commented 2 months ago

It's like the four doctors in here :) :wave: @bgol :)

bgol commented 2 months ago

Yeah, hi Oliver, nice to hear from you. Now, where is madavo? :) (Didn't expect this issue to become an old guys chat ;)

lanzz commented 2 months ago

Sorry I haven't been following up on the developments in this thread.

Yes, the reason I implemented it to generate a .prices file was because that seemed to be what the plugin system itself was expecting. import_cmd.run() allows the plugin to either pass control back by returning True, or to cancel the default implementation by returning anything else, so it seemed like the plugin was expected to do some of the loading up to a point and then leave the actual import to the default implementation, so that's what I went with. I also did not want to reproduce lots of complexity in updating the database directly, as I couldn't immediately find any nicely reusable way to do that in the existing logic (I might not have looked too hard).

aadler commented 2 months ago

Yeah, hi Oliver, nice to hear from you. Now, where is madavo? :) (Didn't expect this issue to become an old guys chat ;)

Chiming in for the 50+ crew 👴

eyeonus commented 2 months ago

It's like the four doctors in here :) 👋 @bgol :)

Lol, only if I'm Tennant. Also bowties are not cool.

Sorry I haven't been following up on the developments in this thread.

Yes, the reason I implemented it to generate a .prices file was because that seemed to be what the plugin system itself was expecting.

No worries. I've changed it to go directly into the database myself; we all kind of assumed that was the reason why you did it that way. The RAM usage was just too much for the server doing it the way you had it.

Returning False means that the plugin handled everything itself. There's no expectation either way, it's just to give the plugin author the ability to go either way.

Your work is sincerely appreciated, after all, you did the hard part, all I did was some refactoring to make it less RAM intensive.
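
For future plugin authors, the return-value contract described above looks roughly like this - a sketch only, with the import path and helper methods assumed rather than taken from TD:

from tradedangerous.plugins import ImportPluginBase  # import path assumed

class MyImportPlugin(ImportPluginBase):
    def run(self):
        data = self.download_and_parse()  # hypothetical helper
        self.write_into_db(data)          # hypothetical helper
        # Returning False tells TD the plugin handled the import itself, so the
        # default .prices import is skipped; returning True hands control back
        # to the default implementation.
        return False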

kfsone commented 2 months ago

@lanzz the simdjson usage seemed to run into a problem I hit at Super Evil recently with our python-based asset pipeline: recent optimizations in CPython mean it doesn't garbage collect so often. As you'd spotted, you had to allocate a new simdjson parser each loop or else it complained you had references. Also, the td code is full of make-a-string-with-joins, because that was the optimal way to join two short strings back in those versions of python. Now it seems to be bad, because with the aforementioned garbage-reduction the likelihood that python will actually have to allocate an object for the string is extra high. le sigh.

That got me to looking at ijson, and I have a local version using it that is clearly slower than simdjson for small loops, but starts catching up by the page (4kb).

(please note: I live in a near perpetual state of 'wtf python, really, why?' these days -- if any of that seems to be pointed at anything but python and/or myself, I derped at typing)

spansh commented 2 months ago

We still need to figure out how to populate the other tables. At this point, we have Added, Category, and RareItem templated, and we have Item, Station, StationItem, and System built using the spansh data. The rest are empty until some means of generating them using the galaxy_stations.json dumps via the spansh plugin and/or the EDDN messages via the listener, preferably both.

If you let me know what the missing/extra data is, I can point you to the fields if they're available in the dump, and/or to where I normally source that data from (when/if I put it into my search index) if they're not.

eyeonus commented 2 months ago

@spansh All the tables not currently being populated:

TABLE Ship:
    ship_id  = fdev_id, 
    name, 
    cost
TABLE ShipVendor: 
    ship_id = fdev_id ( ref: SHIP.ship_id ), 
    station_id = fdev_id ( ref: STATION.station_id ),
    modified = format( "YYYY-MM-DD HH:MM:SS" )
TABLE Upgrade: 
    upgrade_id = fdev_id, 
    name, 
    weight, 
    cost
TABLE UpgradeVendor: 
    upgrade_id = fdev_id ( ref: UPGRADE.upgrade_id ), 
    station_id = fdev_id ( ref: STATION.station_id ), 
    cost, 
    modified = format( "YYYY-MM-DD HH:MM:SS" )

It'd also be nice to have a way to automatically add new rares to the RareItem table:

TABLE RareItem: 
    rare_id = fdev_id, 
    station_id = fdev_id ( ref: STATION.station_id ), 
    category_id = ( ref: CATEGORY.id ), 
    name, 
    cost, 
    max_allocation, 
    illegal = ( '?', 'Y', 'N' ), 
    suppressed = ( '?', 'Y', 'N' )
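
Not speaking for anyone's roadmap, but if the spansh dump is the source, a per-station pass over its shipyard/outfitting sections could feed ShipVendor/UpgradeVendor roughly like this. The dump field names here ('shipyard', 'ships', 'shipId', 'outfitting', 'modules', 'moduleId', 'updateTime') are my assumptions from memory and need checking against an actual galaxy_stations.json:

def vendor_rows(station: dict):
    """Yield (table, row) tuples for one station dict from the spansh dump.
    All field names are assumptions to verify against the real dump."""
    station_id = station.get("id")  # assumed to line up with the fdev station_id
    shipyard = station.get("shipyard") or {}
    for ship in shipyard.get("ships", []):
        # ShipVendor: ship_id, station_id, modified
        yield "ShipVendor", (ship.get("shipId"), station_id, shipyard.get("updateTime"))
    outfitting = station.get("outfitting") or {}
    for module in outfitting.get("modules", []):
        # UpgradeVendor: upgrade_id, station_id, cost, modified (cost isn't in the dump, left None)
        yield "UpgradeVendor", (module.get("moduleId"), station_id, None, outfitting.get("updateTime"))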