eyeonus / Trade-Dangerous


SEVERE performance issues doing EDDBlink import when DB is large. #126

Open eyeonus opened 2 months ago

eyeonus commented 2 months ago

Not sure what the problem is, but there's a huge difference in processing time when doing an import using the EDDBlink plugin when the database file is large vs. small. To wit:

TradeDangerous.db file size: 6.5 GiB
listings-live.csv file size: 10.3 MiB
Time to completion: ~24 minutes

NOTE: Processing market data from listings-live.csv: Start time = 2024-04-23 11:14:51.958359
#Getting total number of entries in listings-live.csv...
#Getting list of commodities...
#Getting list of stations...
#Processing entries...
NOTE: Finished processing market data. End time = 2024-04-23 11:38:17.200988

versus:

TradeDangerous.db file size: 43.7 MiB (empty StationItem table, otherwise identical to the above database)
listings-live.csv file size: 10.3 MiB (same file as above)
Time to completion: ~7 seconds

NOTE: Processing market data from listings-live.csv: Start time = 2024-04-23 12:20:00.816731
#Getting total number of entries in listings-live.csv...
#Getting list of commodities...
#Getting list of stations...
#Processing entries...
NOTE: Finished processing market data. End time = 2024-04-23 12:20:07.871285

Everything is exactly the same in both cases except for the size of the StationItem table in the database being imported to.

eyeonus commented 2 months ago

The problem is cumulative. During testing, importing listings.csv (a 2.5 GiB file) into an initially empty database, getting from 10% to 12% took under a minute, while getting from 70% to 72% took just under 10 minutes, in the same import.

Tromador commented 2 months ago

Can I suggest we expand this concept:

The server side works. The client side works (apart from one warning, because a Rares station is in lockdown and doesn't appear in market data).

It works.

So I suggest concentrating on optimization rather than adding any new features: memory use first, wherever that remains an issue, and then speed. Draw a line under it and call it feature complete until we have something which can process imports and output runs in a more reasonable timeframe.

kfsone commented 2 months ago

I saw we recently enabled incremental vacuuming and some other sqlite options to improve performance? It's probably worth turning those off locally and investigating SQLite's features for reporting on what the vacuum will do (keywords: sqlite3, pragma, vacuum, analyze, optimize) which may explain the issue - maybe there's an index that needs tweaking or scrapping.

You also made it more aggressive about flushing the cache (closing the db, even) yesterday, to counter the memory bloat that my simple executemany was causing. So this makes it smell like a transaction issue - either an absence of them at some point or an unexpected nesting of them.

Look for manual transactions (execute "BEGIN"/"COMMIT") and switch those to context-managed cursors so that the sqlite3 lib isn't taken by surprise.
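
Something like this is the shape I mean - a minimal sketch against a throwaway in-memory table, not TD's actual schema:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE demo (station_id INTEGER, item_id INTEGER)")

# Instead of manual transactions:
#   db.execute("BEGIN"); ...; db.execute("COMMIT")
# use the connection as a context manager: it commits if the block succeeds
# and rolls back if it raises, so sqlite3 is never surprised by a transaction
# it didn't start itself.
with db:
    db.execute("INSERT INTO demo (station_id, item_id) VALUES (?, ?)", (1, 2))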

I also strongly recommend looking at the apsw module; it's designed as a drop-in replacement for sqlite3, but less of it is implemented in Python, so it has the potential to be faster - though I haven't benchmarked it with our particular use case.

kfsone commented 2 months ago

https://stackoverflow.com/a/17794805/257645 this is intriguing; it would potentially give us a great way to get the data into the database fast and let sqlite worry about finalization - make it actually earn some of its keep: https://stackoverflow.com/a/3167052/257645

I strongly recommend using executemany whenever there's a loop, if possible. If you have to cap it, that's fine, but you save a bunch of overhead.
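
A capped executemany could look roughly like this (a sketch; the batch size and helper name are arbitrary, not TD's actual code):

import sqlite3
from itertools import islice

def chunked_executemany(db: sqlite3.Connection, stmt: str, rows, batch_size: int = 10_000):
    """Feed rows to executemany in capped batches: one transaction per batch,
    bounded memory, and far fewer per-statement round-trips than execute() in a loop."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        with db:                    # commit each batch as its own transaction
            db.executemany(stmt, batch)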

Did you try running the imports with pyinstrument?

python -m venv venv
. venv/Scripts/activate.ps1   # or . venv/bin/activate on mac/linux
pip install -r requirements/dev.txt
pip install pyinstrument
pip install -e .
cd tradedangerous
pyinstrument ./trade.py import ...

at the end it will give you a bunch of blurb and you can customize how it does that, but I tend to run it like the above and then grab the session from the last two lines:

To view this report with different options, run:
    pyinstrument --load-prev 2024-04-24T10-19-42 [options]

and run something like:

> pyinstrument --load-prev 2024-04-24T10-19-42 -r html >speed.html  # local output
> start speed.html
# or
> pyinstrument --load-prev 2024-04-24T10-19-42 -r speedscope >speed.scope
# and upload that file here: https://www.speedscope.app/

Sample speed.scope from an eddblink -O solo run: speedscope.zip. Unzip it, go to https://www.speedscope.app/, and drag the file onto the browse button. Use the buttons in the top left to change the type of view.

eyeonus commented 2 months ago

The only transaction being performed when starting from an empty database is:

        listingStmt = """INSERT OR IGNORE INTO StationItem
                        (station_id, item_id, modified,
                         demand_price, demand_units, demand_level,
                         supply_price, supply_units, supply_level, from_live)
                        VALUES ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )"""

Removing the "OR IGNORE" part doesn't have any effect.

eddblink used to gather the list of all changes to be made, appending the values for each change to the list listingsList inside the for listing in listings loop, and then run executemany(listingStmt, listingsList) after exiting the loop. That resulted in about 6 hours (on my machine) of nothing appearing to happen while the database was being updated.

Changing it to what it is now, where each listing is inserted during processing with execute(listingStmt, {values}), still took about 6 hours, but at least there was some indication of progress.

The pragmas that were introduced knocked it down to a little over 2 hours.

eyeonus commented 2 months ago

I've been working on TD for a week straight now, during any free time I had outside of work and sleep. I'm taking a break for a while.

If it ain't a bug, it can wait or someone else can tackle it, is my current mode.

Tromador commented 2 months ago

The pragmas were me. The WAL journal has a massive speed benefit on Linux, though it seems less so on Windows, and the vacuum keeps the db size to a minimum. The server is set up to vacuum every couple of days, and I do want it to do incremental, as a full vacuum takes a chunk of time. That said, the first vacuum and optimise on the server's db reduced it from 6 GB and change to 4 GB and change, so it's definitely worth it.
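
For reference, the kind of settings in question look roughly like this (illustrative only; the exact pragmas and values TD uses may differ):

import sqlite3

db = sqlite3.connect("TradeDangerous.db")         # path is illustrative
db.execute("PRAGMA journal_mode = WAL")           # write-ahead log journal
db.execute("PRAGMA auto_vacuum = INCREMENTAL")    # only takes effect on a new db, or after a full VACUUM
db.execute("PRAGMA incremental_vacuum(1000)")     # reclaim up to 1000 free pages
db.execute("ANALYZE")                             # refresh the query planner's statistics
# db.execute("VACUUM")                            # full vacuum: rewrites the whole file, much slower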

kfsone commented 2 months ago

I hope I didn't sound like I was handing out assignments :( if you can run the pyinstrument command on the server and snag a .scope file it might be really interesting.

eyeonus commented 2 months ago

I hope I didn't sound like I was handing out assignments :( if you can run the pyinstrument command on the server and snag a .scope file it might be really interesting.

I hope I didn't sound like I was feeling put upon. I'm just so tired from working on this for a week solid and need a break for a while.

I'm running pyinstrument; I'll post the results for you if you want to have a look at it.

It definitely seems to be more an issue on my *nix box than on Tromador's Win one.

kfsone commented 2 months ago

Jonathon first, trade calculator for online game later, man.

TD is a vague distant memory for me - I honestly thought one of you added the plugin system after I'd ambled off, and I was kinda wondering wtf the saitek crap was until I looked at one of the files and saw the first line comment. cough

Point: I don't know what the priorities are or the full details of what's wear and what's tear. If you think I can productively be of use somewhere specific, you'll need to point me, and while I'm human and will get my knickers in a twist or my butt hurt in the moment when it comes up that I'm responsible for any particular cretinous decision or demonstration of derp, I'm not a diva: my first job was for a UK "national" contract software company as an embedded engineer, and when the company went belly up I got offers from each customer I'd worked for to go finish development. But I either didn't really "get" the companies or feel adequate, except for one, in my home town, on the fishing docks. So I accepted their offer, which came with a caveat: they weren't big enough to pay for a programmer; they were a cold-store/blast-freezer/fish-packing plant, so if it got busy, I worked packing fish, loading/unloading 50lb boxes of frozen shark bellies or halibut. Yeah, I've worked some fancy places since, but I didn't stay.

kfsone commented 2 months ago

In some countries, TD is almost old enough to start putting bread on the table :)

[screenshot]

Tromador commented 2 months ago

It definitely seems to be more an issue on my *nix box than on Tromador's Win one.

We aren't on identical hardware though, so although I agree from our testing that Linux and Windows do seem to behave differently, don't forget I have lots of CPU; on the other hand you have more RAM than me. It's therefore hard to draw comparisons based purely on OS. The server is another game entirely, running an archaic version of CentOS, such that I have had to build TD (from source) its own special environment to run in, with its very own sqlite3 and OpenSSL libraries and a super special Python just for it, along with all the environment shenanigans to make it not use the outdated system versions. Not to mention it's kinda short on RAM - but then again it's a free vhost, and server hosting isn't cheap.

I do tend to waffle when I am tired; point being, there are many variables outside of OS, particularly hardware.

Tromador commented 2 months ago

In some countries, TD is almost old enough to start putting bread on the table :)

I quite deliberately put Est 2015 at the start of the thread on E:D forums. Other tools may be quicker, flashier whatever, but nothing else can do what we can and nothing else has our stubborn will to survive.

eyeonus commented 2 months ago

I don't know what the priorities are or the full details of what's wear and what's tear. If you think I can productively be of use somewhere specific, you'll need to point me,

This is my list of things needing doing at some point. The Niceness value is how important I consider each item, with lower being more important, just like *nix CPU priorities.


1) N = -20

Anything you can do to improve import performance is welcome. On my box, with the current TD, both spansh and eddblink take a bit over 3 hours to do an import. Obviously different machines will take different times, but ideally anything but a potato should have sub 1 hour import duration. My box has an 8-core CPU with 64GB RAM, so it's not a potato. (Also I checked the drive, and it has 45MB/s+ write speeds, but neither import ever went above 5MB/s, so I highly doubt that's the bottleneck for me.)

2) N = 5

The spicing up of the visuals is certainly welcome, I like what you've done there and would be pleased to see the visual improvements you've done to spansh done to eddblink (and the rest of TD) as well.

3) N=-10

TD needs to be able to build the other Tables. We have two reliable sources, those being the daily dumps from Spansh, and the EDCD lists, such as rare_items.csv

We currently use the Spansh data (and EDDN via the listener) to fill out System, Station, Item, and StationItem, and could potentially use it to fill out Ship, ShipVendor, Upgrade, and UpgradeVendor.

We currently use EDCD to fill out FDevOutfitting and FDevShipyard, and could potentially use it to fill out RareItem, Item, and Category.

There's obviously some overlap, so in the case where we can use either, I think EDCD is the better option.

Spansh (and listener): System, Station, StationItem, ShipVendor, UpgradeVendor, Ship, Upgrade
EDCD: FDevShipyard, FDevOutfitting, RareItem, Item, Category

That leaves Added as the only one that can't be easily updated, but as that's only used in System, to indicate (I think?) which version of Elite Dangerous a system was added to the galaxy in, I feel like it should individually get N=20.

Getting everything populated via Spansh means updating to utilize the information provided by the dump that is currently being ignored, as well as pulling from the EDCD sources when building the DB from scratch or they've been updated.

Getting everything populated via the TD listener means listening to the other EDDN messages besides just the commodity ones, and processing them accordingly. Since the listener uses Spansh, this only matters for things which change more often than once a day. The eddblink plugin pulls from Trom's server, so getting this into the listener automatically gets it into eddblink as well.

4) N=-18

Rescue Ships are EVIL to TD, because not only do they move, they also change their name, as they are named after whatever station they are currently rescuing. This results in duplicate entries in the prices file whenever two different rescue ships go to the same station. For example, "Rescue Ship - Arc's Faith" has two entries with a different station_id because it's been visited by two different rescue ships. The FDev Id for the ships are different, and searching EDSM, I found one of them is listed as being at HIP 17694 \ Hudson Observatory and the other is at Didio \ Laumer Orbital. The next Spansh dump should have those updates, but even with that, both of the rescue ships are now named differently since neither one is at Arc's Faith anymore. Renaming the station needs to be done so that users don't go looking for the wrong station, and maybe adding the FDev ID to the name to prevent the duplicate station error is a good idea?

5) N=12

Working GUI, as in feature complete with TDHelper. If you have TD installed via pip, you can run tradegui and see the awful attempt I've got so far.

6) N=20

Update the wiki to account for changes made to TD since the last time it was updated.


I can probably get all that done myself, but I know it'll take a long time and be coded poorly, because I'm entirely self-taught on this stuff. I'm currently in college for computer science, so I'm better at this stuff than I was when I picked up the torch, but I'm still learning, and most of it I'm learning the hard way.

eyeonus commented 2 months ago

Which reminds me, I set this to run before I left for work. speed.tar.gz

Tromador commented 2 months ago

I'd venture to add. Import performance is the biggest problem as eyeonus says. No doubt. But...

I reinstalled E:D seeing as we had a working TD. I guess I was feeling nostalgic. So I have been using it in anger, and there are in fact two things which leave me twiddling my thumbs. The first is absolutely importing data. The second is exporting data, by which I mean getting query results from trade run. The size of the dataset has grown in just the same way for making queries, so the tree of possibilities which must be examined has become vast. This can be mitigated by filtering down - excluding by age for example, or no Odyssey is a good one - but if you want to do a full check with a huge hold, enough money to fill it with whatever, and a jump range sufficient to cross the bubble? It's not usable. I have taken to lying about my real jump range to constrain the tree size.

So yes. Import speed does need optimization, but so does the output.

kfsone commented 2 months ago

@eyeonus The visual stuff would ultimately be about code reduction - if rich had existed 10 years ago I'd not have written progressbar etc. I was looking at the transfers code, and it feels like half that module is just about trying to get "requests" imported. That's covered now by the requests.txt file, right?

GUI I always figured that would end up being web-based but the web-app concepts at the time weren't really much use and I had reasons not to really want to get too much into html. But perhaps 'gradio' would be an option.

So that brings us to the meaningful work, and I guess I always suspected TD would find itself struggling for trying to be a cold-runner rather than some kind of service.

Just thinking out loud here: we pay an entry fee for storing our data. Every operation we need to do has to be expressed textually as SQL, parsed by the SQLite code; SQLite has to plan what to do and then execute on it. That's a lot of CPU cycles just to start reading. The ordinary payoff is a sort of storage JIT: tell the engine what you want in the abstract and it can figure out the fastest way to do it. But we're not doing elaborate queries with sophisticated multi-table joins, groups, sorts, wheres, etc.

I mean, SELECT * FROM System is actually better for us than SELECT * FROM System ORDER BY .... We don't have geospatial indexing turned on, and the SQLite options there would hurt us anyway.

Spitballs, consider each individually not as a sequence:

for system in systems:
    INSERT system INTO SystemIncoming
    for station in stations:
        INSERT station INTO StationIncoming
        executemany( INSERT item INTO ItemIncoming FOR item in station.items )
    if thread: thread.join
    thread = Thread( move incoming -> )

Using the temp table means we can avoid either building our own list or the cost of per-item queries or database indexes while we're doing this work.

We can also parallelize the work here; we fully populate the temporary table without index overheads etc, and then have sqlite move things over. Now we're making SQL pay for itself because the query planner/executor can use its own internal data about quantities and stuff to do the transfer.

That's the theory - I don't know if it's true for SQLite.
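
A minimal, self-contained sketch of that staging-table idea (the cut-down schema and the "Incoming" table name are just illustrations, not TD's real layout):

import sqlite3

db = sqlite3.connect(":memory:")
# Stand-in schema: the real StationItem has more columns and partial indexes.
db.execute("CREATE TABLE StationItem ("
           " station_id INTEGER, item_id INTEGER, modified TEXT,"
           " PRIMARY KEY (station_id, item_id))")

# 1. Bulk-load into an unindexed temp table: no PK/index maintenance per row.
db.execute("CREATE TEMP TABLE StationItemIncoming AS SELECT * FROM StationItem WHERE 0")
rows = [(1, 10, "2024-04-24"), (1, 11, "2024-04-24")]    # hypothetical data
db.executemany("INSERT INTO StationItemIncoming VALUES (?, ?, ?)", rows)

# 2. One set-based merge, where SQLite's planner gets to do the heavy lifting.
with db:
    db.execute("INSERT OR REPLACE INTO StationItem SELECT * FROM StationItemIncoming")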

We intuitively get why a movie ticket costs $25 when renting the film costs $5, but TD is paying $25 to rent. SQLite's benefit is that you don't need a server; its con is that you don't get a server doing all the heavy lifting in another thread/process or possibly on a different CPU. That's a fundamental part of the hitch here. We've turned on indexes and joins. We've started doing vacuums, and the problem is they can only happen while there's a process running - and we're it.

We could:
a) have TD split itself in two: a TradeAdder that implements TradeDB and soaks up the costs of keeping the database fresh, and a TradeGecko that shuttles stuff back and forth between the user and the backend; not entirely implausible, mostly separating concerns and building facades/bridges
b) make it possible, or even a default, to have a persistent database spun up one of several ways - docker, podman, vm, local or remote url - and use that, with sqlite as a naive fallback

This is where we might look to something like SQLAlchemy, PeeWee or pony. It's also where we might actually change some of the operations encumbered on TradeDB so that it doesn't have to in-memory everything.

A variant of (a) in the above: we literally just make TD keep itself alive over probably a non-SQL persistent store (either embedded or a second process; key-value or NoSQL: leveldb, rocksdb, memcache) and then have a very simple API for talking to it that it can use on itself; either through JSON-RPC with something like flask/eel, or literally have it read stdin and accept/process trade commands and pass them to the existing command-line arg parser:

> trade.py repl
trade@SOL$ import -P eddb -O solo

via pseudo code:

import shlex
import sys

for line in sys.stdin:
  args = ["trade.py"] + shlex.split(line)  # deals with quotes etc
  main(args)  # <-- the old main, which can't tell the args didn't actually come from the command line

I don't know off the top of my head where we can leverage this right now to any good use. I created an experimental Rust-based Python package providing a Rust version of the "count lines in file" method. That's a hard one to optimize because CPU isn't the main bottleneck. I did a version where it would, on startup, create a number of counting workers, and then you simply gave them byte-regions to count from.

And there was no version where it wasn't slower with concurrency. I wasn't exhaustive by any means, but you're trying to apply a very simple set of logic and rush through memory, so there's not a lot of wiggle room.

kfsone commented 2 months ago

Which reminds me, I set this to run before I left for work. speed.tar.gz

that is uniquely ... lacking info lol. I've never seen it give that little information. what are the specs on the server?

eyeonus commented 2 months ago

Which reminds me, I set this to run before I left for work. speed.tar.gz

that is uniquely ... lacking info lol. I've never seen it give that little information. what are the specs on the server?

That's not the server, that's my machine.

Tromador would have to provide you with any server runs.

eyeonus commented 2 months ago

@eyeonus The visual stuff would ultimately be about code reduction - if rich had existed 10 years ago I'd not have written progressbar etc. I was looking at the transfers code, and it feels like half that module is just about trying to get "requests" imported. That's covered now by the requests.txt file, right?

??? What requests.txt file?

GUI I always figured that would end up being web-based but the web-app concepts at the time weren't really much use and I had reasons not to really want to get too much into html. But perhaps 'gradio' would be an option.

I was considering using pySimpleGUI, I'll have to look at gradio, never heard of it.

So that brings us to the meaningful work, and I guess I always suspected TD would find itself struggling for trying to be a cold-runner rather than some kind of service.

Spitballs, consider each individually not as a sequence:

* Incremental imports

Since the speed of the import goes down as the table gets bigger, this sounds like a great idea. The vast majority of the information goes into StationItem: all the other tables, completely filled with the Spansh data, don't amount to even 0.5 GB, but add in the market data and it grows to ~7 GB. So while I doubt we'd see much improvement doing this on any of the other tables, it might be a big help for that particular one.

* Persistent database
* Persistent td/REPL

I"m sorry, that's all over my head. I defer to your expertise.

kfsone commented 2 months ago

| I"m sorry, that's all over my head. I defer to your expertise.

Oh, just to make you regret saying that:

| ??? What requests.txt file?

... was a typo of "requirements.txt". What I meant was: in order for anyone to pip install Trade-Dangerous, it depends on requirements.txt on its own, right?

Duh. Gonna check that real quick, dum-me:

[screenshot]

so it should be safe to remove all that crap screwing around to make sure requests is installed (in transfers.py), haha

eyeonus commented 2 months ago

Oh. Yes.

kfsone commented 2 months ago

| I was considering using pySimpleGUI, I'll have to look at gradio, never heard of it.

I don't have useful input on GUIs really; 'gradio' is - I believe - what people are using to build web-based GUI apps in Python, so it's not sexy, but you spend a lot more time on what the GUI elements do than on the screen itself. So, for instance, "StableDiffusion" has a gradio-based UI, as does the popular "run your own local GPT" https://github.com/oobabooga/text-generation-webui... https://www.gradio.app/ Yeah, I know it says machine learning, but they're pitching to their new market because of those last two.

I've used flask for building UIs before but then you need to build an entire front-end application in angular or react or cut-my-wrists-for-me-now-please.

If you're going to be old school about it and do a desktopy app; wxWidgets is shit, but you can keep it relatively simple and use 'wxFormBuilder' to design the app - I've done that a few times lately: https://github.com/wxFormBuilder/wxFormBuilder.

So pySimpleGui might be rockstar - check that out before wxWidgets, because I hate it as much as I use it.

You could conceivably do something with Dear ImGui - I found it easy enough to use that I wrote wrappers for it in C++ (https://github.com/kfsone/imguiwrap) when I used it a few years ago. The Python version is OK, but its best use case imho is for in-game debug screens/menus etc.

So I'll look at doing some experiments on the item import and how it is currently working.

How often have frontier been adding items?

I wonder if it would make sense to actually alter the row format of StationItem so that there are two rows per station: one for selling, one for buying.

kfsone commented 2 months ago

Right now I'm trying a little experiment that reintroduces a per-station cursor; we're not building a list of items for executemany, but we're also not paying the full cost of adding each and every item in its own mini-transaction. Also breaking up the code a little bit, where we're not invoking methods in loops and we can afford the extra 40-50ns it takes to call a method in Python, lol.

So for right now I think it would be useful for us to look at:

kfsone commented 2 months ago
Win|PS> python -m apsw .\data\TradeDangerous.db
SQLite version 3.45.3 (APSW 3.45.3.0)
Enter ".help" for instructions
sqlite> select count(*) from item;
┌──────────┐
│ count(*) │
│ 395      │
└──────────┘
sqlite>

forget the row-shape idea then lol

kfsone commented 2 months ago

Just so you know - I'm not remotely above figuring out how to get an AI to help me, but I'm also not expecting it to be actual help. However... Giving it the ability to read the repository definitely makes it more useful: https://github.com/kfsone/ggama

kfsone commented 2 months ago
$ ls -1sh /dev/sda /dev/sdb
/dev/sda 1PB
/dev/sdb 15MB
$ rm -rf pride  # surplus to requirements
$ mv /sdb/shame /sdb  # storage upgrade required

[screenshot]

kfsone commented 2 months ago

Do we care about the added column any more?

eyeonus commented 2 months ago

Do we care about the added column any more?

I gave the Added TABLE an N=20

I never cared about that column. :D

eyeonus commented 2 months ago

Just want to say, the recent PR is definitely an improvement. Even with the additional 'optimization' step, it finishes processing both listings and listings-live faster than the old way could get through just listings.

I'm happy, and extremely grateful.

kfsone commented 2 months ago

Just doing an experiment with the "import to a different table first" strategy. It took 45 minutes to do the import the first time; then I repeated it, and it takes less than 1 minute to write the data to the unindexed table.

Now it comes down to how long it's going to take to merge the tables.

We may need to ditch the partial indexes.

CREATE INDEX si_itm_dmdpr ON StationItem(item_id, demand_price) WHERE demand_price > 0;
CREATE INDEX si_itm_suppr ON StationItem(item_id, supply_price) WHERE supply_price > 0;

and instead, have separate tables for demand and supply.

Or maybe we should drop the indexes before the transfer and re-add them.
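
The drop-and-recreate route would look something like this (sketch only; assumes an open sqlite3 connection named db and the index definitions above):

# Drop the partial indexes so the bulk transfer doesn't pay index maintenance per row.
db.execute("DROP INDEX IF EXISTS si_itm_dmdpr")
db.execute("DROP INDEX IF EXISTS si_itm_suppr")

# ... bulk transfer from the staging table into StationItem goes here ...

# Recreate them once, in a single pass over the finished table.
db.execute("CREATE INDEX si_itm_dmdpr ON StationItem(item_id, demand_price) WHERE demand_price > 0")
db.execute("CREATE INDEX si_itm_suppr ON StationItem(item_id, supply_price) WHERE supply_price > 0")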

https://github.com/kfsone/Trade-Dangerous/tree/kfsone/import-via-table -- check it out;

Expect low presence from me over the weekend, wife points low.

kfsone commented 2 months ago

I remember why I added those indexes - a major difference between a persistent database service and an embedded database process is that it causes a major hitch when you try to get a list of only the items that are being sold from stations, because the 'supply_price > 0' ends up being expensive in a WHERE clause, whereas a persistent database would deal with it.

But I don't remember what I added them for - as in where the pain point surfaces.

Option a: Refactor StationItem thus:

CREATE TABLE StationItemSupply (station_id, item_id, credits, units, level, modified);
CREATE TABLE StationItemDemand (station_id, item_id, credits, units, level, modified);
CREATE VIEW StationItem (station_id, item_id, demand_price, demand_units, demand_level, supply_price, supply_units, supply_level, modified) AS
... well, I'm not sure otomh ...

Option b: ditto, but change references to StationItem accordingly rather than creating the view,

Option c: add "in_demand" and "in_supply" bools and make the where based on that -- this becomes useful as long as we don't change the field;

Option d: add a separate table with the in_ flags since that might be easier to avoid updating.

Option e: use NULL -- that is, make one or more of the fields NULLABLE as a way to say "not present", the theory being that value and nullness are two different things, and it may be easier for the db to early-out when it detects that nullability did not change. This is an open question.

Option f: Switch to a different storage engine

kfsone commented 2 months ago

Another way to deal with this would be bitmaps: assign every station and item unique ids and then create bitmaps for each access. They'd be kind of big, but if you can mmap them then queries against them become pretty trivial, and the filesystem ought to be able to eliminate empty pages; or we use a hash-bucketing approach to constrain the quantity of pages.
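
A toy sketch of the bitmap idea (purely illustrative; "local id" here means a dense 0-based id rather than the FDev id - see the follow-up below):

def make_bitmap(n_bits: int) -> bytearray:
    return bytearray((n_bits + 7) // 8)

def set_bit(bitmap: bytearray, local_id: int) -> None:
    bitmap[local_id // 8] |= 1 << (local_id % 8)

def has_bit(bitmap: bytearray, local_id: int) -> bool:
    return bool(bitmap[local_id // 8] & (1 << (local_id % 8)))

# One bitmap per item: "which stations sell this item?"
stations_selling_gold = make_bitmap(250_000)   # hypothetical station count
set_bit(stations_selling_gold, 42)
assert has_bit(stations_selling_gold, 42)      # station with local id 42 sells gold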

eyeonus commented 2 months ago

We do use unique ids for each station and item, we're using the FDev ids, and those are unique.

kfsone commented 2 months ago

Ah, I mean a "local", 0-based unique id - we could probably use row_id, except we have no control over when those get recycled.

If we use the fdev ids - they get quite massive, and we don't want to create a bitmap which has 128,000,000 empty bits at the start to reach the first id :)

eyeonus commented 2 months ago

Oh. So the filesystem-eliminates-empty-pages thing, and the hash-bucketing thing, wouldn't eliminate the empty bits in that case? Or am I misunderstanding something?

ultimatespirit commented 2 months ago

Can I suggest we expand this concept:

The server side works. The client side works (apart from one warning, because a Rares station is in lockdown and doesn't appear in market data).

It works.

Just wanted to chime in that between v10.16.14 (2024-04-23) and v11.1.0 (2024-04-27) (it is quite frankly ridiculous that a new version is being created seemingly every single commit by an automated spam bot btw) this statement does not appear to be true anymore.

Since it's just about impossible to figure out what a stable release could be, I can't really test whether this is just my setup or not, but it's likely that at some point in the refactoring process (which is great to see by the way - cutting out the weird multiple layers of database formats would definitely make this clearer) the buildcache functionality completely broke. In an incredibly annoying way, too.

As I've found out, even doing $ trade.py import -P eddblink (as per the wiki) will not spare you from building the sqlite database, so any command will start trade.py buildcache -f, it looks like. This command then promptly fails from the rescue megaship problems above, causing it to crash.

Okay great, pruning the duplicates (only 5 of them or so) by age of last entry (none of their data is even from this year...) got it to continue. Somewhere between one and two hours of single-threaded Python parsing later, it parses the 4.1 GB prices file eddblink presumably created on initial import into a 6.8 GB data/TradeDangerous.new sqlite database, next to the already existing (I guess also downloaded by eddblink) 6.1 GB data/TradeDangerous.db file. Aaaand then it crashes from apparently closing its own database link:

$ ./trade.py run --from "any-system-here" --credits 50m --capacity 192 --ly-per 15 --ls-max 1100 --routes 3 --hops 5 --loop --progress --summary

NOTE: Rebuilding cache file: this may take a few moments.
NOTE: Import complete: 44257630 new items over 243,411 stations in 23,048 systems
Traceback (most recent call last):
  File "...Trade-Dangerous/./trade.py", line 44, in <module>
    cli.main(sys.argv)
  File "...Trade-Dangerous/tradedangerous/cli.py", line 66, in main
    trade(argv)
  File "...Trade-Dangerous/tradedangerous/cli.py", line 96, in trade
    tdb = tradedb.TradeDB(cmdenv, load=cmdenv.wantsTradeDB)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...Trade-Dangerous/tradedangerous/tradedb.py", line 618, in __init__
    self.reloadCache()
  File "...Trade-Dangerous/tradedangerous/tradedb.py", line 700, in reloadCache
    cache.buildCache(self, self.tdenv)
  File "...Trade-Dangerous/tradedangerous/cache.py", line 1004, in buildCache
    tempDB.commit()
sqlite3.ProgrammingError: Cannot operate on a closed database.

That's very likely caused by this db.close() line in the processPricesFile() function in cache.py:

7c60def3 tradedangerous/cache.py (Oliver Smith          2024-04-24 11:16:29 -0700  709)     tdenv.DEBUG0('Committing...')
9e7d4834 cache.py                (Oliver Smith          2014-12-25 23:24:00 -0800  710)     db.commit()
7c60def3 tradedangerous/cache.py (Oliver Smith          2024-04-24 11:16:29 -0700  711)     db.close()
d509fda7 tradedangerous/cache.py (eyeonus               2019-02-14 12:51:07 -0700  712)     

As called by cache.buildCache() here - note that tempDB is the db in the above function, which has now been closed (and it would attempt to close it again had the commit succeeded; a classic use-after-free):

b9be772e cache.py                (Oliver Smith 2014-11-27 03:37:23 -0800  995)     if pricesPath.exists():
7af37fa7 cache.py                (Oliver Smith 2014-12-02 20:41:11 -0800  996)         processPricesFile(tdenv, tempDB, pricesPath)
44b8f74f cache.py                (Oliver Smith 2015-01-01 12:47:52 -0800  997)     else:
44b8f74f cache.py                (Oliver Smith 2015-01-01 12:47:52 -0800  998)         tdenv.NOTE(
44b8f74f cache.py                (Oliver Smith 2015-01-01 12:47:52 -0800  999)                 "Missing \"{}\" file - no price data.",
44b8f74f cache.py                (Oliver Smith 2015-01-01 12:47:52 -0800 1000)                     pricesPath,
a7dba2f6 tradedangerous/cache.py (Oliver Smith 2024-04-21 21:25:52 -0700 1001)                     stderr=True,
44b8f74f cache.py                (Oliver Smith 2015-01-01 12:47:52 -0800 1002)         )
d509fda7 tradedangerous/cache.py (eyeonus      2019-02-14 12:51:07 -0700 1003)     
b6f3a49f cache.py                (Oliver Smith 2014-12-17 23:26:36 -0800 1004)     tempDB.commit()
8fb2c0b4 buildcache.py           (Oliver Smith 2014-11-15 21:33:09 -0800 1005)     tempDB.close()
7c60def3 tradedangerous/cache.py (Oliver Smith 2024-04-24 11:16:29 -0700 1006)     tdb.close()

I'm a bit confused about what the data/TradeDangerous.db file I have is from, for there to be 44,257,630 new items, tbh. Based on the git issues here it sounds like the EDDBlink server spends a lot of time importing the several-GB JSON dump file to make a database of the latest data that users can consume. Perhaps the wiki is just very out of date and the way to do that is different, but my usage instead downloaded a several-GB set of files from that server to... spend a lot of time making the latest database (that then crashes), with nearly as many new items as the entire length of TradeDangerous.prices.

Oh, and to make this even better, it sees the first crash as a legit database that didn't crash, I'm guessing, as I have a data/TradeDangerous.old of 144 KB that doesn't get replaced by the TradeDangerous.new file upon retry; instead it just clobbers TradeDangerous.new, so 2 hours of processing down the drain, oops. My fault for not backing everything up every step of the way for fear of the program shooting itself in the foot.

TL;DR / Summary section

There's some diagnostic stuff above: you have a use-after-free error introduced in commit 7c60def3, presumably as debugging. Then some "hey, as a prospective user who isn't also an active developer on this" feedback: this current development spike is great, but your safe-release processes are lacking, making this not very usable at the moment.

I installed via git clone ..., but your PyPI repo is set up to auto-publish every erroneous tag that spam bot publishes. It goes from version 10.13.10 published on 2023-02-13 to 10.14.2 on 2024-03-23, and then 32 new versions from 2024-04-15 to 2024-04-27. At v11.0.0 the breaking bug from 7c60def3 gets added in. All versions afterwards are guaranteed broken (this bug clearly hadn't been found yet). With git clone ... I can somewhat see a reasonable workflow of saying "our default branch is the dev branch, checkout a tag if you want stability", but pip install... is right now completely broken too.

Because there is no stable release. Releases are autogenerated from the current pinned branch release/v1 by that GitHub bot, and you all have been developing on release instead of a development branch. I'm guessing v10.14.2 is the last "stable" and working release, as that should be when the spansh update info was added in, but I'm not really sure; that's just a guess from reading the GitHub issue threads.

Please, either disable that tag bot or start developing on a dedicated development branch rather than all in one branch. Or both, both is pretty good too. And then you'll need to publish another breaking change major version v12.0.0 that rolls back to whatever is the last known good stable release, or at least somewhat stable, for pip.

Oh, also, it would be quite nice if the for item in items: ... db update loop in processPricesFile() had at least some progress indicator; something like printing how many items are left every len(items)//10 items would be grand. Being able to restart from where it left off rather than clobbering the DB would be great too, but understandably that's likely more complicated to handle consistently. Aaaand in case y'all weren't aware somehow: the type hints are great, but you have type errors being exposed by them. Either functions were too narrowly typed (since the code somehow works), or there are logic errors going uncaught - though that's understandable if it's just being treated as a complete debug/dev branch, warts and all.

And, I apologise if the tone comes off a bit miffed, debugging this mess was quite the chain of "why is it doing that?" frustrations I wasn't expecting to need to do for a supposedly stable tool that looks really cool and I'd love to be able to use and support the use of it, not helped by it deleting its own database. Y'all are doing great stuff. Just, pretty please, stop developing on and pushing to "release", or stop the bot spamming release tags every commit, or both.

eyeonus commented 2 months ago

Thank you for catching that. I had no idea that error was there; obviously I haven't been testing the code well enough. As far as the "developing on release" thing, that's also entirely me - Oliver has been developing on his own private branches and issuing PRs.

As for the bot, that's not going away. That said, since I shall now be endeavoring to work on branch-and-merge rather than work on main, you'll be seeing a lot fewer releases being made. It doesn't make a new release for every commit, but I won't get into the details of how it decides to do so, as that's probably a bunch of information you don't care about. If you do, look up 'python-semantic-release'; it'll do a better job explaining than I could anyway.

eyeonus commented 2 months ago

I'm currently testing my fix to the cannot operate on closed DB issue you found.

ultimatespirit commented 2 months ago

I just finished confirming the full rebuild finished and trade.py worked again with this change:

$ git diff
diff --git a/trade.py b/trade.py
old mode 100644
new mode 100755
diff --git a/tradedangerous/cache.py b/tradedangerous/cache.py
index 9523438..449c81c 100644
--- a/tradedangerous/cache.py
+++ b/tradedangerous/cache.py
@@ -668,7 +668,11 @@ def processPricesFile(tdenv: TradeEnv, db: sqlite3.Connection, pricesPath: Path,
     removedItems = len(zeros)

     if items:
-        for item in items:
+        progressSize = len(items)
+        progressMark = progressSize//10
+        for (nu, item) in enumerate(items):
+            if nu % progressMark == 0:
+                print(f"Processing cache items, {nu}/{progressSize} = {nu//progressMark}% (10% = {progressMark})")
             try:
                 db.execute("""
                     INSERT OR REPLACE INTO StationItem (
@@ -708,7 +712,6 @@ def processPricesFile(tdenv: TradeEnv, db: sqlite3.Connection, pricesPath: Path,

     tdenv.DEBUG0('Committing...')
     db.commit()
-    db.close()

     changes = " and ".join("{} {}".format(v, k) for k, v in {
         "new": newItems,

The only necessary change is the removal of the extra db.close() in processPricesFile(); the print statements are just there for the progress estimator I spoke of above. Obviously if you did want more progress marks there, it should use tdenv.INFO() or whatever rather than prints - they're just included since they were already in my diff output.

kfsone commented 2 months ago

@ultimatespirit appreciate the diligence and the detailed analysis; apologies to both of you for the churn I've caused lately - @eyeonus I'll try to do better about making less work for you man. I wanna make sure you don't think I'm gonna be a diva if you have to put a flea in my ear or point out I'm a nub.

| Oh. So the filesystem-eliminates-empty-pages thing, and the hash-bucketing thing, wouldn't eliminate the empty bits in that case? Or am I misunderstanding something?

The eliminates-empty-pages thing would be "sparse files", but when I say "empty pages" I'm thinking in terms of a handful, as opposed to the file starting with 3906 empty pages (the pages required to represent the 128,000,000 unused IDs at the start of the ID space).

But I'm spitballing there; the hash-bucket approach has pros and cons, primarily that it requires a computation to get from the actual fdev-id to a bit number in the bitmap, and after doing it the resulting bitmap is only good for a one-way lookup: the "stations selling this item" bitmap on an item object can be used to go from a station id to a quick bit-check of "does this station have this item", but it might not be possible to do the reverse - to say "bit 40401 is set, what does that mean?" - because there isn't guaranteed to be a meaningful way to reverse the algorithm and calculate which station id bit 40401 represents.

I did some experimentation to try and find algorithms that might work but at the end of the day I think it will be easiest to just have some kind of per-instance hidden column or table that collapses fdev_ids into near-sequential smaller numbers. I'll table those for an actual presentation of implementation and thinking when I have opportunity to put it together.

kfsone commented 2 months ago

The reason there's no progress display during the listings.csv download is that @Tromador's server isn't giving a size on the download, and the transfers.py code decides not to try and represent the progress.

The rich progress.py changes I have in the "import-via-table" branch at least have a go at this, and we can improve on it:

[screenshot]

and the Processing stage is night-and-day. The nice thing about using rich is that the progress stuff happens in a thread so you aren't bleeding away performance in the tight loop.

https://github.com/eyeonus/Trade-Dangerous/assets/323009/b23fe8c3-26fd-47a7-a70c-73bb24bf840a

It also uses these to let you know it's still working when it gets to rebuilding the index at the end.

kfsone commented 2 months ago

(didn't mean to throw you under the bus @Tromador - I should have said: our call to trom's server for the listings.csv isn't getting length information -- likely that there's a header missing in the request or something)

kfsone commented 2 months ago

[screenshot]

ultimatespirit commented 2 months ago

The eliminates-empty-pages thing would be "sparse files", but when I say "empty pages" I'm thinking in terms of a handful, as opposed to the file starting with 3906 empty pages (the pages required to represent the 128,000,000 unused IDs at the start of the ID space).

Not sure what the pattern is for fdev IDs, but you can remove the starting empty spots by just reducing all IDs by whatever the lowest fdev ID is, i.e. if the IDs start at 128,001, just offset all IDs by 128,000 as your zero index. From there, though, it's still not necessarily efficient to use a bitmap: are IDs always increasing and densely packed? If they aren't, and there are holes in the range, you'd need to non-trivially map them by order - 128,001 -> 0, 128,010 -> 1, etc. - and that has no easy conversion back that doesn't involve at least an array access (an array of IDs matching indices to bitmap positions). Which then runs into the next question: if it's sparsely populated, does FDev fill in holes later? Because if they ever fill in a hole, well... you have to completely rebuild that mapping - not fun to update an existing database to handle, though perhaps a worthwhile one-time cost if it has enough performance benefits and is considered a rare event.
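
A tiny sketch of that order-based mapping and the array needed to reverse it (hypothetical IDs, not real FDev data):

fdev_ids = [3_700_123, 128_001, 128_010]          # sparse, unordered, hypothetical
fdev_of_local = sorted(fdev_ids)                  # index -> fdev id (the reverse-lookup array)
local_of_fdev = {fid: i for i, fid in enumerate(fdev_of_local)}   # fdev id -> bitmap position

assert local_of_fdev[128_010] == 1
assert fdev_of_local[1] == 128_010
# If FDev later issues an id that lands inside the existing range, every position
# after it shifts - that's the "completely rebuild the mapping" cost mentioned above.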

The reason there's no progress display during the listings.csv download ...

Also, in case that was in response to my talking about progress above: I was actually referring to the DB rebuild / prices file processing steps, less so the downloads. But your prototype for more progress messages looks great. I noticed the new UI when spinning up the listener earlier today (though I had to turn it off shortly after, as it seems to download the spansh dump every hour even if it was listening to EDDN the entire time in between - which is what spansh uses too?? I'm guessing there's some edge case requiring that...).

kfsone commented 2 months ago

| I was actually referring to the DB rebuild steps / prices file processing steps,

The 10s video I posted is actually it processing the csv file, not downloading it.

I didn't record the tail part because that's the part I'm working on, and at the moment it just sits there for 25 minutes showing this:

[screenshot] and then a few more showing this: [screenshot]

(but the spinner is animated the entire time as is the timer)

@eyeonus the cause of this is these two indexes in particular: [screenshot]

I can look at splitting the data into two tables so it doesn't need that index, if that's useful. It'll mean my doing some diligence on the various commands that access price information from the StationItem table to get it right.

kfsone commented 2 months ago

| Not sure what the pattern is for fdev IDs, but you can remove the starting empty spots by just reducing all IDs by whatever the lowest fdev ID is. I.e. if the IDs start at 128,001 just offset all IDs by 128,000 as your zero index. From there though it's still not necessarily efficient to use a bitmap, are IDs always increasing and densely packed? If they aren't, and there are holes in the range, you'd need to non-trivially map them by order, 128,001 -> 0, 128,010 -> 1, etc. and that has no easy conversion back that doesn't involve an array access at least (have an array of IDs matching indices to bitmap positions). Which then runs into the next question, if it's sparsely populated does fdev fill in holes later? Because if they ever fill in a hole well... you have to completely rebuild that mapping, not fun to update an existing database to handle, though perhaps a worthwhile one-time cost if it has enough performance benefits and is considered a rare event.

Yeah, that's why I think something akin to row-id is "good enough". Its sole purpose is as a guide for the local bitmap - which is not something that should ever get shared; it's entirely local stateful stuff.

My import just got to the regenerating-.prices-file part, which doesn't have a progress bar [screenshot], but the way the rich stuff works let me get rid of a bunch of code while also adding a context-manager styling that makes it easy to add more progress bars, so you do something like:

with progress.Progress(None, 40, style=pbar.CountingBar) as pbar:
    ... regenerate prices() ...

and that's enough to add a concurrent progress spinner/time-elapsed bar.
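
For comparison, doing it with rich's stock columns directly is only a few lines (a sketch; progress.Progress / pbar.CountingBar above are TD's own wrapper, while this uses plain rich):

import time
from rich.progress import Progress, SpinnerColumn, TextColumn, TimeElapsedColumn

with Progress(SpinnerColumn(), TextColumn("{task.description}"), TimeElapsedColumn()) as prog:
    prog.add_task("Regenerating .prices file...", total=None)   # total=None -> indeterminate
    time.sleep(2)   # stand-in for the real work; the spinner and timer keep animating in a thread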

eyeonus commented 2 months ago

(didn't mean to throw you under the bus @Tromador - I should have said: our call to trom's server for the listings.csv isn't getting length information -- likely that there's a header missing in the request or something)

That's certainly possible; the header we pass is headers = {'User-Agent': 'Trade-Dangerous'}
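
One quick way to check whether the server actually reports a length (sketch only; the URL is a placeholder, not the real server path):

import requests

resp = requests.head("https://example.invalid/listings.csv",    # placeholder URL
                     headers={'User-Agent': 'Trade-Dangerous'},
                     allow_redirects=True,
                     timeout=30)
print(resp.headers.get("Content-Length"))   # None here means the client has no total for a progress bar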

eyeonus commented 2 months ago

I'm fine with using a row_id if you think it'll help things. StationItem currently uses (station_id, item_id) as the PK, so adding a row_id to the table shouldn't bork anything, except maybe the two unique ids thing with the export?

kfsone commented 2 months ago

They would be something totally transient/private to the data set, almost even to the particular generation. I've been running various experiments/investigations but I probably need to actually create a couple of branches in my fork and try some of them out in more detail.

for instance - I took the hash-bucket thing for a spin, trying to find if there are numbers that would actually work for us, and I actually found some cheap algorithms for transforming the range of fdev_ids in my local Station table into a 0-N range of unique numbers that essentially required 3-bits of bitmap per station to be able to create a bitmap for each item saying "these are all my stations" - so the indexes we currently have could be condensed down from 64-bits-per-station-vs-item to ~3-bits-per-station on each item's row.

I ran an import, which added a station ... ran my test again, and the new station collided with another id after running the algo again. son of a...

there's a python version here https://gist.github.com/kfsone/f7f7de6d7c0216cf44cdf97b526a474e and a rust version (cause the python version was gonna take several hours to run, I should have tried it under pypy instead) https://github.com/kfsone/tinker/tree/rust

---------- it's late, I'm blathering ... sorry.

The current big hurt has to be the two indexes that have to be present on the StationItem table to make it practical for buy/sell to distinguish only the rows important to them. At a smaller scale it was actually better having one table with sell/buy prices because you avoided a join, but now it forces us to maintain a massive conditional index. power hurt assemble

So I can focus on splitting that into two tables on that import-via-table branch over the next several days, if you like. It currently has the new progress bar stuff in it because I wanted/needed the extra visibility while running the long tests.

Rich is pretty frickin' cool: https://rich.readthedocs.io/en/stable/ - it brings a ton of power to the table, down to the ability to generate a recording of an app's output, and almost all the capabilities are packaged behind self-demoing modules:

python -m rich.progress
python -m rich.live
python -m rich.table
python -m rich.logging
python -m rich.tree
python -m rich.spinner
python -m rich.status
python -m rich.layout

Use it as a terminal markdown viewer: python -m rich.markdown README.md

I had written various bits and pieces of this stuff into TD early on, but I'm awful at UI so none of it is great, and switching it over to using rich should improve quality/appearance/speed and reduce the amount of maintainable code at the same time.