eyeonus / Trade-Dangerous


SEVERE performance issues doing EDDBlink import when DB is large. #126

Open eyeonus opened 2 months ago

eyeonus commented 2 months ago

Not sure what the problem is, but there's a huge difference in processing time when doing an import using the EDDBlink plugin when the database file is large vs. small. To wit:

TradeDangerous.db file size: 6.5 GiB
listings-live.csv file size: 10.3 MiB
Time to completion: ~24 minutes

NOTE: Processing market data from listings-live.csv: Start time = 2024-04-23 11:14:51.958359
#Getting total number of entries in listings-live.csv...
#Getting list of commodities...
#Getting list of stations...
#Processing entries...
NOTE: Finished processing market data. End time = 2024-04-23 11:38:17.200988

versus:

TradeDangerous.db file size: 43.7 MiB (empty StationItem table, otherwise identical to the above database)
listings-live.csv file size: 10.3 MiB (same file as above)
Time to completion: ~7 seconds

NOTE: Processing market data from listings-live.csv: Start time = 2024-04-23 12:20:00.816731
#Getting total number of entries in listings-live.csv...
#Getting list of commodities...
#Getting list of stations...
#Processing entries...
NOTE: Finished processing market data. End time = 2024-04-23 12:20:07.871285

Everything is exactly the same in both cases except for the size of the StationItem table in the database being imported to.

eyeonus commented 2 months ago

That all sounds great. Feel free to work on the two-table thing, and if it turns out to be much better than the current implementation, I'll consider that a win.

Tromador commented 2 months ago

(didn't mean to throw you under the bus @Tromador - I should have said: our call to Trom's server for the listings.csv isn't getting length information -- likely there's a header missing in the request or something)

So long as it's an old school Routemaster, I will happily be under the bus.

The reason the server doesn't send a Content-Length header is that it streams listings.csv as gzipped content, and because it's doing the compression on the fly, it doesn't actually know the size of the data it's ultimately going to send.

I guess the listener could gzip listings.csv, and then Apache would know how much data it's sending?
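
(For illustration, a pre-compression step like that could be as simple as the following sketch; the file names and helper are assumptions, not the listener's actual code.)

import gzip
import shutil

# Hypothetical helper: compress listings.csv once per update, so Apache can
# serve the static .gz file with an accurate Content-Length header.
def compress_listings(src="listings.csv", dst="listings.csv.gz"):
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)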

I have a terrible feeling of déjà vu here. See, I have sarcoidosis, and one of the many symptoms is brain fog; I have become much more forgetful than I once was. But I have an itch in the back of my brain which suggests we've been here before. Why did we choose to have the webserver do on-the-fly compression instead of just creating compressed files?

Tromador commented 2 months ago

I guess if the listener compressed the files and Apache sent listings.csv.gz, then eddblink would have to uncompress it.

As it is, the download is handled by urllib, which reads the headers and uncompresses on the fly as the file downloads. Ultimately it's about saving a lot of bandwidth (and thus download time), given the innate compressibility of text.
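
(The thread's downloader uses urllib, but for illustration the same streaming-decompress pattern in requests looks roughly like this; the URL handling and chunk size are placeholders.)

import requests

def download_listings(url, dest="listings.csv"):
    # iter_content transparently gunzips each chunk as it arrives, so the
    # file on disk is plain CSV while only compressed bytes cross the wire.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 16):
                out.write(chunk)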

kfsone commented 2 months ago

Interesting: We do an initial probe for the timestamp of the file using urllib.request with a POST, so I could capture the "actual length" to give me a guide.

But I should probably also eliminate that separate query, because it means we have two - possibly international - SSL-based queries going on (it takes about 2s from here in CA to determine whether there's an update), and worse, the probe actually starts sending the file - so it's doubling the bandwidth use (and is why the actual downloads start slow for the end user).
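
(One way to fold the probe into the download itself would be a conditional GET: send If-Modified-Since and let a 304 answer the "is there an update?" question in a single round trip. A sketch, with the date handling purely illustrative:)

import requests
from email.utils import formatdate

def fetch_if_newer(url, last_mtime):
    # 304 means the server copy is no newer; 200 starts streaming the file
    # immediately, with no separate timestamp query beforehand.
    headers = {"If-Modified-Since": formatdate(last_mtime, usegmt=True)}
    resp = requests.get(url, headers=headers, stream=True, timeout=60)
    if resp.status_code == 304:
        return None
    resp.raise_for_status()
    return resp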

kfsone commented 2 months ago

I can save a bunch of bandwidth here by switching from urllib.request.urlopen to requests.head:

[screenshot of the requests.head response headers]

and this is actually good stuff for the downloader, because we want the uncompressed size, since the downloader never sees the compressed data in the first place.

Win win win.

kfsone commented 2 months ago

I name this branch: make imports lightning fast

eyeonus commented 2 months ago

Interesting: We do an initial probe for the timestamp of the file using urllib.request with a POST...

Yeah, that's totally my fault. I didn't know any other way to do it.

Tromador commented 2 months ago

But I should probably also eliminate that separate query, because it means we have two - possibly international

Server hosting kindly donated by Alpha Omega Computers Ltd who are in the UK. Until I got sick, I was a company director there. The deal is that I get free hosting and they get to call me to ask dumb questions once in a while.

Tromador commented 2 months ago

I can save a bunch of bandwidth here by switching from urllib.request.urlopen to requests.head:

Pretty sure that 'text' is not a valid encoding standard - see the IANA HTTP content coding registry.

If it is allowed (or works anyway), make sure you don't use it for the download, as you will be telling the server you only grok plain text and can't accept compressed content, and it will presumably comply, sending the whole thing in the requested, uncompressed format.

eyeonus commented 2 months ago

I can save a bunch of bandwidth here by switching from urllib.request.urlopen to requests.head:

Pretty sure that 'text' is not a valid encoding standard.

I tested it, and it didn't give any errors. Progress bar showed up and everything.

kfsone commented 2 months ago

@Tromador Just wasn't sure how to tell it to not encode. And this is only for an HTTP HEAD request:

requests.head(url, headers={...})

Should probably give transfers.py the ability to open and maintain a "Connection" so that it can turn around the queries more rapidly - at the minute it has to do the https handshake on all of them, which takes time and costs cpu/bandwidth it doesn't strictly need to, but I don't think it's hurting anyone atm, so nice +19 :)
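
(A sketch of that idea with requests.Session, which keeps the connection alive between the probe and the download; the function names are mine, not the actual transfers.py API.)

import requests

# One shared Session reuses the TCP/TLS connection, so only the first
# request pays the handshake cost; later queries turn around much faster.
_session = requests.Session()

def head(url):
    return _session.head(url, timeout=30)

def get(url):
    return _session.get(url, stream=True, timeout=60)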

Tromador commented 2 months ago

@Tromador Just wasn't sure how to tell it to not encode.

I think you want 'identity' then, which says "I want the original with no encoding whatsoever".

@eyeonus It may work, but that doesn't make it right. I have been through the RFC, and it refers right back to the list in the link I gave above. Doing things correctly per the standards matters for the day such a loophole is plugged in a patch and suddenly it doesn't work. I had a good trawl on the net looking for "text" in the context of an "Accept-Encoding" header and can't find it. If you can, I am happy to be educated.
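
(Folding that correction into the earlier snippet, a hedged version of the size probe might look like this; note that a server is not obliged to send Content-Length at all, so callers should treat a missing header gracefully.)

import requests

def probe_size(url):
    # 'identity' requests the unencoded representation, so Content-Length,
    # when the server honours it, is the real uncompressed file size.
    resp = requests.head(url, headers={"Accept-Encoding": "identity"}, timeout=30)
    resp.raise_for_status()
    return int(resp.headers["Content-Length"])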

eyeonus commented 2 months ago

No, in this case I'm the one to be educated.

kfsone commented 2 months ago

No! It were me as wuz educated. Ready for a round of "wen ah were a lad?"

Tromador commented 2 months ago

No! It were me as wuz educated. Ready for a round of "wen ah were a lad?"

Well, I am a Yorkshireman :)

kfsone commented 2 months ago

I'm from a place called Grimsby which is neither Up North, Down South, Off East or Out West. It has its own ordinality: Theranaz. People from Grimsby are Up-In Theranaz.

GRIMSBY (n.)

A lump of something gristly and foul-tasting concealed in a mouthful of stew or pie. Grimsbies are sometimes merely the result of careless cookery, but more often they are placed there deliberately by Freemasons. Grimsbies can be purchased in bulk from any respectable Masonic butcher on giving him the secret Masonic handbag. One is then placed in a guest's food to see whether he knows the correct Masonic method of dealing with it. If the guest is not a Mason, the host may find it entertaining to watch how he handles the obnoxious object. It may be (a) manfully swallowed, invariably bringing tears to the eyes, (b) chewed with resolution for up to twenty minutes before eventually resorting to method (a), or (c) choked on fatally.

The Masonic handshake is easily recognised by another Mason incidentally, for by it a used grimsby is passed from hand to hand. The secret Masonic method for dealing with a grimsby is as follows: remove it carefully with the silver tongs provided, using the left hand. Cross the room to your host, hopping on one leg, and ram the grimsby firmly up his nose, shouting, 'Take that, you smug Masonic bastard.'

-- Douglas Adams, The Meaning of Liff

eyeonus commented 3 weeks ago

@kfsone I'm in the process of making the spansh plugin capable of filling the Ship, ShipVendor, Upgrade, and UpgradeVendor tables from the spansh data.

Because of the data that is (not) provided by spansh for the ships (cost) and upgrades (weight, cost), I want to change those tables to include the information that is available.

Current testing indicates that the new code works as expected on a clean run (no pre-existing DB), but with an already existing DB the tables won't match, which borks everything.

Do you know if TD already has code to do this?

If not, where would be the best place to have TD check whether the DB needs updating and, if so, drop the old table and add the new version? I could put it in the plugin itself, but it might be better to put it somewhere in tradedb or tradeenv?

eyeonus commented 3 weeks ago

Ideally, I'd like to see if ./tradedangerous/templates/TradeDangerous.sql matches ./data/TradeDangerous.sql and if not, apply the difference from the template to the DB. (So for example if the Ship table in the template doesn't match, TD will drop the DB Ship table and create the template's Ship table in the DB)
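
(A rough sketch of that comparison, assuming the caller has parsed the CREATE TABLE statement for each table out of the template .sql file; the helper names are hypothetical, not existing TD code.)

import re
import sqlite3

def normalise(sql):
    # Collapse whitespace so pure formatting differences don't register.
    return re.sub(r"\s+", " ", sql or "").strip().lower()

def table_differs(db_path, table, template_sql):
    # sqlite_master keeps the CREATE TABLE text each table was built with;
    # compare it against the statement taken from the template file.
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND name=?",
            (table,),
        ).fetchone()
    return normalise(row[0] if row else "") != normalise(template_sql)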

kfsone commented 3 weeks ago

Yes, there's a schema-migration system, but it's not particularly smart; basically it operates on the sqlite "pragma user_version". Take a look in cache.py.

In a bit of karmic bloody-nosing while I was working on it (4 weeks ago?), we suddenly needed to change our sqlite schema at work for the first time in 15 years, and I ended up writing a more advanced version there, but I'll see if the boss is OK with me donating it.

From the runbook, here's a "change 3" which recovers from a previous mistake but compounds it, and a "change 4" that finally fixes things.

MIGRATIONS = {
    3: [
        {
            "if": "SELECT COUNT(*) FROM pragma_table_info('History') WHERE name='Preview'",
            "eq": 0,
            "then": "ALTER TABLE History ADD COLUMN Preview BOOL;"
        },
    ],
    4: [
        "DELETE FROM History WHERE HighWaterMark IS NULL OR HighWaterMark < 1;",
        "ALTER TABLE History ADD COLUMN NewWaterMark FLOAT NOT NULL DEFAULT 0;",
        "ALTER TABLE History ADD COLUMN NewPreview BOOL;",
        "UPDATE History SET NewWaterMark = HighWaterMark, NewPreview = Preview;",
        "ALTER TABLE History DROP COLUMN HighWaterMark;",
        "ALTER TABLE History DROP COLUMN Preview;",
        "ALTER TABLE History RENAME COLUMN NewWaterMark to HighWaterMark;",
        "ALTER TABLE History RENAME COLUMN NewPreview to Preview;",
    ]
}
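
(For context, a minimal sketch of how a table like that might be applied against sqlite's user_version; this apply loop is my assumption, not the actual runbook code.)

import sqlite3

def migrate(conn, migrations):
    # Apply every change newer than the version stamped into the DB, in order.
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for target in sorted(migrations):
        if target <= version:
            continue
        for step in migrations[target]:
            if isinstance(step, dict):
                # Conditional step: run "then" only when the probe matches "eq".
                if conn.execute(step["if"]).fetchone()[0] == step["eq"]:
                    conn.execute(step["then"])
            else:
                conn.execute(step)
        # PRAGMA can't take bound parameters; target is an int from the table.
        conn.execute(f"PRAGMA user_version = {target}")
        conn.commit()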