arp242 / goatcounter

Easy web analytics. No tracking of personal data.
https://www.goatcounter.com

Schema design question #760

Closed: lpil closed this issue 3 months ago

lpil commented 3 months ago

Hello there! Thanks for this lovely bit of software!

I was reading the database schema and I was wondering why hits references the systems, browsers, and paths tables, rather than storing those values inline in one table.

My guess was that it is sacrificing some write performance to reduce database size. Is that a good guess? If so, did you record any impacts on performance and size by doing this?

Thanks, Louis

arp242 commented 3 months ago

My guess was that it is sacrificing some write performance to reduce database size.

Pretty much, yes. An int is 4 bytes. Many strings are longer than 4 bytes, and there are potentially millions or even billions of rows, so it adds up.

Saving about 50 bytes per row adds up to about 47G for every billion rows on the hits table alone. This is then also carried over to some of the other tables and some indexes, so in total it's much more.
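(To make the arithmetic explicit: 50 bytes/row × 1,000,000,000 rows = 50 GB, which is roughly 47 GiB.)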

In a quick check a hits row is 68 bytes (could be off by a few bytes), so saving 50 bytes per row is a fairly significant percentage.

For small to medium instances this doesn't really matter: the difference between 100M and 200M is negligible, and even 5G vs. 10G isn't that important in most cases. But for goatcounter.com it does matter, because it's 500G vs. 1T. It also helps query performance: the database has to read less from disk, can cache more in memory, etc.

Write performance has never really been much of an issue. I don't really see why writes should be slower? Less data to write should be faster, no?

lpil commented 3 months ago

I see! Thank you!

Write performance has never really been much of an issue. I don't really see why writes should be slower? Less data to write should be faster, no?

You have to do a get-or-insert on the supplementary tables before you can record a hit, right? So you'd be sending the same amount of data but performing a greater number of queries.
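For concreteness, here's a minimal sketch of the get-or-insert pattern being described, assuming a paths(path_id, path) table and PostgreSQL-style placeholders; the names are illustrative, not goatcounter's actual code:

```go
package main

import (
	"context"
	"database/sql"
	"errors"
)

// getOrInsertPath looks up the ID for a path string, inserting a new
// row if it hasn't been seen before. (Hypothetical sketch; the real
// schema and code may differ.)
func getOrInsertPath(ctx context.Context, db *sql.DB, path string) (int64, error) {
	var id int64
	err := db.QueryRowContext(ctx,
		`select path_id from paths where path = $1`, path).Scan(&id)
	if err == nil {
		return id, nil // Already recorded: reuse the existing ID.
	}
	if !errors.Is(err, sql.ErrNoRows) {
		return 0, err
	}
	// Not seen before: insert and return the new ID.
	err = db.QueryRowContext(ctx,
		`insert into paths (path) values ($1) returning path_id`, path).Scan(&id)
	return id, err
}
```

That's an extra query per hit (or two, for a new value), plus a race if two inserts run concurrently (a real version would use an upsert), which is part of why caching the IDs in memory pays off.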

arp242 commented 3 months ago

You have to do a get-or-insert on the supplementary tables before you can record a hit, right?

Yeah, but it caches a lot in memory:

browsers           189 items     19K
changed_titles   15255 items   3076K
loc               1795 items    344K
paths            35750 items   5678K
sites             3452 items   2559K
sites_host        3190 items    292K
systems             93 items      9K
ua                2496 items    488K

So about 12M. For a ~450G database, that's not too bad.

lpil commented 3 months ago

Ah! Do you have an in-memory dictionary of recently used paths, etc? Or is that the database's cache you're talking about?

arp242 commented 3 months ago

Just my own cache: a plain map with time-based eviction. Simple, but it works.

e.g.: https://github.com/arp242/goatcounter/blob/master/user_agent.go#L64-L70

The pattern is similar for most of it.
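For reference, a minimal sketch of such a map cache with time-based eviction, assuming string keys and a fixed TTL; goatcounter's own cache differs in the details:

```go
package main

import (
	"sync"
	"time"
)

// entry pairs a cached value with the time it was stored.
type entry[V any] struct {
	val V
	at  time.Time
}

// Cache is a simple map cache with time-based expiry checked on read.
type Cache[V any] struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]entry[V]
}

func New[V any](ttl time.Duration) *Cache[V] {
	return &Cache[V]{ttl: ttl, m: make(map[string]entry[V])}
}

// Get returns the cached value for key, or computes it with fill and
// stores it if the entry is missing or older than the TTL.
func (c *Cache[V]) Get(key string, fill func() V) V {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.m[key]; ok && time.Since(e.at) < c.ttl {
		return e.val
	}
	v := fill()
	c.m[key] = entry[V]{val: v, at: time.Now()}
	return v
}
```

Expired entries are simply recomputed on the next lookup; a background sweep could reclaim memory from stale entries, but for a few thousand small entries it hardly matters.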

lpil commented 3 months ago

Lovely. Thank you for the help!