Open rouilj opened 2 years ago
First off, I would like to make "average time on page" a supported feature; I did some work on that last year but it stalled a bit because, well, reasons.
Adding custom events is still useful for of course, for all sorts of other things as well; there's an issue for it at #191. I'd first like to make the storage better because every time I add something like this it adds gigabytes of data.
This is kind of where we're running in to the limits of what you can do with a "regular" SQL database. SQL works fairly well and the upshot of SQL is that you have just one database, which is much easier to deploy and manage. This is why I stuck with it thus far, but we need to use a few tricks now to make it performant, at the expense of increased storage costs and some limitations on what you can query. For example you can't select "show me all pageviews from Firefox" right now because adding this would add gigabytes of extra indexed data.
With custom events, the problem is similar. We can't query the hits
table directory because that's far too slow if your site grows, so we need a new table and you will have limitations on what you can do with it.
For example you can't select "show me all pageviews from Firefox" right now because adding this would add gigabytes of extra indexed data.
Isn't that something like:
select paths.path from
browser_stats inner join paths on browser_stats.path_id = paths.path_id
where browser_id in (select browser_id from browsers where name = 'Firefox');
maybe with a where clause on browser_stats.day (browser_stats.day > Date('2022-06-05')`).
It would be nice to alter the schema so that each custom field was its own column. This would allow use of the standard aggregate functions without the overhead of parsing the json or the extra size from a string representation.
But adding a data column to the hits table with a json document as the value seems like a good first pass at storing the extra data.
Isn't that something like:
select paths.path from browser_stats inner join paths on browser_stats.path_id = paths.path_id where browser_id in (select browser_id from browsers where name = 'Firefox');
Well, that query takes about 6 seconds for my site, which is far from the largest site. It could be made a bit faster, but that's the issue.
Might that improve if you add an index on browser_stats.browser_id or both browser_stats.path_id and paths.pathid?
Looking at the current schema the only way I can see the inner join working is via a full table scan. I think adding an index on the join properties should result in the planner scanning the corresponding index to determine which rows are suitable candidates.
But I am a db novice so.....
Well, the Goatcounter is absolutely great software and... the (average) time on page is the only missing feature!
I mean, it is a huge difference between "1000000 unique visitors viewed this page" and "950000 visitors spend less than 20 seconds on this page; only 50000 spend more than 20 seconds"
Reading the discussion above - as far as I understand the Javascript/frontend part is not a problem - the time on page can be measured. The problem is how to effectively store it in the database.
Looking at the database schema in the schema.gotxt I see a number of tables having similar structure
create table some_stats (
site_id,
path_id,
some_key_field1, // sometimes this is "day"
some_key_field2,
some_counter
)
It seems, we can add such a table for the (average) time on page. I mean: let us replace "some_counter" field with "time_on_page_seconds".
@arp242 @rouilj - what do you think?
we need a new table and you will have limitations on what you can do with it.
@arp242 - I am absolutely fine having some limitations!
From my point of view - the most important thing is to store the values (time on page) in the database. After this: the Goatcounter dashboard may (by default) show some basic statistics about "time on page".
In case somebody wants advanced statistics: there might be two alternatives:
Some of the pages I am tracking are documentation/tutorial pages that double as reference pages. I am considering tracking the time spent on the page. I hope this will give me some insight on how people are using the page: are they reading it, or looking for a particular piece of info and leaving. I hope to redesign the docs for better usability and am trying to figure out how they are used currently.
My thought was to log this as an event using the following method:
Number(new Date()) - window.page_load_start
would this work?
If so how are events displayed?
I assume I would need to write some SQL and do a little programming to:
Tunneling the time on the page in the title attribute is a hack, but this way I can use sql to compute average time on page as it will be treated as an integer/number. Putting it into the page path would make this more difficult. Ideally I would like to be able to do:
and have a new column in the paths table called "data". Fields not recognized by goatcounter ( e.g. not title, path ...) would be placed in a the data column as a well formed json document. I assume each event that comes in creates a new row in the paths table, if that is not the case maybe the data column needs to be in the hits table.
This would allow me to use the json function in either postgresql or sqlite to extract the data. e.g. sqlite-ish examples would be:
or
or some similar constructs.