arp242 / goatcounter

Easy web analytics. No tracking of personal data.
https://www.goatcounter.com
Other
4.48k stars 184 forks source link

Proposal for tracking time on page - attaching data to events #597

Open rouilj opened 2 years ago

rouilj commented 2 years ago

Some of the pages I am tracking are documentation/tutorial pages that double as reference pages. I am considering tracking the time spent on the page. I hope this will give me some insight on how people are using the page: are they reading it, or looking for a particular piece of info and leaving. I hope to redesign the docs for better usability and am trying to figure out how they are used currently.

My thought was to log this as an event using the following method:

  1. insert javascript in the head that saves a date element as soon a the page loads. Something like:
      <script>
      window.page_load_start = Number(new Date());
     </script>
  2. Attach an event on visibility change using the method in https://www.goatcounter.com/help/beacon adding a title that is the value of Number(new Date()) - window.page_load_start

would this work?

If so how are events displayed?

I assume I would need to write some SQL and do a little programming to:

  1. identify the sessions for the users
  2. extract page views and events for the session
  3. merge the unload event's time on page with the corresponding non-event page hit

Tunneling the time on the page in the title attribute is a hack, but this way I can use sql to compute average time on page as it will be treated as an integer/number. Putting it into the page path would make this more difficult. Ideally I would like to be able to do:

goatcounter.url({
            event: true,
            path: function(p) { return 'unload-' + location.host + p },
            time_on_page: Number(new Date()) - window.page_load_start,
            page_type: 'reference',  // or tutorial or documentation
        })

and have a new column in the paths table called "data". Fields not recognized by goatcounter ( e.g. not title, path ...) would be placed in a the data column as a well formed json document. I assume each event that comes in creates a new row in the paths table, if that is not the case maybe the data column needs to be in the hits table.

This would allow me to use the json function in either postgresql or sqlite to extract the data. e.g. sqlite-ish examples would be:

  select avg(json_extract(data, 'time_on_page'))  from paths where path LIKE 'customizing.html';

or

  select avg(json_extract(data, 'time_on_page'))  from paths where json_extract(data, 'page_type') = 'tutorial';

or some similar constructs.

arp242 commented 2 years ago

First off, I would like to make "average time on page" a supported feature; I did some work on that last year but it stalled a bit because, well, reasons.

Adding custom events is still useful for of course, for all sorts of other things as well; there's an issue for it at #191. I'd first like to make the storage better because every time I add something like this it adds gigabytes of data.

This is kind of where we're running in to the limits of what you can do with a "regular" SQL database. SQL works fairly well and the upshot of SQL is that you have just one database, which is much easier to deploy and manage. This is why I stuck with it thus far, but we need to use a few tricks now to make it performant, at the expense of increased storage costs and some limitations on what you can query. For example you can't select "show me all pageviews from Firefox" right now because adding this would add gigabytes of extra indexed data.

With custom events, the problem is similar. We can't query the hits table directory because that's far too slow if your site grows, so we need a new table and you will have limitations on what you can do with it.

rouilj commented 2 years ago

For example you can't select "show me all pageviews from Firefox" right now because adding this would add gigabytes of extra indexed data.

Isn't that something like:

select paths.path from
   browser_stats inner join paths on browser_stats.path_id = paths.path_id
  where browser_id in (select browser_id from browsers where name = 'Firefox');

maybe with a where clause on browser_stats.day (browser_stats.day > Date('2022-06-05')`).

It would be nice to alter the schema so that each custom field was its own column. This would allow use of the standard aggregate functions without the overhead of parsing the json or the extra size from a string representation.

But adding a data column to the hits table with a json document as the value seems like a good first pass at storing the extra data.

arp242 commented 2 years ago

Isn't that something like:

select paths.path from
   browser_stats inner join paths on browser_stats.path_id = paths.path_id
  where browser_id in (select browser_id from browsers where name = 'Firefox');

Well, that query takes about 6 seconds for my site, which is far from the largest site. It could be made a bit faster, but that's the issue.

rouilj commented 2 years ago

Might that improve if you add an index on browser_stats.browser_id or both browser_stats.path_id and paths.pathid?

Looking at the current schema the only way I can see the inner join working is via a full table scan. I think adding an index on the join properties should result in the planner scanning the corresponding index to determine which rows are suitable candidates.

But I am a db novice so.....

stargazer33 commented 1 year ago

Well, the Goatcounter is absolutely great software and... the (average) time on page is the only missing feature!

I mean, it is a huge difference between "1000000 unique visitors viewed this page" and "950000 visitors spend less than 20 seconds on this page; only 50000 spend more than 20 seconds"

Reading the discussion above - as far as I understand the Javascript/frontend part is not a problem - the time on page can be measured. The problem is how to effectively store it in the database.

Looking at the database schema in the schema.gotxt I see a number of tables having similar structure

   create table some_stats (
      site_id, 
      path_id, 
      some_key_field1,  // sometimes this is "day"
      some_key_field2,  

      some_counter
   )

It seems, we can add such a table for the (average) time on page. I mean: let us replace "some_counter" field with "time_on_page_seconds".

@arp242 @rouilj - what do you think?

stargazer33 commented 1 year ago

we need a new table and you will have limitations on what you can do with it.

@arp242 - I am absolutely fine having some limitations!

From my point of view - the most important thing is to store the values (time on page) in the database. After this: the Goatcounter dashboard may (by default) show some basic statistics about "time on page".

In case somebody wants advanced statistics: there might be two alternatives:

  1. Export to CSV file, import into your own database, and run some SQL to get this statistics.
  2. Self-host Goatcounter and enable some feature flag "advanced_time_on_page_statistics" (when you self-host your database is much smaller because there is only one site in it)