Closed by marco-brandizi 1 year ago
We can go with option 1. It has all the benefits we need and is easy. Especially now that our gene page is client-side, I assume GA4 will capture its calls and geolocation too. This is the page linked from Ensembl, wheat-expression, wheatIS, GrainGenes, T3, etc.
Authentication for our APIs is something we should consider for the new KnetMiner architecture.
Thanks, @KeywanHP. Acutally, at the moment the client-side tracking is rather poor, there is only a tracking call when the UI opens. But we can expand it, add more fine-grained tracking and the like.
This should be complete now. @KeywanHP, check the [GA dashboard](): ci-test is sent to the site/property "Knetminer Test Site - GA4", and you can see live hits under Reports/Real Time.
Both API and UI tracking go to the Events section of that view. The main dashboard is also updated every 24h, and I can see the latter has a richer set of reports.
Event names are always prefixed with the source type (UI/API), the data source (wheat, aratiny, etc.) and the event type. Each event carries parameters (e.g., keywords, gene-list size).
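To illustrate, the naming convention above could be composed with a small helper like this. Note that the method name, the separator and the concrete event/parameter names are hypothetical, chosen here just to show the scheme, not taken from the actual KnetMiner code:

```java
import java.util.Map;

public class EventNaming
{
  /**
   * Builds a GA4 event name following our convention:
   * {source type}_{data source}_{event type}, e.g., api_wheat_genes.
   * The underscore separator is an illustrative assumption.
   */
  public static String eventName ( String sourceType, String dataSource, String eventType )
  {
    return String.join ( "_",
      sourceType.toLowerCase (), dataSource.toLowerCase (), eventType.toLowerCase () );
  }

  public static void main ( String[] args )
  {
    // An API search event on the wheat dataset, with hypothetical parameters attached
    String name = eventName ( "API", "wheat", "genes" );
    Map<String, Object> params = Map.of ( "keywords", "drought resistance", "geneListSize", 42 );
    System.out.println ( name + ", params: " + params.size () );
  }
}
```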
`datasets/poaceae-test` and `datasets/poaceae` are going to the same property, as was the case with the old UA. Maybe you want to have per-instance properties.
TODO: I need to move the above notes to the wiki/documentation.
Google announced a while ago that its current web-visit tracking service, named Universal Analytics (UA), will shut down in July 2023 and that everything should be migrated to the new service, named Google Analytics 4 (GA4).
After some investigation, I've found the following.
Migrating the UI calls is effortless: we are already using the new code, which is backwards compatible, and we just need to set a new ID to switch to GA4.
Migrating the API calls is a different beast, since this has to happen server-side, where there is no browser to run the complicated Google-provided JavaScript that sets up the calls to their API correctly. For this case, Google expects us to use what they call the Measurement Protocol (MP).
I've managed to make this work (not yet with code, just by playing with raw HTTP calls, but doing the same in Java is trivial). However, at the moment the MP does not track any geographical information: it doesn't consider the client IP, doesn't resolve it to a geographical location, and doesn't show anything about user provenance (for MP records). From what I've seen, I doubt they intend to support this in the near future.
This doesn't prevent us from sending the client's IP via the MP ourselves, attaching it as a call parameter (namely, as a parameter of their event object). One could then see a summary of the most frequent IPs that called our API (i.e., a table of IP/occurrences), even split per instance/dataset. But with the MP alone, that is all we would see.
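The MP call I tested via raw HTTP would look roughly like this in Java, using the JDK's own HTTP client. The measurement ID, API secret and client ID below are placeholders, and `client_ip` is our own custom parameter (GA4 just stores it as a string, it does not interpret it):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class MpCallSketch
{
  /**
   * Builds (but does not send) a GA4 Measurement Protocol request.
   * measurement_id and api_secret come from the GA4 property settings.
   */
  public static HttpRequest buildMpRequest (
    String measurementId, String apiSecret, String clientId, String eventName, String clientIp )
  {
    String url = "https://www.google-analytics.com/mp/collect?measurement_id="
      + measurementId + "&api_secret=" + apiSecret;

    // Minimal event payload; client_ip is our custom parameter, not a GA4 built-in
    String body = """
      { "client_id": "%s",
        "events": [ { "name": "%s", "params": { "client_ip": "%s" } } ] }
      """.formatted ( clientId, eventName, clientIp );

    return HttpRequest.newBuilder ( URI.create ( url ) )
      .header ( "Content-Type", "application/json" )
      .POST ( HttpRequest.BodyPublishers.ofString ( body ) )
      .build ();
  }

  public static void main ( String[] args )
  {
    HttpRequest req = buildMpRequest (
      "G-XXXXXXXXXX", "api-secret-here", "ws-client", "api_wheat_genes", "203.0.113.42" );
    // prints "POST www.google-analytics.com"
    System.out.println ( req.method () + " " + req.uri ().getHost () );
  }
}
```

Sending it is then just a matter of passing the request to an `HttpClient` instance; the sketch stops before that so it can be run without hitting Google.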
Another kind of information that should be easy to track via the MP is the domain of those who call our API: e.g., ebi.ac.uk would be easy to identify (but /ensembl wouldn't be).

Alternatives:
1. We could keep using GA, but give up on tracking the location of API calls. As said above, the UI-related records would still have this info, while the API records would have the IPs only. It would still be possible to download such IPs (e.g., as CSV/Excel) and pass batches of them to services like the ones mentioned in point 3 below, to obtain their geographical info offline, after they have been tracked. Note that, obviously, this is relevant only for clients that call our API programmatically; calls that come from the UI are fully tracked at least when the end user opens KnetMiner (by the UI tracking).
2. We could stop offering anonymous APIs and force everyone who wants to use our APIs programmatically to first have an account (even a free, non-premium one). This would allow us to give them an API token, which they would need to use, so that we could identify who they are very precisely. But again, this wouldn't help with map visualisation (in the Google Analytics dashboard). We wouldn't need to inflict this on UI users, only on API consumers.
3. We could do the IP-to-location resolution on our server (i.e., in our WS application) and then send the result to the MP. In other words: an API call arrives at our WS (i.e., the Docker container) with a given client IP; the WS finds the country (or town, or other geo detail) and calls GA/MP to record both the IP and the location info. The geo-resolution could be based on a service like GeoLite2 (or their paid, more precise service). This way, we would see info like the client countries, but Google would treat them as plain strings, so we could see a table of country/frequency, but still no map visualisation.
4. We could switch to something like OpenWebAnalytics, which is free and would run on our own servers. But it would give us more maintenance overhead (also, I still have to check its features and whether it has an API for server-side programmatic access).
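For the server-side resolution option (option 3 above), the enrichment step could look roughly like the sketch below. The resolver is a hard-wired stub standing in for a real GeoLite2 lookup (MaxMind ships a Java library for that), and the parameter names `client_ip`/`client_country` are assumptions, not GA4 built-ins:

```java
import java.util.HashMap;
import java.util.Map;

public class GeoEnrichSketch
{
  /**
   * Stub for an IP-to-country lookup. A real implementation would query a
   * GeoLite2 database here (e.g., via the MaxMind geoip2 Java library).
   */
  static String resolveCountry ( String clientIp )
  {
    return "United Kingdom"; // hard-wired placeholder
  }

  /**
   * Builds the MP event parameters: both the raw IP and the resolved country.
   * GA4 treats both as plain strings, so we get table views, not maps.
   */
  static Map<String, String> geoParams ( String clientIp )
  {
    Map<String, String> params = new HashMap<> ();
    params.put ( "client_ip", clientIp );
    params.put ( "client_country", resolveCountry ( clientIp ) );
    return params;
  }

  public static void main ( String[] args )
  {
    System.out.println ( geoParams ( "203.0.113.42" ) );
  }
}
```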
The first option is fairly quick to implement; the others require considerably more time. To be decided and sorted out, I'd say, by the end of May 2023.