ckan / ideas

[DEPRECATED] Use the main CKAN repo Discussions instead:
https://github.com/ckan/ckan/discussions

Activity Stream connected to ELK/Splunk for analysis #209

Open davidread opened 6 years ago

davidread commented 6 years ago

Admins should be able to analyse key activity in CKAN. This means creation / editing / deleting a dataset or organization, or user permissions being granted.

Example use cases:

The CKAN logs are mainly unstructured data. The Activity Stream is good, but it is incomplete and not easily accessible in the database.

Whilst we can continue to show the basic Activity Stream in CKAN's web interface, let's take advantage of external data analysis software to allow more advanced exploration, searching, filtering and graphing, rather than trying to build it into CKAN.

I propose:

(An alternative to a JSON log file would be getting the analysis software to talk to Postgres directly using JDBC, and setting up queries to do lots of joins to get the full Activity Stream. This relies a lot more on the analysis software having this capability, and on setting up the queries in it. I think it would be better to do the join query work in CKAN, with the result in a JSON log, which is much more flexible and easily shipped.)

Comments v. welcome!

I'm working with OpenGov to explore this, so in particular please chip in @jqnatividad @jhinds

TkTech commented 6 years ago

@davidread can the analysis software you're thinking of using consume web services? The activity stream is indexed on timestamp and can be paged by timestamp, making retrieval through the API fairly fast. It would be safer for future compatibility than relying on the database structure.

davidread commented 6 years ago

@TkTech good thinking. I found an extension for something like that for the ELK stack, and maybe the other data analysis software can do it too.

However, as I understand it, these apps only facet by the top level keys, so I think there is still a job to flatten data (e.g. promote username from the user dictionary to the top level, and add a key which is a list of all the resource formats). I'm keen for this to be available to any log analysis software, so rather than do the transform in the log analysis software, it might be better as a bit of python code in between the Activity Stream API call and the log file, which is the universal data format for log analysis software. So that might as well be done as a bit of CKAN or CKAN extension.
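A minimal sketch of that flattening step, to make the idea concrete. The shape of the activity dict here is an assumption for illustration (a nested `user` dictionary with a `name`, and the dataset stored under `data` / `package` with a `resources` list); the function names are hypothetical and not part of CKAN.

```python
import json


def flatten_activity(activity):
    """Return a copy of an activity dict with facet-friendly top-level keys.

    The input shape is assumed, not taken from CKAN's documented schema.
    """
    flat = dict(activity)

    # Promote the username out of the nested user dictionary.
    user = activity.get("user") or {}
    flat["username"] = user.get("name")

    # Add a key listing every distinct, non-empty resource format.
    dataset = (activity.get("data") or {}).get("package") or {}
    formats = {
        resource.get("format")
        for resource in dataset.get("resources", [])
        if resource.get("format")
    }
    flat["resource_formats"] = sorted(formats)
    return flat


def to_json_line(activity):
    """Serialize one flattened activity as a single JSON line."""
    return json.dumps(flatten_activity(activity), sort_keys=True)
```

Each line written this way is a self-contained JSON object, so a log shipper can pick up the file without knowing anything about CKAN.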

dkelsey commented 6 years ago

I've got some skills and experience with ELK.
For fun I transformed our catalogue and loaded it into ELK, using create_date as the timestamp. I also transformed orgs... I'm forgetting what specifically I did. I created a couple of timelines (Timelion) and some viz's. @davidread just point me to an event stream and I'll get it into ELK and share what I did so people can build on it. A handful of visualizations you'd want to see would help.

davidread commented 5 years ago

I've started a repo here: https://github.com/davidread/ckanext-analytics

It's got a simple script that exports Activity Stream as JSON lines, to play about with.

@dkelsey I'd be very happy to get your feedback - I'm not clear if JSON is the way to go with this or whether Kibana and co work better with a flat structure and we should work to export as CSV.
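For comparison, here is a rough sketch of what the flat CSV route could look like. It is not what the repo's script does; the dotted column names, the `;` separator for lists, and the function names are all assumptions made for illustration.

```python
import csv
import io


def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # Lists have no natural column form; join items as strings.
            flat[name] = ";".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


def activities_to_csv(activities):
    """Render a list of activity dicts as CSV text with a shared header."""
    rows = [flatten(activity) for activity in activities]
    # Use the sorted union of all keys so every row has the same columns.
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The trade-off is visible in the list handling: CSV forces nested values such as resource lists into a single string column, whereas JSON lines keep them as real lists.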

jqnatividad commented 5 years ago

Hi @davidread! What's the status of ckanext-analytics? Now that your Activity Stream work is further along, perhaps we can revisit this?

davidread commented 5 years ago

@jqnatividad ckanext-analytics is a proof of concept. I've not done anything with it since last summer. The activity stream is now a bit more robust (https://github.com/ckan/ckan/pull/4626) and saves the full dataset dict (https://github.com/ckan/ckan/pull/3972), so this would be a great time to revisit this work. tbh my clients aren't pushing on this at the moment, so very happy to pass the baton to you and see where it leads.

loleg commented 1 year ago

This is a great idea, and I am thinking it might also be related to #211 since data loading pipelines could also be used on CKAN's internal streams.