IQSS / dataverse

Open source research data repository software
http://dataverse.org

User usage statistic #2729

Closed pengchengluo closed 7 years ago

pengchengluo commented 8 years ago

As far as I know, Dataverse doesn't have a usage statistics function and doesn't log any user usage information, such as who downloads the data or who views dataset and dataverse pages. However, this is a very important function for data providers. At our university, the faculty who provide datasets are eager to know who uses the data, when they download it, and how they use it. Usage information will help data providers understand their users.

So we wonder whether Dataverse will provide such a function in a future release. We hope it can. For now, we have to do it ourselves: log some user behavior information, index it in Elasticsearch, and present it on a web page.

bencomp commented 8 years ago

Dataverse has Guestbooks that log downloads, and Google Analytics support is built in. I argued for alternatives like Piwik in #1594. There is a table in the database that logs user actions, called the action log (after my request in #1532).

Logging to the server log can be improved, though, as requested in #2575.

Also related: #568, #2485.

(edited to add a note about guestbooks and more issue references.)

pdurbin commented 8 years ago

The term we often use is "metrics" and here are some related issues: #1971 #2101 #2417

As @bencomp mentioned, for downloads and uh, explorations anyway, you could look at the "guestbookresponse" table:

dvndb=> select id,downloadtype,name,datafile_id, dataset_id from guestbookresponse;
 id | downloadtype |      name       | datafile_id | dataset_id 
----+--------------+-----------------+-------------+------------
  1 | Download     | *********       |          18 |         16
  2 | Download     | *************** |          86 |         82
  3 | Download     | Guest           |          88 |         82
  4 | Download     | Guest           |          86 |         82
  5 | Download     | ************    |          51 |         49
  6 | Download     | Guest           |          91 |         90
  7 | Explore      | Guest           |          91 |         90
  8 | Download     | Guest           |          91 |         90
  9 | Download     | Guest           |          91 |         90
 10 | Download     | Guest           |          91 |         90
 11 | Explore      | Guest           |          91 |         90
 12 | Download     | Guest           |          91 |         90
 13 | Download     | Guest           |          91 |         90
 14 | Download     | Guest           |          92 |         90
 15 | Explore      | Guest           |          91 |         90
 16 | Download     | Guest           |          92 |         90
 17 | Explore      | Guest           |          91 |         90
 18 | Download     | Guest           |          91 |         90
 19 | Explore      | Guest           |         107 |        101
 20 | Explore      | Guest           |          91 |         90
 21 | Download     | Guest           |          91 |         90
 22 | Download     | Guest           |          92 |         90
 23 | Download     | Guest           |          91 |         90
 24 | Download     | Guest           |          70 |         69
(24 rows)
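To turn rows like these into per-dataset usage counts, a grouped query is one option. The following is a minimal sketch, not part of Dataverse: it uses an in-memory SQLite database as a stand-in for the PostgreSQL `dvndb` database, models only the columns shown in the query above, and loads a few of the rows from that output. The function name `usage_by_dataset` is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for the real PostgreSQL database, and only
# the columns visible in the query above are modeled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE guestbookresponse ("
    "id INTEGER PRIMARY KEY, downloadtype TEXT, name TEXT, "
    "datafile_id INTEGER, dataset_id INTEGER)"
)

# A subset of the rows shown in the psql output above.
sample_rows = [
    (3, "Download", "Guest", 88, 82),
    (4, "Download", "Guest", 86, 82),
    (6, "Download", "Guest", 91, 90),
    (7, "Explore", "Guest", 91, 90),
    (14, "Download", "Guest", 92, 90),
    (19, "Explore", "Guest", 107, 101),
]
conn.executemany(
    "INSERT INTO guestbookresponse VALUES (?, ?, ?, ?, ?)", sample_rows
)


def usage_by_dataset(connection):
    """Count guestbook responses per dataset and response type."""
    query = (
        "SELECT dataset_id, downloadtype, COUNT(*) "
        "FROM guestbookresponse "
        "GROUP BY dataset_id, downloadtype "
        "ORDER BY dataset_id, downloadtype"
    )
    return connection.execute(query).fetchall()


usage = usage_by_dataset(conn)
for dataset_id, downloadtype, total in usage:
    print(dataset_id, downloadtype, total)
```

The same `GROUP BY` statement should work unchanged in psql against the real table, since it only uses standard SQL.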

Right, you enable access logs for Glassfish and use a tool to generate stats. I'd actually be very interested in your script to put this into Elasticsearch and what sort of queries you do, and how you present this on a web page @pengchengluo ! Maybe we could use it on http://dataverse.org as part of #2417.
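As a rough illustration of the access-log route, the sketch below counts successful requests per path. It assumes lines in a Common Log Format-like layout; the actual Glassfish access log format is configurable and may differ, so the regular expression would need adjusting. The sample lines, the paths in them, and the function name `count_successful_requests` are made up for the example.

```python
import re
from collections import Counter

# Assumption: Common Log Format-style lines. The real Glassfish format is
# configurable, so this pattern is illustrative only.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def count_successful_requests(lines):
    """Return a Counter of request paths that were answered with HTTP 200."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "200":
            counts[match.group("path")] += 1
    return counts


# Hypothetical sample lines; addresses come from the documentation range.
sample_log = [
    '192.0.2.1 - - [03/Dec/2015:19:49:51 -0500] "GET /api/access/datafile/91 HTTP/1.1" 200 1024',
    '192.0.2.2 - - [03/Dec/2015:19:50:02 -0500] "GET /api/access/datafile/91 HTTP/1.1" 200 1024',
    '192.0.2.3 - - [03/Dec/2015:19:50:10 -0500] "GET /api/access/datafile/999 HTTP/1.1" 404 0',
    '192.0.2.1 - - [03/Dec/2015:19:51:00 -0500] "GET /dataset.xhtml?id=90 HTTP/1.1" 200 2048',
]

path_counts = count_successful_requests(sample_log)
for path, total in path_counts.most_common():
    print(total, path)
```

Counts like these only identify IP addresses, not users, which is why the guestbook and action log tables remain the better source for "who downloaded what".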

@bencomp also mentioned Google Analytics. @eaquigley has figured out how to tell what people are searching on.

pdurbin commented 8 years ago

Oh, and I imagine some of this will be useful on a dashboard for superusers: #840

That said, I understand that this issue has a focus on the individual researchers. They'd like to know who is downloading their data and how it's being used. @pengchengluo maybe you can give us feedback on the Guestbook feature: http://guides.dataverse.org/en/4.2.1/user/dataverse-management.html#dataset-guestbooks . I'm sure some improvements could be made.

pengchengluo commented 8 years ago

Thanks to @pdurbin and @bencomp for the help! The actionlogrecord and guestbookresponse tables do record some of the information we need.

However, in my opinion, it would be more efficient to index the data in a search engine such as Elasticsearch. As the log data grows, the load on the database will increase, which will affect the performance of other functions. Elasticsearch scales horizontally and can handle large-scale data, so it is a good choice as the log store and analysis engine.

I record user behavior such as viewing a dataverse, viewing a dataset, downloading a file, requesting to join an explicit user group (which we implemented in the web interface), accepting a user's request, rejecting a user's request, and so on. Each log entry contains basic information such as event type, IP address, timestamp, referrer, and user ID, plus other useful information such as dataverse ID, dataset ID, and group ID. The log information is sent to Elasticsearch in real time using the JestClient. Dataverse admins can view and search these log events in the web interface.
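To make the shape of such a log event concrete, here is a minimal sketch of an event document and of the newline-delimited body that the Elasticsearch bulk API accepts. It is not the implementation described above, which uses the Java JestClient: the field names, the index name `usage-log`, and the helper names `build_event` and `to_bulk_payload` are all assumptions, and nothing is sent over the network.

```python
import json
from datetime import datetime, timezone


def build_event(event_type, ip_address, user_id, referrer=None, **ids):
    """Build one usage event; field names are illustrative assumptions."""
    event = {
        "event_type": event_type,
        "ip_address": ip_address,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "referrer": referrer,
        "user_id": user_id,
    }
    # Optional identifiers such as dataverse_id, dataset_id, group_id.
    event.update({key: value for key, value in ids.items() if value is not None})
    return event


def to_bulk_payload(events, index_name="usage-log"):
    """Serialize events as bulk-API NDJSON: an action line, then the document."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(event))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


events = [
    build_event("download_file", "192.0.2.1", "user1", dataset_id=90, datafile_id=91),
    build_event("view_dataset", "192.0.2.2", "guest", referrer="https://example.org", dataset_id=90),
]
payload = to_bulk_payload(events)
print(payload)
```

Depending on the Elasticsearch version, the action line may also need a `_type` entry. Batching events this way trades the real-time delivery described above for fewer requests, which may or may not suit a given installation.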

The following is an example of dataverse view statistics:

[Image: pkudvn]

The following is an example of datafile download statistics:

[Image: pkudvn2]

posixeleni commented 8 years ago

:+1:

eaquigley commented 8 years ago

Hi @pengchengluo this is looking really cool! Are you planning on submitting a pull request for this so we can test it with the source code? Would love to be able to get this onto one of our test servers so I can play around with it and see what the user experience is like. Thanks for doing this!

pdurbin commented 8 years ago

@pengchengluo I agree completely with @eaquigley that your visualizations are fantastic! Please do let us know if you're interested in contributing some code!


Meanwhile, I just heard about @metabase at https://medium.com/@metabase/why-we-picked-clojure-448bf759dc83 (someone tweeted it) which prompted me to go listen to https://changelog.com/182/ with @tlrobinson and @salsakran and it seems really cool! I threw it on https://demo.dataverse.org (it only took a few minutes; it's just a jar to start, like Solr) and started playing around with the guestbookresponse table:

[Image: metabase_-_2015-12-03_19 49 51]

Metabase seems very promising! More at http://metabase.com and I highly recommend the blog post and podcast episode above.

pengchengluo commented 8 years ago

Hi @pdurbin, the faculty at our university are interested in who downloads their data and in detailed information about the downloaders. Google Analytics only collects anonymous usage information, so we added this usage statistics function.

@eaquigley , @pdurbin We are very glad to share our code!

pdurbin commented 8 years ago

> Are you planning on submitting a pull request for this so we can test it with the source code? Would love to be able to get this onto one of our test servers so I can play around with it and see what the user experience is like.

@eaquigley it looks like @pengchengluo just made a pull request at #2818 . Did you have a test server in mind?

eaquigley commented 8 years ago

@pdurbin beta.dataverse.org would be the test server in mind since it is intended to be the test machine that shows new features and functionality.

bencomp commented 8 years ago

It would be interesting to see the performance under load: Elasticsearch can scale, but I think you need multiple machines for that. Something for locust?

Server capacity at DANS is low, so although this work is very interesting, I would probably want to hold off on setting all of this up.

eaquigley commented 8 years ago

@bencomp it would be interesting to set this up on one of our test machines (beta.dataverse.org) and run locust against it to see what happens to performance of the application.

pdurbin commented 7 years ago

Since Piwik is mentioned above, just a heads up that support for Piwik was just merged (pull request #3374).

Also, on today's community call we talked about http://dataverse.org/metrics and how it's pointed at https://dataverse.harvard.edu but the code that powers it ( https://github.com/IQSS/miniverse ) can be pointed at any Dataverse installation. Here's a link to the notes: https://docs.google.com/document/d/1Bvxg8NxU3LV0yRBp5X-qOU1u7Ede7-HHDtwjuHPTJyc/edit?usp=sharing

pdurbin commented 7 years ago

I'm closing this now that people can install https://github.com/IQSS/miniverse . If anyone objects, please let me know.

pdurbin commented 6 years ago

Please note that there is some activity regarding gathering metrics at #4169.