dgplug / infrastructure


Planet blog feed analysis #4

Open shakthimaan opened 6 years ago

shakthimaan commented 6 years ago

A report is required for dgplug students and planet.dgplug.org to answer:

  1. The number of posts per month
  2. The interval between posts per user
  3. The users who have not posted for over a month
  4. The number of words per blog post per user
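As a rough illustration of the four questions above, here is a minimal sketch that computes each metric from an in-memory list of posts. The record layout `(author, published, text)`, the sample data, and all function names are assumptions made for this example; nothing here is taken from the planet's actual code.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Hypothetical sample records: (author, published, text)
POSTS = [
    ("alice", datetime(2018, 6, 1), "first post about git"),
    ("alice", datetime(2018, 6, 11), "second post with a few more words"),
    ("bob", datetime(2018, 4, 20), "hello planet"),
]


def posts_per_month(posts):
    """1. Count posts per calendar month, keyed as 'YYYY-MM'."""
    return Counter(published.strftime("%Y-%m") for _, published, _ in posts)


def intervals_per_user(posts):
    """2. Gaps between consecutive posts of each author."""
    by_user = defaultdict(list)
    for author, published, _ in posts:
        by_user[author].append(published)
    result = {}
    for author, dates in by_user.items():
        dates.sort()
        result[author] = [later - earlier for earlier, later in zip(dates, dates[1:])]
    return result


def inactive_users(posts, now, threshold=timedelta(days=30)):
    """3. Authors whose latest post is older than the threshold (assumed 30 days)."""
    latest = {}
    for author, published, _ in posts:
        if author not in latest or published > latest[author]:
            latest[author] = published
    return sorted(a for a, last in latest.items() if now - last > threshold)


def words_per_post(posts):
    """4. Whitespace-separated word count of every post, grouped by author."""
    counts = defaultdict(list)
    for author, _, text in posts:
        counts[author].append(len(text.split()))
    return dict(counts)
```

Note that an author who has never posted does not show up in `inactive_users`; catching those would need a separate list of expected authors.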

If the information can be fed into a database periodically using an application container, a Grafana dashboard can be built on top of it.

farhaanbukhsh commented 6 years ago

This seems really interesting. I have a little experience with Grafana, so let me do a setup and let's see how we can best visualize it.

farhaanbukhsh commented 6 years ago

So I tried setting up Grafana and was able to do it with the Docker image that Grafana provides. I am thinking of using feedparser and pointing it at the GitHub raw URLs of the planet pages [1] and [2]. For now, we could run this script as a cron job to generate the data. I have not explored the data source part yet; a simple MySQL or Postgres database would do, but what I really loved and would like to use here is InfluxDB [3].

Once the data is captured, running queries over it should not be very difficult. My only concern is finding a neat way to get the data for each blog and populate InfluxDB incrementally. For example, when a new blog post is published, I don't want to fetch all the information again; I only want the new post.
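One possible way to get the incremental behaviour described above is to remember the timestamp of the newest post already stored and skip everything at or before it on the next run. The sketch below assumes each entry has already been reduced to a dict with `id` and `published` (a `datetime`) keys; the function names and the dict layout are hypothetical and not part of feedparser or InfluxDB.

```python
from datetime import datetime


def select_new(entries, last_seen):
    """Return entries published strictly after last_seen, oldest first.

    last_seen is None on the very first run, in which case everything is new.
    """
    fresh = [e for e in entries if last_seen is None or e["published"] > last_seen]
    return sorted(fresh, key=lambda e: e["published"])


def advance_watermark(entries, last_seen):
    """Return the newest timestamp seen so far, to be persisted between runs."""
    stamps = [e["published"] for e in entries]
    if last_seen is not None:
        stamps.append(last_seen)
    return max(stamps) if stamps else None


# Hypothetical feed snapshots from two consecutive cron runs.
first_run = [
    {"id": "post-1", "published": datetime(2018, 6, 1)},
    {"id": "post-2", "published": datetime(2018, 6, 5)},
]
second_run = first_run + [{"id": "post-3", "published": datetime(2018, 6, 9)}]

watermark = advance_watermark(select_new(first_run, None), None)
only_new = select_new(second_run, watermark)
```

A timestamp watermark is simple but can miss posts that appear later with a backdated publish time, so it may be worth pairing it with a check on the post id.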

I am thinking about writing a service that listens for such events. Frankly, with Grafana I feel the visualization is taken care of; the data collection part is the challenge here.

Schubisu commented 6 years ago

@farhaanbukhsh I'm not sure I understand that correctly. When using feedparser to answer the questions from @shakthimaan above, imho you would need to save the following fields:

If you check your db for the unique post id before inserting data, you won't get duplicates. We could also discuss linking multiple blogs of a single author, as this special case might occur more often. This would, however, require some manual editing of the db.
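A minimal sketch of the duplicate check, using SQLite from the Python standard library purely for illustration (the thread discusses MySQL, Postgres and InfluxDB). Instead of a separate lookup before each insert, it lets a primary key on the post id reject duplicates via `INSERT OR IGNORE`. The table name and columns are assumptions for this example, not an agreed schema.

```python
import sqlite3

# Hypothetical schema: the post id is the primary key, so it must be unique.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    post_id    TEXT PRIMARY KEY,
    author     TEXT NOT NULL,
    published  TEXT NOT NULL,
    word_count INTEGER NOT NULL
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def insert_post(conn, post_id, author, published, word_count):
    """Insert one post; return True if stored, False if the id already existed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?)",
            (post_id, author, published, word_count),
        )
    return cur.rowcount == 1
```

Calling `insert_post` twice with the same `post_id` stores the row once; the second call returns `False` and leaves the table unchanged.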