jdd-software commented 6 years ago

Hydrant - Web dashboard for Heat Detector

These are my ideas and suggestions for what would be nice for HD.

Background

HeatDetector currently classifies all comments on SO with regex (3 files: low, medium, high) and 3 NLP classifiers: OpenNLP, NaiveBayes and Perspective. Perspective is a Google API, while OpenNLP and NaiveBayes are built on two files, good comments and bad comments (approx. 4000 comments each).

The classifiers give a score from 0 to 1 (e.g. 0.91) for how bad the comment is; the regex files give a score depending on which file the matching regex is in (low, medium, high).

Together these produce a final score from 0 to 10 for how bad the comment is. Comments scoring 4 or higher can be reported in chat (6 is the default threshold).

To improve the bot, the key issues are to find good regexes to use, understand how the different classifiers perform (how much each should contribute to the final score), and improve the comments in the good and bad feeds (no duplicates, correctly classified, etc., avoiding problems such as "stupid" in the Perspective feed). Do note that the current feeds may need reviewing.
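The thread does not say how the individual scores are combined, so the following is only a sketch of one possible weighting. The weights in `REGEX_POINTS` and `CLASSIFIER_WEIGHT` and both function names are assumptions for illustration; only the 0-10 range, the minimum threshold of 4 and the default of 6 come from the description above.

```python
# Hypothetical weights: HeatDetector's real formula is not given in this thread.
REGEX_POINTS = {"low": 1.0, "medium": 2.0, "high": 4.0}
CLASSIFIER_WEIGHT = 2.0  # assumed: each classifier adds up to 2 points

MIN_THRESHOLD = 4      # lowest threshold that can be reported in chat
DEFAULT_THRESHOLD = 6  # default reporting threshold


def final_score(regex_level, naive_bayes, open_nlp, perspective):
    """Combine a regex level (or None) and three 0-1 scores into a 0-10 score."""
    score = REGEX_POINTS.get(regex_level, 0.0)
    score += CLASSIFIER_WEIGHT * (naive_bayes + open_nlp + perspective)
    return min(10.0, score)


def should_report(score, threshold=DEFAULT_THRESHOLD):
    """Return True when the score reaches the chat reporting threshold."""
    if threshold < MIN_THRESHOLD:
        raise ValueError("threshold must be 4 or higher")
    return score >= threshold
```

With these assumed weights, a comment that hits a "high" regex and scores 1.0 on all three classifiers is capped at 10.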

Web dashboard

Data to send

Comments data: comment id, creation date, body, link, user id, user reputation, comment, final score, regex that was hit, regex score, Naive Bayes score, OpenNLP score, Perspective score.

Feedback data: comment id, user id, feedback (tp, fp, nc, tn)

Data to view

Related to user access (Public, Reviewer, Admin)

Public

Real-time feed of comments.
Filter on type of regex and score of the different classifiers (with a pie chart of feedback types) <- see Sentinel. The objective is to get an idea of how good the different regexes and classifiers are, e.g. what classifier score gives what %tp. A line graph of score vs. %tp and %fp would also be nice, or, for one specific regex, its %tp and %fp.
Search on keywords in body (search with LIKE queries)
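The "what score gives what %tp" statistic could be computed roughly as below. This is a minimal sketch, assuming the feedback data arrives as `(score, feedback)` pairs with scores in the 0-1 range; the function name and the choice of ten buckets are illustrative, not part of any existing API.

```python
from collections import Counter, defaultdict


def feedback_rates_by_bucket(rows, buckets=10):
    """Group (score, feedback) pairs into score buckets and return, per
    bucket index, the percentage of each feedback type (tp, fp, nc, tn)."""
    counts = defaultdict(Counter)
    for score, feedback in rows:
        # A score of exactly 1.0 falls into the last bucket.
        bucket = min(int(score * buckets), buckets - 1)
        counts[bucket][feedback] += 1

    rates = {}
    for bucket in sorted(counts):
        total = sum(counts[bucket].values())
        rates[bucket] = {
            kind: 100.0 * n / total for kind, n in counts[bucket].items()
        }
    return rates
```

The per-bucket percentages are the points a score vs. %tp/%fp line graph would plot; the same grouping keyed on regex instead of score bucket would give the per-regex figures.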

Reviewer

View comments with no feedback
View current comments feed (good/bad)
Send feedback to comments

Admin

Invalidate feedback (set the correct feedback); this only makes sense if the comment is marked for inclusion in a feed.
View and filter on user and user reputation.
Manage current good and bad feed (remove existing from feed and add new from comments feed).

It is probably better to replace pings such as @petter with @xxx, and not to show the user who posted the comment unless the viewer has admin access.
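Replacing pings can be done with a single regular expression. The sketch below assumes a ping is an `@` followed by word characters and not preceded by a word character (so email addresses are left alone); the function name is hypothetical.

```python
import re

# Assumed ping shape: "@" + word characters, not preceded by a word character.
PING = re.compile(r"(?<!\w)@\w+")


def obfuscate_pings(body):
    """Replace every @username ping in a comment body with @xxx."""
    return PING.sub("@xxx", body)
```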

rjrudman commented 6 years ago

I've got a mockup for how I'd imagine HeatDetector should work. However, I've tried to keep it as generic as possible to support any bot we need to hook up. Here's what I've got so far (all fake data, of course):

rjrudman commented 6 years ago

If people are okay with the above design, and available information, I think I'm happy to move forward with implementing the requirements for bots. cc @ArcticEcho @Bhargav-Rao

As an MVP, I'd like to get out the above dashboard, and the endpoints for bot authentication (still debating the way we're going to authenticate bots via the API) & reporting. Once that's all in place, we can hook it all up, and then slowly implement additional features (real time feeds, statistics, etc).

Authorization and authentication are all set up (authentication is done via StackExchange), so it shouldn't be an issue to show different content depending on the user's role.

jdd-software commented 6 years ago

@rjrudman Seems very nice, my additional suggestions are:

Have an interface where you can see multiple posts (like https://sentinel.erwaysoftware.com/review), so there is no need to click through every post when reviewing; instead, multiple posts are shown on a single page.
I will continue to stress the function of handling machine learning feeds (I think this can be useful for many bots, e.g. on Natty we have also been thinking of adding machine learning), which would add admin functions such as "Add to good feed" and "Add to bad feed".

There are normally 2 feeds (good and bad); theoretically there could be more (but if needed you could limit it to 2).

The main problem with feeds is that once you have collected data you need to:

Review the feeds (remove content that does not fit). This is a very labor-intensive process, so if multiple people (admins) can work online it would be a big step towards improving the process. I imagine 2 actions on each line in a feed (approve, remove), so you can filter a feed to view what has not been approved/reviewed yet.
Add to the feed as new data arrives (selecting the best data and not duplicating content already in the feed), hence a quick search on a phrase to see what is already present, and an "add to feed" button (good/bad).
Fix feeds. Machine learning has the problem that, often by chance, certain words are present on only 1 side of the feed. For example, I had this problem with "downvote": it was only present in the bad feed, and most machine learning systems will trigger whenever a sentence contains such a word. The solution is to find "good" comments containing the word and add them to the other side of the feed. Perspective has a similar problem with "stupid". To do this efficiently, you search on a phrase or word and the system displays all matching content in the current feed (both good and bad); this way you can see whether there is a problem. Then you search the new comments to find content to add to the feed.

Finally, it would also be nice to have information on what happened to the report (e.g. MetaSmoke's "deleted within x minutes").
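The one-sided word check described above can be sketched as follows. This assumes each feed is simply a list of comment strings; the function names and the `min_hits` cut-off are hypothetical choices for illustration.

```python
import re


def feed_hits(word, good_feed, bad_feed):
    """Return the comments in each feed that contain the word (whole word,
    case-insensitive), so a reviewer can see both sides at once."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    good_hits = [c for c in good_feed if pattern.search(c)]
    bad_hits = [c for c in bad_feed if pattern.search(c)]
    return good_hits, bad_hits


def is_one_sided(word, good_feed, bad_feed, min_hits=3):
    """True when the word appears at least min_hits times in one feed and
    never in the other, which is the pattern that biases a classifier."""
    good_hits, bad_hits = feed_hits(word, good_feed, bad_feed)
    only_one_side = (len(good_hits) == 0) != (len(bad_hits) == 0)
    return only_one_side and max(len(good_hits), len(bad_hits)) >= min_hits
```

Once a good comment containing the word is added to the good feed, `is_one_sided` stops flagging it.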

Bhargav-Rao commented 6 years ago

Looks beautiful, just a few small updates:

Let the logo be in the header. Something like how the sobotics logo is on Redunda.
There should be a disclaimer that the data displayed is from the SE API.
It would be nice to see which reason tripped the report. Queen reports a value for each of the 4 types (regex, OpenNLP, NaiveBayes and Perspective), but it highlights the triggering reason in bold. I'm not sure how you can do this generically for Higgs, but it would be needed for Hydrant.

rjrudman commented 6 years ago

@jdd-software Definitely planning on having a similar /review system!

As far as the machine learning goes, I believe it should be covered. Here's the workflow I envision for bots:

An admin registers a bot
The bot, on startup, authenticates, and hits /Bots/RegisterFeedbackTypes

This API allows a bot to configure what feedback types are available for the reports it creates. Feedback types are unique per bot and name. A request might look like:

{
  "feedbackTypes": [
    {
      "name": "True Positive",
      "colour": "green",
      "icon": "string",
      "isActionable": true,
      "requiredActions": 0
    },
    {
      "name": "False Positive",
      "colour": "red",
      "icon": "string",
      "isActionable": true,
      "requiredActions": 0
    },
    {
      "name": "Needs editing",
      "colour": "yellow",
      "icon": "string",
      "isActionable": false,
      "requiredActions": 0
    }
  ]
}

If the feedback type already exists, it'll be updated. Otherwise, it'll be created. 'IsActionable' is yet to be decided how it'll work, but I'm imagining it to allow Higgs to determine whether or not to take the feedback into account when marking a report as resolved. 'RequiredActions' is the number of that action required to resolve a report (which may make 'IsActionable' redundant).
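The create-or-update behaviour can be illustrated with an in-memory registry keyed on bot and name. This is only a sketch of the semantics described above, not the actual Higgs implementation; the class name and the `bot_id` argument are assumptions.

```python
class FeedbackTypeRegistry:
    """In-memory stand-in for the store behind /Bots/RegisterFeedbackTypes."""

    def __init__(self):
        self._types = {}  # (bot_id, name) -> feedback type definition

    def register(self, bot_id, feedback_types):
        """Upsert each feedback type; return the names created and updated."""
        created, updated = [], []
        for feedback_type in feedback_types:
            key = (bot_id, feedback_type["name"])
            if key in self._types:
                updated.append(feedback_type["name"])
            else:
                created.append(feedback_type["name"])
            self._types[key] = dict(feedback_type)
        return created, updated

    def get(self, bot_id, name):
        """Look up a feedback type by bot and name, or None if unknown."""
        return self._types.get((bot_id, name))
```

Because the key includes the bot, two bots can each register their own "True Positive" without overwriting one another.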

Next, the bot reports a post:

{
  "title": "Some reported post",
  "contentUrl": "https://sobotics.org",
  "detectionScore": 99.9,
  "content": "hey, this is offensive @rob",
  "obfuscatedContent": "hey, this is offensive @xxx",
  "authorName": "Rob",
  "authorReputation": 999999999,
  "contentCreationDate": "2018-02-21T08:38:02.666Z",
  "detectedDate": "2018-02-21T08:38:02.666Z",
  "reasons": [
    {
      "reasonName": "opennpl",
      "confidence": 99.9
    },
    {
      "reasonName": "regex - Rob",
      "confidence": 100
    }
  ],
  "allowedFeedback": [
    "True Positive", "False Positive", "Needs editing"
  ],
  "attributes": [
    {
      "key": "string",
      "value": "string"
    }
  ]
}

Here, they pass the names of the feedback types supported for this report. Each feedback type will render one of the buttons above (with the colour specified).

The bot will have access to an API to query statistics about feedback types, which would allow it to feed information into its machine learning. I'd imagine this would be external to Higgs - it's up to the bot to decide the weighting here.

Would this approach be suitable for what you need?

As for 'what happened to that post', I'm tending towards letting the bot update us. Higgs could do it for posts and comments, but it might become complicated when other types of content are reported. Would an API which updates an existing report (maybe appending 'Attributes') be sufficient here?

rjrudman commented 6 years ago

@Bhargav-Rao

With the logo, I'm not too sure. Each bot (potentially) has its own logo; the picture above is for a report specific to a bot. It won't work for other pages, for example, viewing a list of reports (which could come from different bots). I'm also considering how to display a report if multiple bots reported the post.
I'll add that in (maybe in the footer?)
Good point. I'll add that as part of the /RegisterPost/ API for the bot. In this case, we'll sort the reasons by whether each one triggered or not, and then by confidence.
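The ordering described in the last point might look like this. The `triggered` flag is a hypothetical field, since the report payload shown earlier only has `reasonName` and `confidence`; the sketch assumes triggered reasons come first and higher confidence comes first within each group.

```python
def sort_reasons(reasons):
    """Order reasons for display: triggered ones first, then by descending
    confidence. "triggered" is an assumed field, defaulting to False."""
    return sorted(
        reasons,
        key=lambda r: (not r.get("triggered", False), -r["confidence"]),
    )
```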

rjrudman commented 6 years ago

Also, I've created a chatroom here so that we don't have to pollute SOBotics with Higgs discussions.

jdd-software commented 6 years ago

@rjrudman Related to the JSON:

The first setup JSON will probably be sent from Postman or, in the future, from a config interface (it does not seem directly related to bot development).
I think you should add an id (long) to the JSON; all reports will have a specific id and it can be used to trace back communication. I also have a post-processed comment (content): when a comment arrives, I process it (remove pings, remove intentional repetition, etc.), and machine learning/regex is run on the final content. However, I'm not sure it makes sense to add this to the JSON; I'm mostly just informing you.
Related to machine learning, initially it would be enough to be able to download the raw feed from the web interface as a txt file.
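The post-processing step mentioned above (remove pings, remove intentional repetition) is not spelled out in the thread. The sketch below is one guess at it: it drops `@username` pings, shortens any character repeated three or more times to two, and normalises whitespace. The function name and both rules are assumptions, not HeatDetector's actual logic.

```python
import re

PING = re.compile(r"(?<!\w)@\w+")   # assumed ping shape
REPEAT = re.compile(r"(.)\1{2,}")   # same character three or more times


def normalize_comment(body):
    """Return the text the classifiers and regexes would run on."""
    text = PING.sub("", body)         # remove pings
    text = REPEAT.sub(r"\1\1", text)  # "soooo" -> "soo"
    return " ".join(text.split())     # collapse whitespace
```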

rjrudman commented 6 years ago

@jdd-software Won't matter to the API where it comes from - the only important thing is it's signed by the key registered to the bot

With the ID: The server will respond with the post ID from /RegisterPost/. From there, posts can be queried/modified by that ID. I'm not sure I follow what you mean by the post-elaborated comment. Do you mean the case where the reported content changes?

The dumps should be no problem at all; and if we're really in a bind, we can always just query the database manually :)

jdd-software commented 6 years ago

Further explanation in chat