cassandra-project / platform

The CASSANDRA platform
Apache License 2.0

Distribution visualisation issues #223

Closed diou closed 11 years ago

diou commented 11 years ago

Some distribution normalization issues:

  1. When displaying histogram distributions, a histogram view is preferred.
  2. When displaying a normal distribution, the range of view should be:

[max(0, mean - 8 * std), min(1440, mean + 8 * std)]

so that the distribution is visible.

kyrcha commented 11 years ago

@diou Except for the number of times distribution?

diou commented 11 years ago

Yes, sorry, you are right. For number of times only histogram should be possible.

fgiannar commented 11 years ago

Just to understand this: is it related to what we discussed in #141? Chart type depending on the user's plot type selection?

kyrcha commented 11 years ago

For number of times the graph should always be a bar chart, but histogram goes with values and the other distributions with parameters. For start time and duration distributions, again histogram goes with values and the other distributions with parameters, but in this case the graphs also change: for start time and duration, histogram goes with a bar chart and the other distributions with a line graph, like it is now. @diou Correct?

diou commented 11 years ago

Yes, as long as the bar chart displays correctly for the large number of values (e.g. for start time or if duration values exceed 200 minutes). If there's a problem when the number of bars is large we should again use lines.

To sum up, here are the requirements for displaying distributions:

  1. When displaying histogram distributions, a bar chart is preferred. If the bar chart won't display correctly for a large number of values (e.g. 200, as mentioned above) then you can use lines. This is up to you.
  2. Duration and start time distributions can be of any type (Histogram, Uniform, Normal or Gaussian Mixture).
  3. Number of times can only be a histogram.
  4. When displaying a normal distribution, the range of view should be: [max(0, mean - 8 * std), min(1440, mean + 8 * std)]
  5. When the user selects "Histogram" then the text box below should be "Values". When the user selects any other type of distribution, the text box below should be "Parameters".

I hope I'm not forgetting something. If something is unclear or you don't agree please let me know.
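
As a rough illustration of requirements 2, 3 and 5 above, the rules could be expressed as a small lookup. This is only a sketch: the property keys (`start_time`, `number_of_times`) and the function names are hypothetical and are not taken from the platform code; only the distribution type names and the "Values" / "Parameters" labels come from the list above.

```python
# Hypothetical sketch of requirements 2, 3 and 5; names are illustrative only.
ALL_TYPES = {"Histogram", "Uniform", "Normal", "Gaussian Mixture"}

# Which distribution types each property may use (requirements 2 and 3).
ALLOWED_TYPES = {
    "duration": ALL_TYPES,
    "start_time": ALL_TYPES,
    "number_of_times": {"Histogram"},  # discrete integer counts: histogram only
}


def is_allowed(prop: str, distr_type: str) -> bool:
    """Return True if `distr_type` may be selected for property `prop`."""
    return distr_type in ALLOWED_TYPES.get(prop, set())


def input_label(distr_type: str) -> str:
    """Label of the text box shown below the type selector (requirement 5)."""
    return "Values" if distr_type == "Histogram" else "Parameters"
```

For example, `is_allowed("number_of_times", "Normal")` is `False`, and `input_label("Uniform")` is `"Parameters"`.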

kyrcha commented 11 years ago

About item 3: are you talking about the chart type or also about the distribution type? Could I have, in the number of times distribution, a normal distribution with mean: 2 and std: 0.5? Or should it be only Histogram type with values [0.5, 0.4, 0.1, 0, 0]?

diou commented 11 years ago

The latter, I think the distribution type should only be a histogram. This is because number of times is a discrete distribution of integer values and therefore there's not much point in modelling it as a continuous parametric distribution.

kyrcha commented 11 years ago

OK. @fgiannar I will assign you this task. If you need anything server-side let us know.

fgiannar commented 11 years ago

Just to confirm, the following range applies to y-axis values, correct? [max(0, mean - 8 * std), min(1440, mean + 8 * std)]

diou commented 11 years ago

The y-axis values are probabilities. The above values refer to the x-axis.

kyrcha commented 11 years ago

They are referring to the x axis: So for example if mean is 100 and std is 10 then it should be in [20, 180]. If mean is 100 and std is 20 then in [0, 260]. (I see @diou has already answered, but since I wrote it :))
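
The x-axis rule and the two worked examples above can be checked with a few lines of code. This is a minimal sketch, not the platform's implementation: the function name and the `k`, `lower` and `upper` parameters are assumptions, with defaults taken from the formula in this thread (8 standard deviations, clamped to [0, 1440]).

```python
def normal_view_range(mean, std, k=8, lower=0, upper=1440):
    """Return the (x_min, x_max) view range for a normal distribution.

    The range spans k standard deviations on each side of the mean and is
    clamped to [lower, upper], following the rule quoted in the thread.
    """
    x_min = max(lower, mean - k * std)
    x_max = min(upper, mean + k * std)
    return x_min, x_max


print(normal_view_range(100, 10))  # (20, 180)
print(normal_view_range(100, 20))  # (0, 260)
```

The second call shows the clamping: 100 - 8 * 20 is negative, so the lower bound falls back to 0.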

fgiannar commented 11 years ago

And one more thing, the small bar chart won't display correctly for a large number of values (>=100) (when clicking on it, the larger chart popup appears where the bars are displayed properly since there is more space for the chart). So how would you prefer to handle this? Have a filter and when the values are >= 100 and distr type is Histogram display line chart or use a line chart in all cases?

kyrcha commented 11 years ago

I am fine with the >= 100 case, but we have to see the look and feel first. Perhaps it could be >=50...

diou commented 11 years ago

I agree, you can set a threshold (e.g. 50, 100 whatever works) and if the number of values exceeds it, use a line.
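
One way to express the threshold rule discussed here is sketched below. The function name, the return values `"bar"` and `"line"`, and the default threshold of 100 are assumptions; the thread leaves the exact threshold open (50, 100, or whatever looks right).

```python
def chart_type(distr_type: str, n_values: int, threshold: int = 100) -> str:
    """Pick the chart type for the small distribution plot.

    Histograms use a bar chart unless they have `threshold` or more values,
    in which case the bars become unreadable and a line is used instead.
    All other distribution types use a line graph.
    """
    if distr_type == "Histogram" and n_values < threshold:
        return "bar"
    return "line"
```

With the default threshold, a six-value duration histogram is drawn as bars, while a histogram with one value per minute of the day (1440 values) is drawn as a line.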

diou commented 11 years ago

Also one more comment: The distributions are in the [0, 1] range while histograms are in the [0, 100] range (i.e. percentages). It would be nice to be consistent (in the descriptions too).

fgiannar commented 11 years ago

ok, thnx

fgiannar commented 11 years ago

When trying to make a PUT request in distributions, with data as follows: { _id: "50eaafe1e4b0e21868c64ce6", actmod_id: "5045f2b4e4b058c3f86c3301", description: "", distrType: "Histogram", name: "duration", values:[3, 4, 5, 6, 7, 8], parameters: [{mean:30, std:10}] }

I get the following exception:

{ "success": false, "errors": { "Exception": "Null" }, "message": "MongoQueryError: Cannot execute find query for collection: distributions with qKey={ \"_id\" : { \"$oid\" : \"50eaafe1e4b0e21868c64ce6\"}} and qValue={ \"actmod_id\" : 1 , \"description\" : 1 , \"distrType\" : 1 , \"name\" : 1 , \"values\" : 1 , \"parameters\" : 1}" }

Is it possible for the requests to contain parameters as well as values? If the distrType is "Histogram", then the values will be read, otherwise the parameters. This will allow the user to switch between distrTypes without having to re-enter values and parameters.

fgiannar commented 11 years ago

Btw, if this is not easy to implement, it's ok: I'll reset the corresponding field (e.g. if distrType = "Histogram", set parameters = []) before sending a request.
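
The client-side workaround described here could look roughly like the following. This is a sketch under assumptions: the function name is hypothetical, and it assumes the server accepts an empty list for the unused field, which the thread does not confirm. The field names (`distrType`, `values`, `parameters`) are the ones from the request shown above.

```python
def prepare_payload(distribution: dict) -> dict:
    """Return a copy of the distribution with the unused field emptied.

    For a Histogram only `values` is meaningful, so `parameters` is reset;
    for any other type only `parameters` is kept and `values` is reset.
    """
    payload = dict(distribution)  # shallow copy; the input is left untouched
    if payload.get("distrType") == "Histogram":
        payload["parameters"] = []
    else:
        payload["values"] = []
    return payload


request = {
    "distrType": "Histogram",
    "name": "duration",
    "values": [3, 4, 5, 6, 7, 8],
    "parameters": [{"mean": 30, "std": 10}],
}
print(prepare_payload(request)["parameters"])  # []
```

The drawback, as noted above, is that the user loses the previously entered parameters when switching types, because only one of the two fields is ever sent.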

fgiannar commented 11 years ago

I followed the above implementation (resetting the corresponding values based on the user's selection). Some questions before we can close this issue:

  1. @diou: "The distributions are in the [0, 1] range while histograms are in the [0, 100] range (i.e. percentages)." Where exactly would you like me to add this information? Maybe in the question-mark tooltip text shown when hovering over the parameters or values info?
  2. Would it be easy to convert the plot values server-side? (I am referring to multiplying those values by 100 when distrType != "Histogram".)

diou commented 11 years ago

You don't need to add an explanation. I would suggest to NOT use percentages in the histograms and to remove the '%' from the description. This is because these distributions are actually probability density functions and can have values > 1 and the use of percentages doesn't make sense.

@kyrcha: Perhaps the percentage values in the histograms are a server-side issue?

fgiannar commented 11 years ago

Ok, I'll remove the '%'. But still, when distrType is anything but Histogram, the y values should be multiplied by 100. I was wondering if this is easy to implement server-side. Please let me know so I can start working on it if necessary.

kyrcha commented 11 years ago

The values should not be multiplied by 100. We decided to keep them as probabilities and not as percentages. So the y-axis should be whatever values the user enters and the x-axis should start from 0 and be integer values for each bar. For example if the user enters [0.3, 0.2, 0.1, 0.5] the (x,y) points should be [(0,0.3), (1, 0.2), (2, 0.1), (3, 0.5)]
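
The mapping from entered values to plot points described above is just an enumeration starting at 0. A minimal sketch (the function name is hypothetical):

```python
def histogram_points(values):
    """Map histogram values to (x, y) points, with x = 0, 1, 2, ...

    The y values are used exactly as entered (probabilities, not percentages).
    """
    return list(enumerate(values))


print(histogram_points([0.3, 0.2, 0.1, 0.5]))
# [(0, 0.3), (1, 0.2), (2, 0.1), (3, 0.5)]
```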

fgiannar commented 11 years ago

ok clear, then I'll remove the conversion (x100) in the next commit, and if you think everything is ok we can close this.

diou commented 11 years ago

Just checked the code, these are NOT density functions (they are probability distributions) so please also remove the 'density' word as well. It should be 'Probability vs Duration'. The same also for the enlarged graph view.

Thanks!

kyrcha commented 11 years ago

@fgiannar one more thing as well...are you multiplying with 100 (in order to get percentages) the values we send in duration and start times? If yes, then remove the multiplication since we are switching to probabilities, if no then let us know since we need to change it server side. Thnx

diou commented 11 years ago

Uniform distribution is also not displayed (e.g. try [{"start":100, "end":200}] which is the example given in the manual)

diou commented 11 years ago

Also, for some reason when I insert a GMM duration, the x-axis display is cut off. E.g. try [{"w":0.6, "mean":350, "std": 10},{"w":0.4, "mean":420, "std":10}], which is cut off around 420.

diou commented 11 years ago

One way to correctly visualize the GMMs is to use the rule

low_i = max(0, mean_i - 8 * std_i)
high_i = min(1440, mean_i + 8 * std_i)

for each mixture component i and then use the bounds

[min(low_i), max(high_i)] where the min, max operators are over i.
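
Applied to the GMM example from the earlier comment, this rule can be sketched as follows. The function name and the keyword parameters are assumptions for illustration; the component keys (`w`, `mean`, `std`) are the ones used in the parameter examples in this thread.

```python
def gmm_view_range(components, k=8, lower=0, upper=1440):
    """Return (x_min, x_max) covering every component of a Gaussian mixture.

    Each component i gets its own clamped range [low_i, high_i]; the view
    range is the smallest interval that contains all of them.
    """
    if not components:
        raise ValueError("at least one mixture component is required")
    lows = [max(lower, c["mean"] - k * c["std"]) for c in components]
    highs = [min(upper, c["mean"] + k * c["std"]) for c in components]
    return min(lows), max(highs)


gmm = [{"w": 0.6, "mean": 350, "std": 10}, {"w": 0.4, "mean": 420, "std": 10}]
print(gmm_view_range(gmm))  # (270, 500)
```

Here the first component gives [270, 430] and the second [340, 500], so the view extends to 500 rather than stopping near the second mean at 420. With a single component the result is the same as the rule for a normal distribution.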

fgiannar commented 11 years ago

@kyrcha Yes, I was multiplying, so it'll be removed in the latest commit.

@diou Maybe I'm missing something, because I don't remember us talking about how to handle Uniform or GMM distributions (e.g. [{"start":100, "end":200}] is supported so far). So we have to clear this up.

diou commented 11 years ago

Yes, we didn't ask for this. However, there are two issues:

  1. GMMs are cut off in the x-axis, without following any specific pattern
  2. Uniform dists are not displayed.

Both these bugs are reproducible by the examples I gave in the above comments. We don't need to set bounds for GMM and uniform right now (it's not currently important) but these bugs should be fixed if possible.

Thanks

fgiannar commented 11 years ago

@diou One more question: Is it possible that in a Normal distribution the parameters are [{"w":0.6, "mean":350, "std": 10},{"w":0.4, "mean":420, "std":10}]? If yes, then the rule "[max(0, mean - 8 * std), min(1440, mean + 8 * std)]" is not applicable. Should we maybe switch to "[min(low_i), max(high_i)]" for all types except Histogram?

diou commented 11 years ago

After the latest update the x-axis labels are not visible (in both graph views)