Refine dataset preview page

weixuanfu commented 5 years ago

Add plots for features
Add description of metafeatures
OpenML example: https://www.openml.org/d/31
Useful tool to make non-interactive plots in python: seaborn
For interactive plots in JS: D3

hjwilli commented 5 years ago

After group discussion, we are going to use a tabbed interface for the datasets details page. This should provide the screen real estate necessary to add more visual analysis. We want to expand the page to provide some basic details about each of the dataset features, as well as tools that allow users to explore relationships in the data.

Rough draft image is here: pennai_datasets_rough1.pdf

A first-pass description of the new tabs:

Summary
- header with number of rows, columns, classes, the target column, file name, and upload date
- table with one row per feature that has the feature name, feature type, and a distribution graph (see openML). The type of graph will depend on the feature type.
Data Preview (very similar to existing page)
- header with file details
- table of first 100 rows of data
Metadata (very similar to existing page)
- table with columns for field, value, and description
Analysis (or Exploration?)
- Still working on the final design of this tab. It will have elements like pairwise plots and heatmaps.

The first UI task is to:

- update the current page to be tabbed
- add the 'Data Preview' and 'Metadata' tabs. The content of these tabs is similar to what the page currently has in the side-by-side 'File Details' and 'Metafeatures' panes.

weixuanfu commented 5 years ago

python-nvd3 nvd3

hjwilli commented 5 years ago

This issue may show how to display mpld3 figures in JavaScript: https://github.com/mpld3/mpld3/issues/128

Look for mpld3.draw_figure()

joshc0044 commented 5 years ago

~~have proper chart based on col type (loop in order)~~ & transparent background & color scheme
approximately style as openml
boxplot & violins

set of d3 examples https://www.d3-graph-gallery.com/

Made progress with creating stacked bar charts in terms of raw functionality (about 80% complete, still a few issues to iron out with the chart legend and all the styling).

Working with d3 is okay, but many of the tutorials/examples/resources are from various points in the library's development, and some key features explained in the examples have since changed. For the most part it's not too bad. Nevertheless, both Bill and Heather were right that my initial estimate for finishing this component was off; this might take a couple more days than I initially anticipated.

I need to continue learning d3 in order to properly leverage its features as I know the things I am trying to do are readily supported.

However, I do have a major concern about generating some of these visualizations in real time in the browser/client; depending on the typical dataset size (# of cols & rows) this may or may not be a problem. For reference, the Adult dataset with 48842 rows takes about 5 seconds to load all the boxplots, and generating the stacked bar plot should take a similar amount of time, maybe a bit more. It might be better to do the calculations off the browser/client in one of the docker containers.

joshc0044 commented 5 years ago

Another problem that might be of concern is JavaScript's number precision when doing calculations. I noticed a discrepancy between the boxplots for the banana dataset on openml and the ones I am generating with d3.

Here is the openml page for the banana dataset with boxplots - https://www.openml.org/d/1460

I'm using the calculations outlined here to create the local boxplot - https://www.d3-graph-gallery.com/graph/boxplot_basic.html

The results are different. I downloaded the csv from openml, and it generates the same slightly different boxplot as the old banana dataset. Here is the one that I am generating locally: [screenshots]

The only experience I have with boxplots is from an intro statistics course I took in college, so I don't know what's wrong. The formulas appear to be correct, so my guess is either an issue inherent to JavaScript or the d3 library functions used to generate the statistics. It could also be another mistake.

One last thing is that on openml there are some datasets where the visualizations are weird - https://www.openml.org/d/1492 & https://www.openml.org/d/1504. This depends on the type of data being used, but our site will essentially do the same thing. As long as this is a known limitation I am okay with it; I just want to bring some attention to this.

joshc0044 commented 5 years ago

The updated dataset preview page is just about done in terms of functionality; there are three different types of charts generated - boxplots, a regular bar chart & stacked bar charts with basic styling

Both bar charts are mildly interactive in that one is able to hover over the bars and see the value

Barring any issues/bugs (from my understanding the charts are correct, I fixed the legend labels) or significant design changes, the only thing left to do is change/add styling

I am testing this with the appendicitis_cat_ord.csv dataset; I don't know of other datasets with categorical/ordinal features at the moment.
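For reference, a stacked bar chart of a categorical feature needs one count per (feature value, target class) pair. The snippet below is only a minimal sketch of that tally in plain JavaScript; the helper name `countByFeatureAndClass`, the row shape, and the sample rows are assumptions made for illustration and are not taken from the PennAI code.

```javascript
// Hypothetical helper: count rows per (feature value, target class) pair.
// Returns Map(featureValue -> Map(classValue -> count)).
function countByFeatureAndClass(rows, featureKey, targetKey) {
  const counts = new Map();
  for (const row of rows) {
    const featureValue = String(row[featureKey]);
    const classValue = String(row[targetKey]);
    if (!counts.has(featureValue)) {
      counts.set(featureValue, new Map());
    }
    const perClass = counts.get(featureValue);
    perClass.set(classValue, (perClass.get(classValue) || 0) + 1);
  }
  return counts;
}

// Made-up rows, for illustration only.
const exampleRows = [
  { sex: "F", target: "0" },
  { sex: "F", target: "1" },
  { sex: "M", target: "1" },
  { sex: "M", target: "1" },
];
const stackedCounts = countByFeatureAndClass(exampleRows, "sex", "target");
```

Each inner map corresponds to one bar, with one segment per class. Doing this tally once per feature is also the kind of work that could be moved off the client if browser performance becomes a concern.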

joshc0044 commented 5 years ago

Further UI refinements & other tasks

~~order target class first & state target in parenthesis~~
~~center column content~~
~~horizontally separate rows with line/border~~
left justify columns in dataset preview page, center plot
~~unique color for each class~~
verify plotly.js does not send anything to 3rd party server

Update -

Here is a screenshot of the aforementioned UI changes. The exact positioning & sizing need to be fine-tuned, but I believe the general layout & format is close to what has been requested.

Compare generated statistics for boxplots in javascript d3 library with python pandas - apparent discrepancy in javascript
Investigate violin/histogram charts

Update -

Double checking statistics for the boxplot for the banana test dataset (https://www.openml.org/d/1460). In javascript, here are the stats for columns AT1 & AT2:

AT1

q1: -0.75325
median: -0.01525
q3: 0.782
interQuantileRange: 1.53525

Whiskers

min: -3.056125
max: 3.0848750000000003

AT2

q1: -0.914
median: -0.0372
q3: 0.8225
interQuantileRange: 1.7365

Whiskers

min: -3.5187500000000003
max: 3.42725

The formulas being used are

// Sort ascending so that d3.quantile(...) can be used.
let data_sorted = valByRowObj[tempKey].sort(d3.ascending);
// This produces an equivalent sorted list:
// let data_sorted_AB = valByRowObj[tempKey].sort((a, b) => a - b);

q1 = d3.quantile(data_sorted, 0.25);
median = d3.quantile(data_sorted, 0.5);
q3 = d3.quantile(data_sorted, 0.75);
interQuantileRange = q3 - q1;
// Whisker positions from the 1.5 * IQR rule.
min = q1 - (1.5 * interQuantileRange);
max = q3 + (1.5 * interQuantileRange);

This uses d3 to sort the data and generate the quantiles/median. Comparing these values against the same dataset loaded into python with pandas produces the same results with the above formulas.

The only remaining discrepancy is that the min/max values used for the whisker lines of the boxplot for the banana dataset on openml still differ from those in the boxplot I am producing: the ones on openml are about 1 less for each column respectively. I will continue to look into this final discrepancy for boxplots and start looking into creating violin plots/histograms.

[ ] create corresponding unit test to check values of statistics & create overall pull request

joshc0044 commented 5 years ago

I spent some time today trying to figure out what's going on with the whisker lines in the boxplots I am generating in the webpage, and compared them against plots/values generated with python pandas & matplotlib. Firstly, I did not realize that there are different kinds of boxplots that put the whisker lines at different values (I'm going off the wiki page for boxplots - https://en.wikipedia.org/wiki/Box_plot). I was under the impression that the whisker lines were always calculated according to the formula in the comment above. There are other formats where the whisker lines are the min/max values of the dataset; when the formula above is used to get the whisker lines, the resulting chart is apparently called a Tukey boxplot, and that is the plot I am trying to make. With that said, I believe I was not calculating the min/max values correctly before, which was, at least in part, causing the discrepancy.

Most of my effort today was spent looking at pandas and matplotlib to compare the statistics generated by those libraries with the values I am generating in the webpage/javascript; other than the min & max values for the whisker lines, all the other values appear to be correct. After looking at the matplotlib documentation for configuring the whisker lines (https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.boxplot.html), I changed how the min/max values are set. As I read the docs, the lower whisker never extends below the smallest value in the data, and likewise the upper whisker never extends above the largest. I then used the banana dataset to make a boxplot in python, and it matches the boxplots I am generating on the webpage.

with python and on the webpage

I believe both of these boxplots are essentially equivalent. However, I do not know what type of boxplot is being used on the openml website as that is different.

In short, the boxplot whisker lines are now only as small as the min value in the data or as large as the max. Before, if the calculated value went out of bounds, so to speak, the whisker lines would be grossly off; they now at least match the boxplots generated with pandas/matplotlib. I also tried a few of the other datasets: tokyo1, iris and german.

Going forward I am assuming that using the min/max values of the data as lower/upper bounds for the whisker lines is a suitable method.
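The clamping described above can be sketched as follows. This is only an illustration of the rule as stated in this comment, not the actual component code: the helper name `whiskerBounds` is hypothetical, the quartiles are the AT1/AT2 values posted earlier, and the data extremes are the ones listed in the unit-test values further down this thread.

```javascript
// Hypothetical helper: 1.5 * IQR whisker positions, clamped so they never
// extend past the smallest or largest value actually present in the data.
function whiskerBounds(q1, q3, dataMin, dataMax) {
  const iqr = q3 - q1;
  const lowerFence = q1 - 1.5 * iqr;
  const upperFence = q3 + 1.5 * iqr;
  return {
    min: Math.max(lowerFence, dataMin),
    max: Math.min(upperFence, dataMax),
  };
}

// Banana dataset columns: (q1, q3, smallest value, largest value).
const at1Whiskers = whiskerBounds(-0.75325, 0.782, -3.09, 2.81);
const at2Whiskers = whiskerBounds(-0.914, 0.8225, -2.39, 3.19);
```

For AT1 the lower whisker stays at the formula value (about -3.056125) because the data reach below it, while the upper whisker is pulled in to 2.81. One caveat: matplotlib by default draws each whisker at the most extreme data point that lies inside the fence, so when a fence falls inside the data range the two approaches can differ slightly.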

I also spent time trying to change the styling. I am having a bit of trouble centering the column content of the grid with the existing semantic ui element, so I'll have to try something else to replicate the design of the openml chart. For example, there's a flag to center/align all the column content, but I'm not sure how to specify different options for different columns (this is without css, just vanilla semantic ui - https://react.semantic-ui.com/collections/grid/). I'll have to use a different semantic ui element or try some specific css styling. In any case, here is a screenshot with the charts shifted over. There's still some empty space on the side that I can try to get rid of, but I'll have to keep tweaking other semantic ui elements & css to make it work. I also disabled a border on the tabbed menu content.

I also changed how the bar chart colors are chosen. Before, two colors were hardcoded; now the colors are chosen on a continuous scale with this function: https://github.com/d3/d3-scale-chromatic#cyclical. If there are other color choices/schemes that are preferable, please let me know.

joshc0044 commented 5 years ago

Also found some posts about the question of whether plotly.js sends data to any 3rd party servers

https://community.plot.ly/t/is-plotly-js-sending-data-to-plotly-servers-what-if-my-data-is-confidential/10256

It appears to be safe to use. This library should also make it significantly easier to make some of the charts, especially some of the more advanced charts.

hjwilli commented 5 years ago

Hi @joshc0044, nice research. It will be very helpful if plotly is something we can use.

Just a note to clarify how the colors should work. In the bar chart for the target feature, each bar/class should be a different color. In the stacked bar charts generated for categorical and ordinal features, the coloring for each of the classes should match the color of the corresponding class in the target chart. Please use https://www.openml.org/ for reference, here are some specific examples: (https://www.openml.org/d/4541 (see the gender, race and age features), https://www.openml.org/d/50)

joshc0044 commented 5 years ago

> Hi @joshc0044, nice research. It will be very helpful if plotly is something we can use.
>
> Just a note to clarify how the colors should work. In the bar chart for the target feature, each bar/class should be a different color. In the stacked bar charts generated for categorical and ordinal features, the coloring for each of the classes should match the color of the corresponding class in the target chart. Please use https://www.openml.org/ for reference, here are some specific examples: (https://www.openml.org/d/4541 (see the gender, race and age features), https://www.openml.org/d/50)

Thank you for the feedback; I changed the color for the stacked bar charts to match the unique color to class mapping.
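One way to keep the class colors consistent across the target bar chart and the stacked bar charts is to build a single class-to-color lookup once and reuse it everywhere. The snippet below is a sketch under assumptions: the helper name `buildClassColorMap`, the palette, and the class labels are placeholders, not the colors or code actually used in the UI.

```javascript
// Placeholder palette; the actual colors used in the UI are not given in this thread.
const PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"];

// Hypothetical helper: assign each class a color once, in first-seen order.
function buildClassColorMap(classValues, palette = PALETTE) {
  const colorByClass = new Map();
  for (const value of classValues) {
    const key = String(value);
    if (!colorByClass.has(key)) {
      // Wrap around if there are more classes than palette entries.
      colorByClass.set(key, palette[colorByClass.size % palette.length]);
    }
  }
  return colorByClass;
}

// Made-up target column, for illustration only.
const targetColumn = ["class_a", "class_b", "class_c", "class_a", "class_b"];
const classColors = buildClassColorMap(targetColumn);
```

Both the target chart and every stacked bar chart would then look up `classColors.get(className)` instead of choosing colors independently.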

This is from a subset of the diabetes openml example (https://www.openml.org/d/4541). The entire dataset was too large to upload in pennAI, so I copied the first two thousand rows and the last two thousand rows to get a small test example. I also changed the styling of the chart x-axis labels by rotating them by 45 degrees. Using this diabetes dataset highlighted some issues: openml appears to have a cutoff for when the resulting chart has too many values (for keys diag1 - diag3), and the chart for medical_speciality is probably around the upper limit for a chart that is still legible. This is what prompted me to rotate the x-axis labels by 45 degrees and add a small tooltip that shows the label when hovering over items in the x-axis.

Rotating the labels for the boxplot also resolved some issues where large values became garbled & jumbled together. Using the diabetes test dataset also highlighted an issue with some of the column keys: if a column key contained a . in its name, it would interfere with the creation of the charts, as html tags with certain characters (a period or whitespace, for example) cause problems. There are now checks for . and whitespace in the column names, and I am handling those scenarios by replacing them with _ for the UI/website. So, for example, the original column key metformin.rosiglitazone is metformin_rosiglitazone in the UI.
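The replacement described above can be sketched with a single regular expression. The function name `sanitizeColumnKey` is hypothetical, and this is an illustration of the rule rather than the code in the commit.

```javascript
// Hypothetical helper: replace every period or whitespace character in a
// column key with an underscore so the key is safe to use in the UI.
function sanitizeColumnKey(key) {
  return key.replace(/[.\s]/g, "_");
}

const safeKey = sanitizeColumnKey("metformin.rosiglitazone");
```

Here `safeKey` is `metformin_rosiglitazone`. One thing to watch for with this approach: two distinct original keys such as `a.b` and `a_b` would map to the same sanitized key.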

And for completeness, here is the chart that does not appear in openml. They display the text "Too many values to plot", and I think they're right, as the chart is a mess. I'll add a cutoff for the number of values to plot at something like 100 or so unique values and display the same message.

For reference, in the chart above there are 357 unique values for diag_1 (from the subset I have of the entire dataset; all 700 or so would be even worse).
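A cutoff like the one proposed above could look roughly like this. The threshold of 100 is the tentative figure from this comment, and the helper name `plotDecision` and the generated sample values are assumptions for illustration.

```javascript
// Tentative threshold ("something like 100 or so unique values").
const MAX_UNIQUE_VALUES = 100;
const TOO_MANY_MESSAGE = "Too many values to plot";

// Hypothetical helper: decide whether a categorical column should be charted.
function plotDecision(columnValues, maxUnique = MAX_UNIQUE_VALUES) {
  const uniqueCount = new Set(columnValues).size;
  if (uniqueCount > maxUnique) {
    return { plot: false, uniqueCount, message: TOO_MANY_MESSAGE };
  }
  return { plot: true, uniqueCount, message: null };
}

// 357 distinct made-up codes, mirroring the unique-value count seen for diag_1.
const manyCodes = Array.from({ length: 357 }, (_, i) => `code_${i}`);
const decision = plotDecision(manyCodes);
```

With 357 unique values, `decision.plot` is `false` and the UI would show the message instead of the chart.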

I also attempted to upload the tic-tac-toe dataset from openml and ran into a problem; it causes this error

 registerDataset: 5d4b274450d3e70043103d9c
lab_1      | 0|lab    | validateDatafileByFileIdAsync ('5d4b274450d3e70043103d9c', 'Class', 'top-left-square,top-middle-square,top-right-square,middle-left-square,middle-middle-square,middle-right-square,bottom-left-square,bottom-middle-square,bottom-right-square', '')
lab_1      | 0|lab    | args: lab/pyutils/validateDataset.py,5d4b274450d3e70043103d9c,-target,Class,-identifier_type,fileid,-categorical_features,["top-left-square","top-middle-square","top-right-square","middle-left-square","middle-middle-square","middle-right-square","bottom-left-square","bottom-middle-square","bottom-right-square"],-ordinal_features,""
lab_1      | 0|lab    | GET /api/v1/files/5d4b274450d3e70043103d9c 200 26063 - 5.093 ms
lab_1      | 0|lab    | error in registerDataset: Error: Datafile validation failed, sklearn.check_array() validation Found array with 0 feature(s) (shape=(958, 0)) while a minimum of 1 is required.
lab_1      | 0|lab    | error: Error: Datafile validation failed, sklearn.check_array() validation Found array with 0 feature(s) (shape=(958, 0)) while a minimum of 1 is required.

Every column is nominal/categorical, am I doing something wrong or are there other datasets I can try to test?

Update/edit (here is the file I used which is a subset of the entire diabetes dataset): diabetes_small.zip

joshc0044 commented 5 years ago

After the feedback in the meeting today pointing out that the median lines were off, I looked at the boxplot piece. Here is the code generating the median values from d3:

    median_quantile = d3.quantile(data_sorted, .5);
    median = d3.median(data_sorted);

At first I was just using the value of median_quantile from the above, as this is what is used in the calculations in the example I originally used. I didn't realize that they are not equivalent values and using median in the snippet above should resolve the issue.

(forgot to tag in commit message)

joshc0044 commented 5 years ago

Made a basic unit test to check the statistics used for boxplots, using values for the banana dataset - https://www.openml.org/d/1460

      At1: {
        q1: -0.75325,
        q3: 0.7820,
        median: -0.01525,
        min: -3.056125,
        max: 2.81, // 3.0848750000000003 with formula
        min_val_in_data: -3.09,
        max_val_in_data: 2.81
      },
      At2: {
        q1: -0.91400,
        q3: 0.8225,
        median: -0.03720,
        min: -2.39, // -3.5187500000000003 with formula
        max: 3.19, // 3.42725 with formula
        min_val_in_data: -2.39,
        max_val_in_data: 3.19
      },
      class: {
        q1: -1,
        q3: 1,
        median: -1,
        min: -1,
        max: 1,
        min_val_in_data: -1,
        max_val_in_data: 1
      }

lacava commented 5 years ago

> I didn't realize that they are not equivalent values and using median in the snippet above should resolve the issue.

those should be equivalent, i'm puzzled as to why they wouldn't be. did you confirm that d3.median gives the correct median?
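To illustrate why the two are expected to agree, here is a small standalone check in plain JavaScript (no d3), assuming linear interpolation between neighboring sorted values for the quantile. The function names and the sample values are made up. The last line shows one situation where results could diverge; this is only a guess and is not confirmed anywhere in this thread: if the values are still strings (for example straight from a csv) and are sorted without numeric coercion, the "sorted" array handed to a quantile function is not in numeric order.

```javascript
// Linear-interpolation quantile over an already sorted numeric array.
function quantileSorted(sorted, p) {
  const n = sorted.length;
  if (n === 0) return undefined;
  if (p <= 0 || n < 2) return sorted[0];
  if (p >= 1) return sorted[n - 1];
  const i = (n - 1) * p;
  const i0 = Math.floor(i);
  return sorted[i0] + (sorted[i0 + 1] - sorted[i0]) * (i - i0);
}

// Textbook median: middle value, or the mean of the two middle values.
function medianOf(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Made-up sample values.
const sample = [-3.09, -0.75, -0.02, 0.78, 2.81, 1.5];
const sortedSample = [...sample].sort((a, b) => a - b);
const viaQuantile = quantileSorted(sortedSample, 0.5);
const viaMedian = medianOf(sample);

// The same values as strings, sorted lexicographically rather than numerically.
const asStrings = sample.map(String).sort();
```

On numerically sorted input, `viaQuantile` and `viaMedian` agree up to floating-point rounding. In `asStrings`, the first element is "-0.02" rather than "-3.09", so a 0.5 quantile taken over that ordering would land on the wrong values.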

hjwilli commented 3 years ago

Restarting this effort with #309

EpistasisLab / Aliro

Refine dataset preview page #209