Closed weixuanfu closed 3 years ago
After group discussion, we are going to use a tabbed interface for the datasets details page. This should provide the screen real estate necessary to add more visual analysis. We want to expand the page to provide some basic details about each of the dataset features, as well as tools that allow users to explore relationships in the data.
Rough draft image is here: pennai_datasets_rough1.pdf
First-pass descriptions of the new tabs are:
The first UI task is to:
This issue may show how to display mpld3 figures in JavaScript: https://github.com/mpld3/mpld3/issues/128
Look for mpld3.draw_figure()
A set of d3 examples: https://www.d3-graph-gallery.com/
Made progress with creating stacked bar charts in terms of raw functionality (about 80% complete; there are still a few issues to iron out with the chart legend and all the styling).
Working with d3 is okay, but many of the tutorials, examples and resources come from various points in the library's development, and some key features explained in them have since changed. For the most part it's not too bad. Nevertheless, both Bill and Heather were right that my initial estimate for finishing this component was off; this might take a couple more days than I initially anticipated.
I need to continue learning d3 in order to properly leverage its features, as I know the things I am trying to do are readily supported.
However, I do have a major concern with trying to generate some of these visualizations in real time in the browser/client; depending on the typical dataset size (number of columns and rows), this may or may not be a problem. For reference, the Adult dataset with 48842 rows takes about 5 seconds to load all the boxplots, and generating the stacked bar plot should take a similar amount of time, maybe a bit more. It might be better to do the calculations off the browser/client in one of the docker containers.
Another possible concern is JavaScript's number precision when doing calculations. I noticed a discrepancy between the boxplots for the banana dataset on openml and the ones I am generating with d3.
Here is the openml page for the banana dataset with boxplots - https://www.openml.org/d/1460
I'm using the calculations outlined here to create the local boxplot - https://www.d3-graph-gallery.com/graph/boxplot_basic.html
The results are different. I downloaded the CSV from openml, and it generates the same slightly different boxplot as the old banana dataset. Here is the one that I am generating locally:
The only experience I have with boxplots is from an intro statistics course I took in college, so I don't know what's wrong. The formulas appear to be correct, so my guess is an issue inherent to either JavaScript or the d3 library functions used to generate the statistics. It could also be another mistake.
One last thing: on openml there are some datasets where the visualizations are weird - https://www.openml.org/d/1492 & https://www.openml.org/d/1504. This depends on what type of data is being used, but our site will essentially do the same thing. As long as this is a known limitation I am okay with it; I just want to bring some attention to it.
The updated dataset preview page is just about done in terms of functionality; there are three different types of charts generated - boxplots, a regular bar chart & stacked bar charts with basic styling
Both bar charts are mildly interactive in that one is able to hover over the bars and see the value
Barring any issues/bugs (from my understanding the charts are correct, I fixed the legend labels) or significant design changes, the only thing left to do is change/add styling
I am testing this with the appendicitis_cat_ord.csv dataset; I don't know of other datasets with categorical/ordinal features at the moment.
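For readers following along, the data behind a stacked bar chart is a count of rows per feature value and target class. A minimal sketch in plain JavaScript, assuming rows are objects keyed by column name; `countByFeatureAndClass`, `severity`, `target` and the sample rows are hypothetical and not taken from the actual component:

```javascript
// Count rows per (feature value, target class) pair: the data behind a stacked bar chart.
// `rows`, `featureKey` and `targetKey` are hypothetical names for this sketch.
function countByFeatureAndClass(rows, featureKey, targetKey) {
  const counts = new Map(); // feature value -> Map(class -> count)
  for (const row of rows) {
    const featureValue = row[featureKey];
    const cls = row[targetKey];
    if (!counts.has(featureValue)) {
      counts.set(featureValue, new Map());
    }
    const perClass = counts.get(featureValue);
    perClass.set(cls, (perClass.get(cls) || 0) + 1);
  }
  return counts;
}

// Made-up rows for illustration only.
const sampleRows = [
  { severity: 'low', target: 0 },
  { severity: 'low', target: 1 },
  { severity: 'high', target: 1 },
  { severity: 'low', target: 1 },
];
const stackedCounts = countByFeatureAndClass(sampleRows, 'severity', 'target');
console.log(stackedCounts.get('low').get(1)); // 2
```

Each inner map corresponds to one bar, and each entry in it to one stacked segment.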
Further UI refinements & other tasks
Update -
Here is a screenshot of the aforementioned UI changes. The exact positioning & sizing need to be fine-tuned, but I believe the general layout & format is close to what has been requested.
Update -
Double-checking the boxplot statistics for the banana test dataset (https://www.openml.org/d/1460). In JavaScript, here are the stats for columns AT1 & AT2:
AT1
Whiskers
AT2
Whiskers
The formulas being used are
let data_sorted = valByRowObj[tempKey].sort(d3.ascending);
// produces equivalent sorted list, sort in order to use d3.quantile(...)
//let data_sorted_AB = valByRowObj[tempKey].sort( (a, b) => {
// return a - b;
//});
q1 = d3.quantile(data_sorted, .25);
median = d3.quantile(data_sorted, .5);
q3 = d3.quantile(data_sorted, .75);
interQuantileRange = q3 - q1;
min = q1 - (1.5 * interQuantileRange);
max = q3 + (1.5 * interQuantileRange);
This uses d3 to sort the data and generate the quantiles/median. Loading the same dataset into python with pandas and applying the formulas above produces the same results.
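The agreement with pandas is expected if both sides use linear interpolation between neighbouring sorted values: the d3 documentation describes d3.quantile as using the R-7 method, and pandas defaults to linear interpolation. A minimal plain-JavaScript sketch of that definition, where `quantileR7` is a hypothetical helper (not a d3 function) and the input values are made up:

```javascript
// Linear-interpolation quantile (the R-7 definition) over an ascending-sorted numeric array.
// `quantileR7` is a hypothetical helper name, not a d3 function.
function quantileR7(sorted, p) {
  const n = sorted.length;
  if (n === 0) return undefined;
  if (p <= 0 || n < 2) return sorted[0];
  if (p >= 1) return sorted[n - 1];
  const h = (n - 1) * p;   // fractional index of the quantile
  const i = Math.floor(h); // index of the value just below it
  const lower = sorted[i];
  const upper = sorted[i + 1];
  return lower + (upper - lower) * (h - i);
}

const values = [4, 1, 3, 2].sort((a, b) => a - b); // made-up data, sorted numerically
const q1 = quantileR7(values, 0.25);    // 1.75
const median = quantileR7(values, 0.5); // 2.5
const q3 = quantileR7(values, 0.75);    // 3.25
console.log(q1, median, q3);
```

The input must be sorted numerically before the call, which is why the sort step matters.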
The only remaining discrepancy is that the min/max values used for the whisker lines of the banana boxplot on openml still differ from the ones in the boxplot I am producing: the ones on openml are about 1 less for each column. I will continue to look into this final discrepancy for boxplots and start looking into creating violin plots/histograms.
I spent some time today trying to figure out what's going on with the whisker lines in the boxplots I am generating in the webpage, and compared them against plots/values generated with python pandas & matplotlib. Firstly, I did not realize that there are different kinds of boxplot that put the whisker lines at different values (I'm going off the wiki page for boxplots - https://en.wikipedia.org/wiki/Box_plot). I was under the impression that the whisker lines were always calculated according to the formula in the comment above. In other formats the whisker lines can be the min/max values of the dataset; when the whisker lines come from the formula above, the resulting chart is apparently called a Tukey boxplot, which is the plot I am trying to make. With that said, I believe I was not calculating the min/max values correctly before, which was, at least in part, causing the discrepancy.
Most of my effort today was spent looking at pandas and matplotlib to compare the statistics generated by those libraries with the values I am generating in the webpage/JavaScript; other than the min & max values for the whisker lines, all the other values appear to be correct. After looking at the matplotlib documentation for configuring the whisker lines (https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.boxplot.html), I changed how the min/max values are set. As I read the docs, the lower whisker goes no lower than the smallest value in the data, and the upper whisker no higher than the largest. Then I used the banana dataset to make a boxplot with python, and it matches the boxplots I am generating on the webpage.
I believe both of these boxplots are essentially equivalent. However, I do not know what type of boxplot is being used on the openml website as that is different.
In short, the boxplot whisker lines now extend only as low as the min value in the data or as high as the max. Before, if the calculation went out of bounds, so to speak, the whisker lines would apparently be grossly off; they now at least match the boxplots generated with pandas/matplotlib. I also tried a few of the other datasets: tokyo1, iris and german.
Going forward I am assuming that using the min/max values of the data as lower/upper bounds for the whisker lines is a suitable method.
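A minimal sketch of the bounding described here, in plain JavaScript: compute the 1.5 * IQR fences, then clamp them to the observed min/max. `clampedWhiskers` is a hypothetical helper name; the inputs are the banana statistics reported in the unit test later in this thread.

```javascript
// Clamp the 1.5 * IQR fences to the observed data range, as described above.
// `clampedWhiskers` is a hypothetical helper name.
function clampedWhiskers(q1, q3, dataMin, dataMax) {
  const iqr = q3 - q1;
  const lowerFence = q1 - 1.5 * iqr;
  const upperFence = q3 + 1.5 * iqr;
  return {
    min: Math.max(dataMin, lowerFence), // never below the smallest value in the data
    max: Math.min(dataMax, upperFence), // never above the largest value in the data
  };
}

// q1, q3, min and max of the data for the banana columns, as reported in this thread.
const at1 = clampedWhiskers(-0.75325, 0.782, -3.09, 2.81);
const at2 = clampedWhiskers(-0.914, 0.8225, -2.39, 3.19);
console.log(at1); // min is the lower fence (about -3.056125), max is the data max 2.81
console.log(at2); // both whiskers fall back to the data range: -2.39 and 3.19
```

Note that this clamps the fence itself; it does not look for the most extreme data point inside the fence, so the result can differ slightly from libraries that do.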
I also spent time trying to change the styling. I am having a bit of trouble centering the column content of the grid with the existing semantic ui element, so I'll have to try something else to replicate the design of the openml chart. For example, there's a flag to center/align all the column content, but I'm not sure how to specify different options for different columns (this is without css, just vanilla semantic ui - https://react.semantic-ui.com/collections/grid/). I'll have to use a different semantic ui element or try some specific css styling. In any case, here is a screenshot with the charts shifted over. There's still some empty space on the side that I can try to get rid of, but I'll have to keep tweaking other semantic ui elements & css to make it work. There was also a border I disabled on the tabbed menu content.
I also changed how the bar chart colors are chosen. Before, there were just two hardcoded colors; now the color is chosen with this function on a continuous scale: https://github.com/d3/d3-scale-chromatic#cyclical. If there are other color choices/schemes that are preferable, please let me know.
Also found some posts about the question of whether plotly.js sends data to any 3rd party servers
It appears to be safe to use. This library should also make it significantly easier to make some of the charts, especially some of the more advanced charts.
Hi @joshc0044, nice research. It will be very helpful if plotly is something we can use.
Just a note to clarify how the colors should work. In the bar chart for the target feature, each bar/class should be a different color. In the stacked bar charts generated for categorical and ordinal features, the coloring for each of the classes should match the color of the corresponding class in the target chart. Please use https://www.openml.org/ for reference, here are some specific examples: (https://www.openml.org/d/4541 (see the gender, race and age features), https://www.openml.org/d/50)
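One way to get consistent colors is to build a single class-to-color lookup from the target column and reuse it for every stacked bar chart. A minimal sketch in plain JavaScript; `buildClassColorMap`, the class labels and the palette are hypothetical, and any color scale could stand in for the fixed palette:

```javascript
// Build one class -> color lookup from the target column and reuse it in every chart.
// `buildClassColorMap` and the palette below are hypothetical; any color scale would do.
function buildClassColorMap(targetValues, palette) {
  const colorMap = new Map();
  for (const cls of targetValues) {
    if (!colorMap.has(cls)) {
      // Assign colors in order of first appearance, wrapping if classes outnumber colors.
      colorMap.set(cls, palette[colorMap.size % palette.length]);
    }
  }
  return colorMap;
}

const palette = ['#1f77b4', '#ff7f0e', '#2ca02c']; // illustrative colors only
const classColors = buildClassColorMap(['no', 'yes', 'no', 'maybe'], palette); // made-up labels

// The target bar chart and each stacked segment look up the same map,
// so a class is drawn in the same color everywhere.
const targetBarColor = classColors.get('yes');
const stackedSegmentColor = classColors.get('yes');
console.log(targetBarColor === stackedSegmentColor); // true
```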
Thank you for the feedback; I changed the color for the stacked bar charts to match the unique color to class mapping.
This is from a subset of the diabetes openml example (https://www.openml.org/d/4541), as the entire dataset was too large to upload in pennAI; I copied the first two thousand rows and the last two thousand rows just to get a small test example. I also changed some of the styling of the chart x-axis labels by rotating them by 45 degrees. Using this diabetes dataset highlighted some issues - openml appears to have a cutoff for when the resulting chart would contain too many values (for keys diag1 - diag3).
And the chart for medical_speciality is probably around the upper limit of creating a chart that is still legible - this is what prompted me to rotate the x-axis labels by 45 degrees and add a small tooltip that shows the label when hovering over items on the x-axis.
Rotating the labels for the boxplot also resolved some issues where those values were too large and became jumbled together. Using the diabetes test dataset also highlighted an issue with some of the column keys: if a column key contained a . in its name, it interfered with the creation of the charts, since the keys are used in html tags and certain characters (a period or whitespace, for example) cause problems there. There are now checks for . and whitespace in the column names, and I handle those cases by replacing them with _ for the UI/website. So, for example, the original column key metformin.rosiglitazone is metformin_rosiglitazone in the UI.
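The replacement described above can be sketched as a single regular-expression substitution. `toSafeKey` is a hypothetical helper name, and the second example key is made up:

```javascript
// Replace periods and whitespace so a column key is safe to use in element ids and selectors.
// `toSafeKey` is a hypothetical helper name.
function toSafeKey(columnKey) {
  return String(columnKey).replace(/[.\s]/g, '_');
}

console.log(toSafeKey('metformin.rosiglitazone')); // metformin_rosiglitazone
console.log(toSafeKey('time in hospital'));        // time_in_hospital
```

One caveat with this approach: two distinct keys such as a.b and a_b would map to the same sanitized key, so a collision check may be worth adding.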
And for completeness, here is the chart that does not appear in openml - they display the text Too many values to plot, and I think they're right as the chart is a mess. I'll add a cutoff for the number of values to plot at something like 100 or so unique values and display the same message.
For reference, in the chart above there are 357 unique values for diag_1 (from the subset I have of the entire dataset; all 700 or so would be even worse).
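A minimal sketch of such a cutoff, assuming the limit of 100 floated above; `plotDecision` and `MAX_UNIQUE_VALUES` are hypothetical names and the column data is made up:

```javascript
const MAX_UNIQUE_VALUES = 100; // assumed cutoff, taken from the "100 or so" estimate above

// Decide whether a column has few enough distinct values to chart.
// `plotDecision` is a hypothetical helper name.
function plotDecision(columnValues, maxUnique = MAX_UNIQUE_VALUES) {
  const uniqueCount = new Set(columnValues).size;
  if (uniqueCount > maxUnique) {
    return { plot: false, uniqueCount, message: 'Too many values to plot' };
  }
  return { plot: true, uniqueCount, message: null };
}

// Made-up column with 357 distinct codes, mirroring the diag_1 count mentioned above.
const manyCodes = Array.from({ length: 357 }, (_, i) => `code_${i}`);
console.log(plotDecision(manyCodes).message);    // Too many values to plot
console.log(plotDecision(['a', 'b', 'a']).plot); // true
```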
I also attempted to upload the tic-tac-toe dataset from openml and ran into a problem; it causes this error
registerDataset: 5d4b274450d3e70043103d9c
lab_1 | 0|lab | validateDatafileByFileIdAsync ('5d4b274450d3e70043103d9c', 'Class', 'top-left-square,top-middle-square,top-right-square,middle-left-square,middle-middle-square,middle-right-square,bottom-left-square,bottom-middle-square,bottom-right-square', '')
lab_1 | 0|lab | args: lab/pyutils/validateDataset.py,5d4b274450d3e70043103d9c,-target,Class,-identifier_type,fileid,-categorical_features,["top-left-square","top-middle-square","top-right-square","middle-left-square","middle-middle-square","middle-right-square","bottom-left-square","bottom-middle-square","bottom-right-square"],-ordinal_features,""
lab_1 | 0|lab | GET /api/v1/files/5d4b274450d3e70043103d9c 200 26063 - 5.093 ms
lab_1 | 0|lab | error in registerDataset: Error: Datafile validation failed, sklearn.check_array() validation Found array with 0 feature(s) (shape=(958, 0)) while a minimum of 1 is required.
lab_1 | 0|lab | error: Error: Datafile validation failed, sklearn.check_array() validation Found array with 0 feature(s) (shape=(958, 0)) while a minimum of 1 is required.
Every column is nominal/categorical, am I doing something wrong or are there other datasets I can try to test?
Update/edit (here is the file I used which is a subset of the entire diabetes dataset): diabetes_small.zip
After the feedback in the meeting today pointing out that the median lines were off, I looked at the boxplot piece. Here is the code generating the median values from d3:
median_quantile = d3.quantile(data_sorted, .5);
median = d3.median(data_sorted);
At first I was just using the value of median_quantile from the above, as this is what is used in the calculations in the example I originally used. I didn't realize that they are not equivalent values; using median in the snippet above should resolve the issue.
Made a basic unit test to check the statistics used for boxplots, using values for the banana dataset - https://www.openml.org/d/1460
At1: {
q1: -0.75325,
q3: 0.7820,
median: -0.01525,
min: -3.056125,
max: 2.81, // 3.0848750000000003 with formula
min_val_in_data: -3.09,
max_val_in_data: 2.81
},
At2: {
q1: -0.91400,
q3: 0.8225,
median: -0.03720,
min: -2.39, // -3.5187500000000003 with formula
max: 3.19, // 3.42725
min_val_in_data: -2.39,
max_val_in_data: 3.19
},
class: {
q1: -1,
q3: 1,
median: -1,
min: -1,
max: 1,
min_val_in_data: -1,
max_val_in_data: 1
}
"I didn't realize that they are not equivalent values and using median in the snippet above should resolve the issue."
Those should be equivalent; I'm puzzled as to why they wouldn't be. Did you confirm that d3.median gives the correct median?
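One way a 0.5 quantile and a median could disagree, offered only as a hypothesis since this thread does not establish the cause: if the column values were still strings when they were sorted (for example, straight from CSV parsing), a comparator based on < and > orders them character by character, so the middle of the sorted array is not the numeric median. The plain-JavaScript sketch below uses made-up values and a hypothetical `ascending` comparator rather than any d3 call:

```javascript
// Generic "ascending" comparator: it uses < and >, so strings are ordered
// character by character rather than by numeric value.
const ascending = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// Middle element of an odd-length, already sorted array, coerced to a number.
const middle = (sorted) => Number(sorted[(sorted.length - 1) / 2]);

const raw = ['1', '2', '3', '10', '20']; // made-up values, still strings as parsed from text

const sortedAsStrings = [...raw].sort(ascending);        // ['1', '10', '2', '20', '3']
const sortedAsNumbers = raw.map(Number).sort(ascending); // [1, 2, 3, 10, 20]

console.log(middle(sortedAsStrings)); // 2
console.log(middle(sortedAsNumbers)); // 3
```

Checking the type of the first few values before sorting would confirm or rule this out.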
Restarting this effort with #309