EngineerDanny opened 5 months ago
Looks like the sparsity of the necromass data is consistent with the public data (not larger, as I was expecting), so that contradicts my hypothesis that increased sparsity in necromass may be causing the learning algorithm to not work (large test error rates).
Can you please also compute the sparsity of the updated data from Briana on Dec 22?
Sure, I am on it. Although, I am thinking that maybe I should also add the sparsity of the groups (the sub-samples), since you used that in your computation. Maybe your hypothesis is not wrong. I will update on this soon.
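For reference, a minimal sketch of what "overall sparsity" and "per-group sparsity" could look like; the names `X` and `groups` are illustrative, not from the actual repo, and the tiny matrix below just stands in for a taxa-count matrix (samples x taxa):

```python
import numpy as np

def sparsity(X):
    """Fraction of zero entries in X."""
    X = np.asarray(X)
    return np.mean(X == 0)

def group_sparsity(X, groups):
    """Sparsity computed separately within each sample group."""
    X = np.asarray(X)
    groups = np.asarray(groups)
    return {g: sparsity(X[groups == g]) for g in np.unique(groups)}

# Toy taxa-count matrix: 4 samples, 3 taxa, two groups of sub-samples.
X = np.array([[0, 3, 0],
              [1, 0, 0],
              [0, 0, 2],
              [4, 5, 0]])
groups = np.array(["A", "A", "B", "B"])

print(sparsity(X))            # 7 of 12 entries are zero
print(group_sparsity(X, groups))
```

Comparing the per-group numbers against the overall number would show whether sub-sampling concentrates the zeros.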
@tdhock These are the results when the labels are replaced with 1 (positive values) and 0 (non-positive values) to try to address the sparsity problem.
The X axis says MSE; is that right? You may consider making one plot like the above with accuracy/error, and another plot with AUC. The legend says LogisticRegression; did you use L1 regularization? What did you use for features? log transform? 0/1?
The MSE label is wrong; it is actually the accuracy (I have updated it). Yeah, I will add the AUC curve next. I used LogisticRegressionCV; the default penalty is L2, so this is using L2 regularization (I will change it to L1 and update). I used the log-transformed features (I know it's not ideal, but that's what I started with).
log transform is fine. I thought the default was L1? (not L2?) I thought it may have been accuracy and not MSE. In that case it looks like the linear model is sometimes better than featureless, but not often, so not very encouraging. Maybe worth adding 0/1 features + other transformations (with scaling) to the feature matrix, to see if that helps. It could be that the necromass data are just hard to predict, so I think the next step should be to try it on all of the other data sets, to see if it does any better.
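A hedged sketch of the comparison being discussed (not the repo's actual pipeline): L1-penalized LogisticRegressionCV against a featureless baseline (sklearn's DummyClassifier), scored with both accuracy and AUC. Synthetic data stands in for the log-transformed features; note that the L1 penalty requires a solver that supports it, such as liblinear or saga.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and 0/1 labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# L1 regularization needs an L1-capable solver (liblinear or saga).
lr = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5)
lr.fit(X_tr, y_tr)

# Featureless baseline: always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

for name, model in [("featureless", baseline), ("LogisticRegressionCV", lr)]:
    acc = accuracy_score(y_te, model.predict(X_te))
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: accuracy={acc:.3f} AUC={auc:.3f}")
```

The featureless baseline always gets AUC 0.5, so any real signal shows up as the linear model beating it on both metrics.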
@tdhock These are the results for the MSE on the test set against the # of Total Samples. GGM just doesn't work without standard scaling.
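To illustrate the scaling point, here is a small sketch assuming "GGM" refers to a Gaussian graphical model estimator along the lines of sklearn's GraphicalLassoCV; the repo's actual GGM code may differ. Standardizing each column to zero mean and unit variance before estimation is the fix being described.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Columns on wildly different scales, as raw abundance data often is.
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 1.0])

# Standard scaling puts every feature on a comparable scale, so a
# single regularization parameter is reasonable for all entries.
X_scaled = StandardScaler().fit_transform(X)
model = GraphicalLassoCV().fit(X_scaled)
print(model.precision_.shape)  # estimated 5x5 precision matrix
```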
Can you please write your hypothesis (why you wanted to make this plot) and interpretation (does it confirm your hypothesis?)? You should probably do facet_wrap(scales="free") for this plot, so we can see the differences better. There is no reason the MSE should be on the same scale at all between the different panels.
@tdhock This is the result using the other public data sets. As expected, when using all of the samples there is a considerable test error difference between Featureless and LassoCV.
Should the y axis be "test fold" ? Is that the original LassoCV? Can you please add results for LogisticRegressionCV + LassoCV? (to see if predicting 0/1 helps lower test MSE, relative to LassoCV by itself?)
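One way to read the LogisticRegressionCV + LassoCV suggestion is as a hurdle-style model: a classifier predicts whether the target is zero (0/1), and LassoCV predicts the magnitude on the nonzero cases. This interpretation is my guess at the intended combination; the repo's actual pipeline may differ, and the data below is synthetic.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Sparse target: zero about half the time, else a linear signal.
signal = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=300)
nonzero = rng.random(300) < 0.5
y = np.where(nonzero, np.abs(signal), 0.0)

# Stage 1: classify zero vs nonzero.
clf = LogisticRegressionCV(cv=5).fit(X, (y > 0).astype(int))
# Stage 2: regress the magnitude, trained only on nonzero targets.
reg = LassoCV(cv=5).fit(X[y > 0], y[y > 0])

# Combined prediction: zero when the classifier says zero,
# otherwise the LassoCV estimate.
pred = clf.predict(X) * reg.predict(X)
print(pred.shape)
```

Comparing this combined predictor's test MSE against plain LassoCV would answer the question in the comment above.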
@tdhock This is the result for the FoldId against the MSE for Featureless, LassoCV and LogisticReg+LassoCV. I don't think LogisticReg+LassoCV does better than LassoCV.
@tdhock This is the result for the new architecture.
Looks like it is about the same as the other methods on this data set. That is consistent with our hypothesis that there is little to learn in these data. Have you tried the new method on the other public data?
@tdhock Yeah, I tested it on the other data sets. Results are consistent with those of the necromass data set.
??? what are those ??? are they from public data like amgut etc?
Yeah, they are.
is that result consistent with the other figure? (one panel per data set, instead of one panel per column)
This is the table for the publicly available grouped-samples microbiome data:
Data | Source | No of Taxa | No of Samples | No of Groups | Sparsity |
---|---|---|---|---|---|
HPv13 | PubMed 22699609 | 5830 | 3285 | 71 | 98.16% |
HPv35 | PubMed 22699609 | 10730 | 6000 | 152 | 98.71% |
MovingPictures | PubMed 21624126 | 22765 | 1967 | 6 | 97.06% |
qa10394 | mSystems e00021-16 | 9719 | 1418 | 16 | 94.28% |
TwinsUK | PubMed 25417156 | 8480 | 1024 | 16 | 87.70% |
[Attachments: Public Data set, Necromass Data set, Dec22_bacteria_fungi_conservative_r_same_raw]