EngineerDanny / necromass


Sparsity Table #2

Open EngineerDanny opened 5 months ago

EngineerDanny commented 5 months ago

Public Data set

| Dataset | Sparsity (%) |
| --- | --- |
| amgut1_data | 30.40 |
| amgut2_data | 34.60 |
| baxter_crc_data | 27.78 |
| crohns_data | 1.00 |
| glne007_data | 58.88 |
| global_patterns_data | 79.07 |
| esophagus_data | 43.10 |
| enterotype_data | 67.62 |
| hmp2prot_data | 14.05 |
| hmp216S_data | 12.67 |
| mixmpln_real_data | 69.82 |
| soilrep_data | 69.82 |
| ioral_data | 43.10 |

Necromass Data set

| Dataset | Sparsity (%) |
| --- | --- |
| bacteria_conservative_raw | 29.44 |
| bacteria_genus_raw | 69.48 |
| fungi_conservative_raw | 56.01 |
| fungi_genus_raw | 85.66 |
| bacteria_fungi_conservative_raw | 41.99 |
| Dec22_bacteria_conservative_r_same_raw | 35.39 |
| Dec22_fungi_conservative_r_same_raw | 58.05 |
| Dec22_bacteria_fungi_conservative_r_same_raw | 46.42 |

Dec22_bacteria_fungi_conservative_r_same_raw

| Group | Sparsity (%) |
| --- | --- |
| AllSoilM1M3 | 44.07 |
| LowMelanM1 | 46.62 |
| HighMelanM1 | 51.84 |
| LowMelanM3 | 43.00 |
| HighMelanM3 | 48.87 |
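For reference, the sparsity values above are presumably the percentage of zero entries in the samples-by-taxa count matrix. A minimal sketch of that computation (the function name and the toy data frame are made up for illustration):

```python
import numpy as np
import pandas as pd

def sparsity_percent(counts):
    """Percentage of zero entries in a samples-by-taxa count matrix."""
    values = np.asarray(counts, dtype=float)
    return 100.0 * np.mean(values == 0)

# hypothetical example: 3 samples x 4 taxa
counts = pd.DataFrame([[0, 5, 0, 2],
                       [1, 0, 0, 0],
                       [3, 2, 1, 0]])
print(f"{sparsity_percent(counts):.2f}%")  # 50.00%
```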
tdhock commented 5 months ago

Looks like the sparsity of the necromass data is consistent with the public data (not larger, as I was expecting), so that contradicts my hypothesis that increased sparsity in necromass may be causing the learning algorithm to not work (large test error rates).

tdhock commented 5 months ago

can you please also compute the sparsity of the updated data from Briana on Dec 22?

EngineerDanny commented 5 months ago

Sure, I am on it. Although I am thinking that maybe I should also add the sparsity of the groups (the sub-samples), since you used those in your computation, so maybe your hypothesis is not wrong. I will update on this soon.

EngineerDanny commented 5 months ago

Binary Classification

@tdhock These are the results when the labels are replaced with 1 (positive values) and 0 (negative values) to try to address the sparsity problem.

[Figure: binary_classifier]

tdhock commented 5 months ago

The X axis says MSE, is that right? You may consider making one plot like the above with accuracy/error, and another plot with AUC. The legend says LogisticRegression; did you use L1 regularization? What did you use for features? Log transform? 0/1?

EngineerDanny commented 5 months ago

The MSE label is wrong; it is actually the accuracy (I have updated it). Yeah, I will add the AUC curve next. I used LogisticRegressionCV; the default penalty is L2, so this is using L2 regularization (I will change it to L1 and update). I used the log-transformed features (I know it's not ideal, but that's what I started with).
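A minimal sketch of the change being described, switching LogisticRegressionCV from its default L2 penalty to L1 and reporting both accuracy and AUC; the random data here stands in for the log-transformed feature matrix and binarized labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# hypothetical inputs: X_log = log-transformed abundance features,
# y = 1 where the transformed target value is positive, 0 otherwise
rng = np.random.default_rng(0)
X_log = rng.normal(size=(200, 30))
y = (rng.random(200) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X_log, y, random_state=0)

# the scikit-learn default penalty is "l2"; the "saga" solver supports L1
model = LogisticRegressionCV(penalty="l1", solver="saga", max_iter=5000)
model.fit(X_train, y_train)

pred_label = model.predict(X_test)
pred_score = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, pred_label))
print("AUC:", roc_auc_score(y_test, pred_score))
```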

tdhock commented 5 months ago

Log transform is fine. I thought the default was L1 (not L2)? I thought it may have been accuracy and not MSE. In that case it looks like the linear model is sometimes better than featureless, but not often, so not very encouraging. Maybe worth adding 0/1 features + other transformations (with scaling) to the feature matrix, to see if that helps. It could be that the necromass data are just hard to predict, so I think the next step should be to try it on all of the other data sets, to see if it does any better.
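One way the suggested feature augmentation could look: concatenate the log-transformed features, 0/1 presence/absence indicators, and another transformation, then standardize the combined matrix. This is only a sketch of the idea, not the repo's actual code; the transformations chosen here are examples.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical raw count matrix: samples x taxa
rng = np.random.default_rng(1)
counts = rng.poisson(0.5, size=(100, 20)).astype(float)

log_feats = np.log1p(counts)               # log(1 + count) transform
binary_feats = (counts > 0).astype(float)  # 0/1 presence/absence indicators
sqrt_feats = np.sqrt(counts)               # one more transformation as an example

# concatenate and scale so the different transformations are comparable
X = np.hstack([log_feats, binary_feats, sqrt_feats])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.shape)  # (100, 60)
```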

EngineerDanny commented 5 months ago

Binary Classification + Regression

[Figure: output_class_reg]

EngineerDanny commented 5 months ago

[Figure: output]

@tdhock These are the results for the MSE on the test set against the # of total samples. GGM just doesn't work without standard scaling.
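If the GGM here is fit with something like scikit-learn's GraphicalLassoCV (an assumption; the thread does not say which implementation is used), the scaling issue can be addressed by standardizing each column before fitting. A minimal sketch:

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.preprocessing import StandardScaler

# hypothetical log-transformed abundance matrix: samples x taxa
rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=3.0, size=(150, 12))

# without standardization the L1 penalty acts very unevenly across taxa,
# so scale each column to zero mean / unit variance before fitting
X_scaled = StandardScaler().fit_transform(X)

ggm = GraphicalLassoCV().fit(X_scaled)
precision = ggm.precision_  # estimated sparse inverse covariance
print(precision.shape)      # (12, 12)
```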

tdhock commented 5 months ago

> @tdhock These are the results for the MSE on the test set against the # of total samples. GGM just doesn't work without standard scaling.

Can you please write your hypothesis (why you wanted to make this plot) and interpretation (does it confirm your hypothesis)? You should probably do facet_wrap(scales="free") for this plot, so we can see the differences better. There is no reason the MSE should be on the same scale at all between the different panels.
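The facet_wrap(scales="free") suggestion is ggplot2 syntax; if the plot is made from Python, plotnine exposes the same interface. A hypothetical sketch (the data frame columns here are made up, not the repo's actual results format):

```python
import pandas as pd
from plotnine import ggplot, aes, geom_point, facet_wrap

# hypothetical results frame: one row per (data set, algorithm, sample size)
results = pd.DataFrame({
    "data_set": ["amgut1", "amgut1", "necromass", "necromass"],
    "algorithm": ["Featureless", "LassoCV", "Featureless", "LassoCV"],
    "n_samples": [100, 100, 80, 80],
    "test_mse": [1.2, 0.9, 2.5, 2.4],
})

# scales="free" lets every panel use its own MSE axis
plot = (ggplot(results, aes("n_samples", "test_mse", color="algorithm"))
        + geom_point()
        + facet_wrap("~data_set", scales="free"))
plot.save("mse_by_dataset.png")
```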

tdhock commented 5 months ago

> @tdhock This is the result using the other public data sets. As expected, when using all of the samples there is a considerable test error difference between Featureless and LassoCV.

> [Figure: output]

Should the y axis be "test fold" ? Is that the original LassoCV? Can you please add results for LogisticRegressionCV + LassoCV? (to see if predicting 0/1 helps lower test MSE, relative to LassoCV by itself?)

EngineerDanny commented 5 months ago

@tdhock This is the result for the FoldId against the MSE for Featureless, LassoCV and LogisticReg+LassoCV. I don't think LogisticReg+LassoCV does better than LassoCV.

[Figure: output]
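One plausible reading of "LogisticReg+LassoCV" is a two-stage model: a classifier predicts whether the target is zero vs. nonzero, LassoCV is fit on the nonzero part, and the two are combined at prediction time. The combination rule below (regression output where the classifier says nonzero, zero otherwise) is an assumption made for illustration, not necessarily the rule used in the repo.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

# hypothetical data: X = log-transformed features, y = target with exact zeros
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 25))
y = np.where(rng.random(300) < 0.4, 0.0, rng.normal(2.0, 1.0, 300))

is_nonzero = (y != 0).astype(int)

clf = LogisticRegressionCV(max_iter=5000).fit(X, is_nonzero)   # stage 1: zero vs nonzero
reg = LassoCV().fit(X[y != 0], y[y != 0])                      # stage 2: value when nonzero

# combined prediction: regression output where the classifier predicts nonzero
pred = np.where(clf.predict(X) == 1, reg.predict(X), 0.0)
print("train MSE:", np.mean((y - pred) ** 2))
```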

EngineerDanny commented 4 months ago

@tdhock This is the result for the new architecture.

[Figure: output]

tdhock commented 4 months ago

looks like it is about the same as other methods in this data set. that is consistent with our hypothesis that there is little to learn in these data. have you tried the new method on the other public data?

EngineerDanny commented 4 months ago

@tdhock Yeah, I tested it on the other data sets. The results are consistent with those of the necromass data set.

[Figure: output]

tdhock commented 4 months ago

??? what are those ??? are they from public data like amgut etc?

EngineerDanny commented 4 months ago

Yeah, they are.

tdhock commented 4 months ago

is that result consistent with the other figure? (one panel per data set, instead of one panel per column)

EngineerDanny commented 4 months ago

[Figure: output]

tdhock commented 4 months ago

https://www.andrewheiss.com/blog/2022/05/09/hurdle-lognormal-gaussian-brms/#normally-distributed-outcomes-with-zeros

tdhock commented 4 months ago

https://www.statsmodels.org/devel/examples/notebooks/generated/count_hurdle.html

tdhock commented 4 months ago

https://search.r-project.org/CRAN/refmans/EnvStats/html/ZeroModifiedLognormal.html
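These references describe hurdle / zero-modified lognormal models for outcomes with many exact zeros. A minimal numpy sketch of the zero-modified lognormal idea (just to illustrate the two-part structure these links discuss, not code from the repo):

```python
import numpy as np

# hypothetical abundance vector with many exact zeros
rng = np.random.default_rng(4)
y = np.where(rng.random(500) < 0.6, 0.0,
             rng.lognormal(mean=1.0, sigma=0.5, size=500))

# part 1: probability of an exact zero
p_zero = np.mean(y == 0)

# part 2: lognormal fit to the strictly positive values
log_pos = np.log(y[y > 0])
mu, sigma = log_pos.mean(), log_pos.std(ddof=1)

# expected value of the zero-modified lognormal: (1 - p0) * E[lognormal]
expected = (1 - p_zero) * np.exp(mu + sigma**2 / 2)
print(f"p_zero={p_zero:.2f}, mu={mu:.2f}, sigma={sigma:.2f}, E[y]={expected:.2f}")
```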

EngineerDanny commented 2 months ago

This is the table for the publicly available grouped-samples microbiome data:

| Data | Source | No. of Taxa | No. of Samples | No. of Groups | Sparsity (%) |
| --- | --- | --- | --- | --- | --- |
| HPv13 | PubMed 22699609 | 5830 | 3285 | 71 | 98.16 |
| HPv35 | PubMed 22699609 | 10730 | 6000 | 152 | 98.71 |
| MovingPictures | PubMed 21624126 | 22765 | 1967 | 6 | 97.06 |
| qa10394 | mSystems e00021-16 | 9719 | 1418 | 16 | 94.28 |
| TwinsUK | PubMed 25417156 | 8480 | 1024 | 16 | 87.70 |