EngineerDanny / necromass


Sparsity Table #2

Open EngineerDanny opened 5 months ago

EngineerDanny commented 5 months ago

Public Data set

| Dataset | Sparsity (%) |
| --- | --- |
| amgut1_data | 30.40 |
| amgut2_data | 34.60 |
| baxter_crc_data | 27.78 |
| crohns_data | 1.00 |
| glne007_data | 58.88 |
| global_patterns_data | 79.07 |
| esophagus_data | 43.10 |
| enterotype_data | 67.62 |
| hmp2prot_data | 14.05 |
| hmp216S_data | 12.67 |
| mixmpln_real_data | 69.82 |
| soilrep_data | 69.82 |
| ioral_data | 43.10 |

Necromass Data set

| Dataset | Sparsity (%) |
| --- | --- |
| bacteria_conservative_raw | 29.44 |
| bacteria_genus_raw | 69.48 |
| fungi_conservative_raw | 56.01 |
| fungi_genus_raw | 85.66 |
| bacteria_fungi_conservative_raw | 41.99 |
| Dec22_bacteria_conservative_r_same_raw | 35.39 |
| Dec22_fungi_conservative_r_same_raw | 58.05 |
| Dec22_bacteria_fungi_conservative_r_same_raw | 46.42 |

Dec22_bacteria_fungi_conservative_r_same_raw

| Group | Sparsity (%) |
| --- | --- |
| AllSoilM1M3 | 44.07 |
| LowMelanM1 | 46.62 |
| HighMelanM1 | 51.84 |
| LowMelanM3 | 43.00 |
| HighMelanM3 | 48.87 |
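For reference, the sparsity values above are presumably the percentage of zero entries in the samples-by-taxa count matrix. A minimal sketch of that computation (the function name and the toy data frame are made up for illustration):

```python
import numpy as np
import pandas as pd

def sparsity_percent(counts):
    """Percentage of zero entries in a samples-by-taxa count matrix."""
    values = np.asarray(counts, dtype=float)
    return 100.0 * np.mean(values == 0)

# hypothetical example: 3 samples x 4 taxa
counts = pd.DataFrame([[0, 5, 0, 2],
                       [1, 0, 0, 0],
                       [3, 2, 1, 0]])
print(f"{sparsity_percent(counts):.2f}%")  # 50.00%
```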
tdhock commented 5 months ago

Looks like the sparsity of the necromass data is consistent with the public data (not larger, as I was expecting), so that contradicts my hypothesis that increased sparsity in necromass may be causing the learning algorithm to not work (large test error rates).

tdhock commented 5 months ago

can you please also compute the sparsity of the updated data from Briana on Dec 22?

EngineerDanny commented 5 months ago

Sure, I am on it. Although I am thinking that maybe I should also add the sparsity of the groups (the sub-samples), since you used those in your computation, so maybe your hypothesis is not wrong. I will update on this soon.

EngineerDanny commented 5 months ago

Binary Classification

@tdhock These are the results when the labels are replaced with 1 (positive values) and 0 (negative values) to try to address the sparsity problem.

[Figure: binary_classifier]

tdhock commented 5 months ago

The X axis says MSE, is that right? You may consider making one plot like the above with accuracy/error, and another plot with AUC. The legend says LogisticRegression; did you use L1 regularization? What did you use for features? Log transform? 0/1?

EngineerDanny commented 5 months ago

The MSE label is wrong; it is actually the accuracy (I have updated it). Yeah, I will add the AUC curve next. I used LogisticRegressionCV; the default penalty is L2, so this is using L2 regularization (I will change it to L1 and update). I used the log-transformed features (I know it's not ideal, but that's what I started with).
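A minimal sketch of the change being described, switching LogisticRegressionCV from its default L2 penalty to L1 and reporting both accuracy and AUC; the random data here stands in for the log-transformed feature matrix and binarized labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# hypothetical inputs: X_log = log-transformed abundance features,
# y = 1 where the transformed target value is positive, 0 otherwise
rng = np.random.default_rng(0)
X_log = rng.normal(size=(200, 30))
y = (rng.random(200) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X_log, y, random_state=0)

# the scikit-learn default penalty is "l2"; the "saga" solver supports L1
model = LogisticRegressionCV(penalty="l1", solver="saga", max_iter=5000)
model.fit(X_train, y_train)

pred_label = model.predict(X_test)
pred_score = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, pred_label))
print("AUC:", roc_auc_score(y_test, pred_score))
```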

tdhock commented 5 months ago

Log transform is fine. I thought the default was L1 (not L2)? I thought it may have been accuracy and not MSE. In that case it looks like the linear model is sometimes better than featureless, but not often, so not very encouraging. Maybe worth adding 0/1 features + other transformations (with scaling) to the feature matrix, to see if that helps. It could be that the necromass data are just hard to predict, so I think the next step should be to try it on all of the other data sets, to see if it does any better.
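One way the suggested feature augmentation could look: concatenate the log-transformed features, 0/1 presence/absence indicators, and another transformation, then standardize the combined matrix. This is only a sketch of the idea, not the repo's actual code; the transformations chosen here are examples.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical raw count matrix: samples x taxa
rng = np.random.default_rng(1)
counts = rng.poisson(0.5, size=(100, 20)).astype(float)

log_feats = np.log1p(counts)               # log(1 + count) transform
binary_feats = (counts > 0).astype(float)  # 0/1 presence/absence indicators
sqrt_feats = np.sqrt(counts)               # one more transformation as an example

# concatenate and scale so the different transformations are comparable
X = np.hstack([log_feats, binary_feats, sqrt_feats])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.shape)  # (100, 60)
```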

EngineerDanny commented 5 months ago

Binary Classification + Regression

[Figure: output_class_reg]

EngineerDanny commented 5 months ago

[Figure: output]

@tdhock These are the results for the MSE on the test set against the # of total samples. GGM just doesn't work without standard scaling.
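If the GGM here is fit with something like scikit-learn's GraphicalLassoCV (an assumption; the thread does not say which implementation is used), the scaling issue can be addressed by standardizing each column before fitting. A minimal sketch:

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.preprocessing import StandardScaler

# hypothetical log-transformed abundance matrix: samples x taxa
rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=3.0, size=(150, 12))

# without standardization the L1 penalty acts very unevenly across taxa,
# so scale each column to zero mean / unit variance before fitting
X_scaled = StandardScaler().fit_transform(X)

ggm = GraphicalLassoCV().fit(X_scaled)
precision = ggm.precision_  # estimated sparse inverse covariance
print(precision.shape)      # (12, 12)
```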

tdhock commented 5 months ago

> @tdhock These are the results for the MSE on the test set against the # of total samples. GGM just doesn't work without standard scaling.

Can you please write your hypothesis (why you wanted to make this plot) and interpretation (does it confirm your hypothesis)? You should probably do facet_wrap(scales="free") for this plot, so we can see the differences better. There is no reason the MSE should be on the same scale at all between the different panels.
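The facet_wrap(scales="free") suggestion is ggplot2 syntax; if the plot is made from Python, plotnine exposes the same interface. A hypothetical sketch (the data frame columns here are made up, not the repo's actual results format):

```python
import pandas as pd
from plotnine import ggplot, aes, geom_point, facet_wrap

# hypothetical results frame: one row per (data set, algorithm, sample size)
results = pd.DataFrame({
    "data_set": ["amgut1", "amgut1", "necromass", "necromass"],
    "algorithm": ["Featureless", "LassoCV", "Featureless", "LassoCV"],
    "n_samples": [100, 100, 80, 80],
    "test_mse": [1.2, 0.9, 2.5, 2.4],
})

# scales="free" lets every panel use its own MSE axis
plot = (ggplot(results, aes("n_samples", "test_mse", color="algorithm"))
        + geom_point()
        + facet_wrap("~data_set", scales="free"))
plot.save("mse_by_dataset.png")
```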

tdhock commented 5 months ago

> @tdhock This is the result using the other public data sets. As expected, when using all of the samples there is a considerable test error difference between Featureless and LassoCV.

> [Figure: output]

Should the y axis be "test fold" ? Is that the original LassoCV? Can you please add results for LogisticRegressionCV + LassoCV? (to see if predicting 0/1 helps lower test MSE, relative to LassoCV by itself?)

EngineerDanny commented 5 months ago

@tdhock This is the result for the FoldId against the MSE for Featureless, LassoCV and LogisticReg+LassoCV. I don't think LogisticReg+LassoCV does better than LassoCV.

[Figure: output]
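One plausible reading of "LogisticReg+LassoCV" is a two-stage model: a classifier predicts whether the target is zero vs. nonzero, LassoCV is fit on the nonzero part, and the two are combined at prediction time. The combination rule below (regression output where the classifier says nonzero, zero otherwise) is an assumption made for illustration, not necessarily the rule used in the repo.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

# hypothetical data: X = log-transformed features, y = target with exact zeros
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 25))
y = np.where(rng.random(300) < 0.4, 0.0, rng.normal(2.0, 1.0, 300))

is_nonzero = (y != 0).astype(int)

clf = LogisticRegressionCV(max_iter=5000).fit(X, is_nonzero)   # stage 1: zero vs nonzero
reg = LassoCV().fit(X[y != 0], y[y != 0])                      # stage 2: value when nonzero

# combined prediction: regression output where the classifier predicts nonzero
pred = np.where(clf.predict(X) == 1, reg.predict(X), 0.0)
print("train MSE:", np.mean((y - pred) ** 2))
```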

EngineerDanny commented 4 months ago

@tdhock This is the result for the new architecture.

[Figure: output]

tdhock commented 4 months ago

looks like it is about the same as other methods in this data set. that is consistent with our hypothesis that there is little to learn in these data. have you tried the new method on the other public data?

EngineerDanny commented 4 months ago

@tdhock Yeah, I tested it on the other data sets. The results are consistent with those of the necromass data set.

[Figure: output]

tdhock commented 4 months ago

??? what are those ??? are they from public data like amgut etc?

EngineerDanny commented 4 months ago

Yeah, they are.

tdhock commented 4 months ago

is that result consistent with the other figure? (one panel per data set, instead of one panel per column)

EngineerDanny commented 4 months ago

[Figure: output]

tdhock commented 4 months ago

https://www.andrewheiss.com/blog/2022/05/09/hurdle-lognormal-gaussian-brms/#normally-distributed-outcomes-with-zeros

tdhock commented 4 months ago

https://www.statsmodels.org/devel/examples/notebooks/generated/count_hurdle.html

tdhock commented 4 months ago

https://search.r-project.org/CRAN/refmans/EnvStats/html/ZeroModifiedLognormal.html
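These references describe hurdle / zero-modified lognormal models for outcomes with many exact zeros. A minimal numpy sketch of the zero-modified lognormal idea (just to illustrate the two-part structure these links discuss, not code from the repo):

```python
import numpy as np

# hypothetical abundance vector with many exact zeros
rng = np.random.default_rng(4)
y = np.where(rng.random(500) < 0.6, 0.0,
             rng.lognormal(mean=1.0, sigma=0.5, size=500))

# part 1: probability of an exact zero
p_zero = np.mean(y == 0)

# part 2: lognormal fit to the strictly positive values
log_pos = np.log(y[y > 0])
mu, sigma = log_pos.mean(), log_pos.std(ddof=1)

# expected value of the zero-modified lognormal: (1 - p0) * E[lognormal]
expected = (1 - p_zero) * np.exp(mu + sigma**2 / 2)
print(f"p_zero={p_zero:.2f}, mu={mu:.2f}, sigma={sigma:.2f}, E[y]={expected:.2f}")
```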

EngineerDanny commented 2 months ago

This is the table for the publicly available grouped-samples microbiome data:

| Data | Source | No. of Taxa | No. of Samples | No. of Groups | Sparsity (%) |
| --- | --- | --- | --- | --- | --- |
| HPv13 | PubMed 22699609 | 5830 | 3285 | 71 | 98.16 |
| HPv35 | PubMed 22699609 | 10730 | 6000 | 152 | 98.71 |
| MovingPictures | PubMed 21624126 | 22765 | 1967 | 6 | 97.06 |
| qa10394 | mSystems e00021-16 | 9719 | 1418 | 16 | 94.28 |
| TwinsUK | PubMed 25417156 | 8480 | 1024 | 16 | 87.70 |