BackofenLab / Cherri

https://backofenlab.github.io/Cherri/
GNU General Public License v3.0

model_build not working #39

Closed teresa-m closed 1 year ago

teresa-m commented 2 years ago

Testing Cherri's model build resulted in bad cross-model performance. With the old runs, we got tree-based estimators most of the time. What we changed so far to improve the performance:

  1. increased the optimization time to 12 h (43200 s)
  2. set the number of jobs to '-1'
  3. set the memory to (64000/14) MB
  4. also changed the number of jobs to increase the memory per job

This resulted in a KNeighborsClassifier (cross-model performance still in testing).
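For reference, the changed settings map onto an optimization call roughly like the sketch below (the flags are the ones from the biofilm-optimize6 call quoted later in this issue; the paths and dataset name are placeholders):

```python
import subprocess

# Sketch only: placeholder paths, flags mirror the biofilm-optimize6 call
# quoted further down in this thread (--n_jobs, --time, --memoryMBthread).
cmd = [
    "python", "-W", "ignore", "-m", "biofilm.biofilm-optimize6",
    "--infile", "feature_files/training_data_PARIS_human_context_150",   # placeholder
    "--featurefile", "model/features/PARIS_human_context_150",           # placeholder
    "--out", "model/optimized/PARIS_human_context_150",                  # placeholder
    "--preprocess", "True",
    "--folds", "0",
    "--n_jobs", "-1",                       # change 2: use all cores
    "--time", "43200",                      # change 1: 12 h optimization budget
    "--memoryMBthread", str(64000 // 14),   # change 3: ~4571 MB per job
]
subprocess.run(cmd, check=True)
```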

Summary of possible error sources:

| Error source | How to check |
| --- | --- |
| Data: the new data is not filtered by whether ChiRA could find an RRI | 1. Test run using old data (in progress) |
| Eden features: I can only use 12 nbits; for higher values the process gets killed. I cannot test this on my instances because I have too little RAM. Maybe Stefan can do a run (if all other options fail) to emulate his first optimization | |
| Optimization | 1. Test run using old data. 2. Check the warning that only two models per run are tested. 3. Do a long run with only a few threads but very high MB per thread |
| General setup: there could still be something wrong within the Cherri data generation | 1. Test model build without Eden features |
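On the Eden-features point: assuming the vectorizer hashes features into 2^nbits buckets (an assumption, not checked against the EDeN source), the width of the feature matrix grows exponentially with nbits, which would explain runs getting killed for lack of RAM. A rough back-of-the-envelope sketch:

```python
# Rough sketch: estimate the dense-matrix memory footprint for different nbits,
# assuming the feature space has 2**nbits columns (assumption) and float64 entries.
n_samples = 100_000  # hypothetical number of RRI instances

for nbits in (12, 14, 16):
    n_features = 2 ** nbits
    dense_gb = n_samples * n_features * 8 / 1e9  # 8 bytes per float64
    print(f"nbits={nbits}: {n_features} features, ~{dense_gb:.1f} GB if densified")
```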
teresa-m commented 2 years ago

The following calls resulted in the following estimators:

| where | dataset | n_jobs | MB per thread | time (s) | estimator | test score | val score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| denbi | PARIS_human | -1 | 4300 | 43200 | KNeighborsClassifier(n_neighbors=2) | ? | ? |
| denbi | PARIS_human | 7 | 8000 | 43200 | KNeighborsClassifier(n_neighbors=1) | 0.9769516007852452 | 0.931 |
| Michi_PC | PARIS_mouse | -1 | 2000 | 43200 | KNeighborsClassifier(n_neighbors=1, p=1, weights='distance') | 0.9806713376035464 | 0.941 |
| Michi_PC | PARIS_human_RBP | 4 | 8000 | 50000 | KNeighborsClassifier(n_neighbors=1) | 0.9765021819402484 | 0.929 |
| Michi_PC | Old_PARIS_human | 4 | 8000 | 50000 | KNeighborsClassifier(n_neighbors=2, p=1, weights='distance') | 0.9728710530759274 | 0.918 |
| Stefan | PARIS_human | 7 | 8000 | 43200 | | | |
teresa-m commented 2 years ago

I still get the 'init_dgesdd failed init' error when calling Cherri.

Possible explanation: maybe for some estimators the script fails because 8000 MB are not enough. I changed the script so that it will not crash if an error is reported within the optimization.
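The change is conceptually just a guard around the optimization call; a minimal sketch (with `run_optimization` as a placeholder for the actual biofilm call) looks like this:

```python
import logging

def safe_optimize(run_optimization, *args, **kwargs):
    """Run one optimization attempt; log and continue instead of crashing."""
    # Sketch only: 'run_optimization' is a placeholder for the actual biofilm call.
    try:
        return run_optimization(*args, **kwargs)
    except Exception as err:  # e.g. MemoryError or LAPACK failures like init_dgesdd
        logging.warning("optimization run failed and is skipped: %s", err)
        return None
```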

teresa-m commented 2 years ago
| Column: model / Row: data | PARIS_human | PARIS_human_RBPs | PARIS_mouse | SPLASH |
| --- | --- | --- | --- | --- |
| estimator | HistGradientBoostingClassifier | KNeighborsClassifier(n_neighbors=1) | KNeighborsClassifier(n_neighbors=1) | ExtraTreesClassifier |
| PARIS_human | 0.839 | 0.84 | 0.52 | 0.63 |
| PARIS_human_RBPs | 0.93 | 0.856 | 0.54 | 0.61 |
| PARIS_mouse | 0.63 | 0.55 | 0.856 | 0.63 |
| SPLASH | 0.47 | 0.46 | 0.48 | 0.792 |
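For context, each off-diagonal cell is the F1 score of a model trained on one dataset and applied to another. A minimal sketch of that kind of cross-model evaluation, assuming plain scikit-learn objects and already-vectorized data (not the exact Cherri evaluation pipeline):

```python
from sklearn.metrics import f1_score

def cross_model_f1(model, X_other, y_other):
    """F1 of a model trained on one dataset, evaluated on another dataset."""
    # Assumption: 'model' is an already-fitted scikit-learn classifier and
    # X_other/y_other are the feature matrix and labels of the other dataset.
    y_pred = model.predict(X_other)
    return f1_score(y_other, y_pred)
```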
teresa-m commented 2 years ago

-> Test of old data with Eden features and my evaluation pipeline:

- PARIS_human$PARIS_human_rbps -> F1 score = 0.546
- PARIS_human_rbps$PARIS_human -> F1 score = 0.504

teresa-m commented 2 years ago

Stefan's original model runs:

```
denbiubuntu~/r/biofilm$[84] shellpy showmodels.spy CHERRY/optimized (biofilm72)
################################################################################
CHERRY/optimized/full_.model
{'loss': 'auto', 'learning_rate': 0.27825205658418045, 'max_iter': 512, 'min_samples_leaf': 125, 'max_depth': None, 'max_leaf_nodes': 234, 'max_bins': 255, 'l2_regularization': 1.5724450377145835e-10, 'early_stop': 'train', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 11, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=True, l2_regularization=1.5724450377145835e-10, learning_rate=0.27825205658418045, max_iter=512, max_leaf_nodes=234, min_samples_leaf=125, n_iter_no_change=11, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': True}
################################################################################
CHERRY/optimized/fullhuman.model
{'loss': 'auto', 'learning_rate': 0.18214912973602268, 'max_iter': 512, 'min_samples_leaf': 70, 'max_depth': None, 'max_leaf_nodes': 287, 'max_bins': 255, 'l2_regularization': 2.444124445454329e-05, 'early_stop': 'train', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 18, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=True, l2_regularization=2.444124445454329e-05, learning_rate=0.18214912973602268, max_iter=512, max_leaf_nodes=287, min_samples_leaf=70, n_iter_no_change=18, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': True}
################################################################################
CHERRY/optimized/paris_humanRBPs.model
{'loss': 'auto', 'learning_rate': 0.10694317220519729, 'max_iter': 512, 'min_samples_leaf': 42, 'max_depth': None, 'max_leaf_nodes': 1474, 'max_bins': 255, 'l2_regularization': 0.006245650100708052, 'early_stop': 'off', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 0, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=False, l2_regularization=0.006245650100708052, learning_rate=0.10694317220519729, max_iter=512, max_leaf_nodes=1474, min_samples_leaf=42, n_iter_no_change=0, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': False}
################################################################################
CHERRY/optimized/paris_humanRRI.model
{'loss': 'auto', 'learning_rate': 0.03860685097120409, 'max_iter': 512, 'min_samples_leaf': 44, 'max_depth': None, 'max_leaf_nodes': 1032, 'max_bins': 255, 'l2_regularization': 3.70177817485565e-10, 'early_stop': 'train', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 12, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=True, l2_regularization=3.70177817485565e-10, learning_rate=0.03860685097120409, max_iter=512, max_leaf_nodes=1032, min_samples_leaf=44, n_iter_no_change=12, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': True}
################################################################################
CHERRY/optimized/paris_mouseRRI.model
{'loss': 'auto', 'learning_rate': 0.18433753680428502, 'max_iter': 512, 'min_samples_leaf': 2, 'max_depth': None, 'max_leaf_nodes': 50, 'max_bins': 255, 'l2_regularization': 0.00040891478141833804, 'early_stop': 'off', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 0, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=False, l2_regularization=0.00040891478141833804, learning_rate=0.18433753680428502, max_iter=512, max_leaf_nodes=50, min_samples_leaf=2, n_iter_no_change=0, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': False}
################################################################################
CHERRY/optimized/paris_splash_humanRRI.model
{'loss': 'auto', 'learning_rate': 0.09011344096058371, 'max_iter': 512, 'min_samples_leaf': 21, 'max_depth': None, 'max_leaf_nodes': 481, 'max_bins': 255, 'l2_regularization': 6.439073376265142e-06, 'early_stop': 'off', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 0, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=False, l2_regularization=6.439073376265142e-06, learning_rate=0.09011344096058371, max_iter=512, max_leaf_nodes=481, min_samples_leaf=21, n_iter_no_change=0, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': False}
################################################################################
CHERRY/optimized/splash_humanRRI.model
{'n_neighbors': 2, 'weights': 'distance', 'p': 1, 'random_state': 1, 'estimator': KNeighborsClassifier(n_neighbors=2, p=1, weights='distance')}
```
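For comparison, the paris_humanRRI.model hyperparameters above can be re-instantiated directly in scikit-learn (sketch only; this rebuilds the untrained estimator configuration, it does not load Stefan's fitted biofilm model):

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Hyperparameters copied from the CHERRY/optimized/paris_humanRRI.model dump above.
# This recreates the estimator configuration only; it is not a trained model.
clf = HistGradientBoostingClassifier(
    learning_rate=0.03860685097120409,
    max_iter=512,
    min_samples_leaf=44,
    max_leaf_nodes=1032,
    l2_regularization=3.70177817485565e-10,
    early_stopping=True,
    n_iter_no_change=12,
    validation_fraction=None,
    random_state=1,
    warm_start=True,
)
# clf.fit(X_train, y_train) would then train it on a feature matrix and labels.
```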

teresa-m commented 2 years ago

**Updated results**

Most basic evaluation: old data with no Eden features!

| Column: model / Row: data | PARIS_human | PARIS_human_RBPs | PARIS_mouse |
| --- | --- | --- | --- |
| estimator | KNeighborsClassifier(n_neighbors=1) | HistGradientBoostingClassifier | KNeighborsClassifier(n_neighbors=4) |
| PARIS_human | 0.843 | 0.863 | 0.538 |
| PARIS_human_RBPs | 0.906 | 0.847 | 0.538 |
| PARIS_mouse | 0.563 | 0.624 | 0.819 |
teresa-m commented 2 years ago

Stefan's results:

- human model -> mouse data: F1 = 0.700
- mouse model -> human data: F1 = 0.623

The trusted RRIs were recomputed here. This does not look very different from the 'old' F1 scores!

teresa-m commented 2 years ago

**Updated values**

Results with a fixed PYTHONHASHSEED, using Eden features:

My F1 scores from Cherri eval:

| Column: model / Row: data | PARIS_human | PARIS_human_RBPs | PARIS_mouse |
| --- | --- | --- | --- |
| estimator | KNeighborsClassifier(n_neighbors=2) | HistGradientBoostingClassifier | KNeighborsClassifier(n_neighbors=1, …) |
| PARIS_human | 0.917 | 0.882 | 0.475 |
| PARIS_human_RBPs | 0.936 | 0.902 | 0.470 |
| PARIS_mouse | 0.575 | 0.662 | 0.931 |
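Since the Eden feature hashing apparently depends on Python's built-in hash(), fixing PYTHONHASHSEED is what makes the feature vectors, and therefore these scores, reproducible across runs. A small sanity-check sketch (the exact mechanism inside Cherri is an assumption here):

```python
import os
import sys

# Sanity check: PYTHONHASHSEED must be set in the environment *before* the
# interpreter starts, otherwise str hashing (and any hash()-based feature
# vectorization) is randomized per run.
seed = os.environ.get("PYTHONHASHSEED")
if seed is None or seed == "random":
    sys.exit("PYTHONHASHSEED is not fixed; feature hashing will not be reproducible.")

print(f"PYTHONHASHSEED={seed}, hash('test')={hash('test')}")  # stable across runs now
```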
teresa-m commented 2 years ago

Test: 'init_dgesdd failed init' (on mouse data). It appears when calling Cherri on my and on Stefan's de.NBI cloud instance, but also when just calling biofilm for the optimization. Call:

```
nohup python -W ignore -m biofilm.biofilm-optimize6  --infile //home/uhlm/Dokumente/Teresa/test_Cherri_old_data//PARIS_mouse/feature_files//training_data_PARIS_mouse_context_150  --featurefile //home/uhlm/Dokumente/Teresa/test_Cherri_old_data//PARIS_mouse//model//features/PARIS_mouse_context_150 --memoryMBthread 10000 --folds 0 --out //home/uhlm/Dokumente/Teresa/test_Cherri_old_data//PARIS_mouse//model//optimized/PARIS_mouse_context_150 --preprocess True --n_jobs 6 --time 50000 > test_only_mouse_model &
```

output:

```
adding .npz to filename
optimization datatype: <class 'numpy.ndarray'>
[WARNING] [2022-04-01 18:18:12,557:Client-AutoML(1):520fb4fc-b1d7-11ec-b0e5-901b0eb924fa] Capping the per_run_time_limit to 24999.0 to have time for a least 2 models in each process.
init_dgesdd failed init
init_dgesdd failed init
Traceback (most recent call last):
  File "/home/uhlm/Progs/anaconda3/envs/cherri/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/uhlm/Progs/anaconda3/envs/cherri/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/uhlm/Progs/anaconda3/envs/cherri/lib/python3.8/site-packages/biofilm/biofilm-optimize6.py", line 82, in <module>
    main()
  File "/home/uhlm/Progs/anaconda3/envs/cherri/lib/python3.8/site-packages/biofilm/biofilm-optimize6.py", line 78, in main
    print('\n',pipeline.steps[2][1].choice.preprocessor.get_support())
AttributeError: 'FastICA' object has no attribute 'get_support'
adding .npz to filename

########## CSV WRITTEN ##########

TEST score=0.9394774845739793
("{'n_neighbors': 1, 'weights': 'distance', 'p': 1, 'random_state': 1, "
 "'estimator': KNeighborsClassifier(n_neighbors=1, p=1, weights='distance')}")

########## MODEL WRITTEN ##########
```
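The AttributeError at the end comes from unconditionally calling get_support() on whatever preprocessor auto-sklearn selected; FastICA does not implement it. A defensive variant of that debug print could look like the following sketch (not the actual biofilm fix):

```python
def describe_preprocessor(pipeline):
    """Print the selected features if the preprocessor supports it (sketch only)."""
    # 'pipeline' is assumed to be the fitted auto-sklearn pipeline from biofilm-optimize6.
    preprocessor = pipeline.steps[2][1].choice.preprocessor
    if hasattr(preprocessor, "get_support"):  # feature selectors expose this
        print(preprocessor.get_support())
    else:  # e.g. FastICA, which triggered the AttributeError above
        print(f"{type(preprocessor).__name__} has no get_support()")
```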
pavanvidem commented 2 years ago

Results with a fixed PYTHONHASHSEED, using Eden features:

my F1 scores from Cherri eval

| Column: model / Row: data | PARIS_human | PARIS_human_RBPs | PARIS_mouse |
| --- | --- | --- | --- |
| estimator | KNeighborsClassifier(n_neighbors=2) | HistGradientBoostingClassifier | KNeighborsClassifier(n_neighbors=1, …) |
| PARIS_human | 0.917 | 0.949 | |
| PARIS_human_RBPs | 0.907 | 0.902 | |
| PARIS_mouse | | | 0.931 |

These numbers look closer to the numbers in the supplementary file.

teresa-m commented 2 years ago

Hopefully, it is now working! I am currently evaluating the mouse cross-model data and hope this will also be similar. If this is the case, we may have to switch back to the old way of data generation, taking only RRIs for which ChiRA detected a hybrid.

teresa-m commented 1 year ago

After fixing several bugs, the model building is working now.