BackofenLab / Cherri

https://backofenlab.github.io/Cherri/
GNU General Public License v3.0

model_build not working #39

Closed teresa-m closed 1 year ago

teresa-m commented 2 years ago

Testing Cherri's model build resulted in bad cross-model performance. With the old runs, we got tree-based estimators most of the time. What we changed so far to improve the performance:

  1. increased the optimization time to 12 h (43200 s)
  2. set the number of jobs to '-1'
  3. set the memory to (64000/14) MB
  4. also changed the number of jobs to increase the memory per job

This resulted in a KNeighborsClassifier (cross-model performance still in testing).
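For reference, the changed settings map onto an optimization call roughly like the sketch below (the flags are the ones from the biofilm-optimize6 call quoted later in this issue; the paths and dataset name are placeholders):

```python
import subprocess

# Sketch only: placeholder paths, flags mirror the biofilm-optimize6 call
# quoted further down in this thread (--n_jobs, --time, --memoryMBthread).
cmd = [
    "python", "-W", "ignore", "-m", "biofilm.biofilm-optimize6",
    "--infile", "feature_files/training_data_PARIS_human_context_150",   # placeholder
    "--featurefile", "model/features/PARIS_human_context_150",           # placeholder
    "--out", "model/optimized/PARIS_human_context_150",                  # placeholder
    "--preprocess", "True",
    "--folds", "0",
    "--n_jobs", "-1",                       # change 2: use all cores
    "--time", "43200",                      # change 1: 12 h optimization budget
    "--memoryMBthread", str(64000 // 14),   # change 3: ~4571 MB per job
]
subprocess.run(cmd, check=True)
```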

Summary of possible error sources:

| Error source | How to check |
| --- | --- |
| Data: the new data is not filtered by whether ChiRA could find an RRI | 1. Test run using old data (in progress) |
| Eden features: I can only use 12 nbits; for higher values the process gets killed. I cannot test this on my instances because I have too little RAM. Maybe Stefan can do a run (if all other options fail) to emulate his first optimization | |
| Optimization | 1. Test run using old data. 2. Check the warning that only two models per run are tested. 3. Do a long run with only a few threads but very high MB per thread |
| General setup: there could still be something wrong within the Cherri data generation | 1. Test model build without Eden features |
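On the Eden-features point: assuming the vectorizer hashes features into 2^nbits buckets (an assumption, not checked against the EDeN source), the width of the feature matrix grows exponentially with nbits, which would explain runs getting killed for lack of RAM. A rough back-of-the-envelope sketch:

```python
# Rough sketch: estimate the dense-matrix memory footprint for different nbits,
# assuming the feature space has 2**nbits columns (assumption) and float64 entries.
n_samples = 100_000  # hypothetical number of RRI instances

for nbits in (12, 14, 16):
    n_features = 2 ** nbits
    dense_gb = n_samples * n_features * 8 / 1e9  # 8 bytes per float64
    print(f"nbits={nbits}: {n_features} features, ~{dense_gb:.1f} GB if densified")
```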
teresa-m commented 2 years ago

The following calls resulted in the following estimators:

| where | dataset | n_jobs | MB per thread | time (s) | estimator | test score | val score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| denbi | PARIS_human | -1 | 4300 | 43200 | KNeighborsClassifier(n_neighbors=2) | ? | ? |
| denbi | PARIS_human | 7 | 8000 | 43200 | KNeighborsClassifier(n_neighbors=1) | 0.9769516007852452 | 0.931 |
| Michi_PC | PARIS_mouse | -1 | 2000 | 43200 | KNeighborsClassifier(n_neighbors=1, p=1, weights='distance') | 0.9806713376035464 | 0.941 |
| Michi_PC | PARIS_human_RBP | 4 | 8000 | 50000 | KNeighborsClassifier(n_neighbors=1) | 0.9765021819402484 | 0.929 |
| Michi_PC | Old_PARIS_human | 4 | 8000 | 50000 | KNeighborsClassifier(n_neighbors=2, p=1, weights='distance') | 0.9728710530759274 | 0.918 |
| Stefan | PARIS_human | 7 | 8000 | 43200 | | | |
teresa-m commented 2 years ago

I still get the 'init_dgesdd failed init' error when calling Cherri.

Possible explanation: maybe for some estimators the script fails because 8000 MB are not enough. I changed the script so that it will not crash if an error is reported within the optimization.
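The change is conceptually just a guard around the optimization call; a minimal sketch (with `run_optimization` as a placeholder for the actual biofilm call) looks like this:

```python
import logging

def safe_optimize(run_optimization, *args, **kwargs):
    """Run one optimization attempt; log and continue instead of crashing."""
    # Sketch only: 'run_optimization' is a placeholder for the actual biofilm call.
    try:
        return run_optimization(*args, **kwargs)
    except Exception as err:  # e.g. MemoryError or LAPACK failures like init_dgesdd
        logging.warning("optimization run failed and is skipped: %s", err)
        return None
```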

teresa-m commented 2 years ago
| Column: model / Row: data | PARIS_human | PARIS_human_RBPs | PARIS_mouse | SPLASH |
| --- | --- | --- | --- | --- |
| estimator | HistGradientBoostingClassifier | KNeighborsClassifier(n_neighbors=1) | KNeighborsClassifier(n_neighbors=1) | ExtraTreesClassifier |
| PARIS_human | 0.839 | 0.84 | 0.52 | 0.63 |
| PARIS_human_RBPs | 0.93 | 0.856 | 0.54 | 0.61 |
| PARIS_mouse | 0.63 | 0.55 | 0.856 | 0.63 |
| SPLASH | 0.47 | 0.46 | 0.48 | 0.792 |
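For context, each off-diagonal cell is the F1 score of a model trained on one dataset and applied to another. A minimal sketch of that kind of cross-model evaluation, assuming plain scikit-learn objects and already-vectorized data (not the exact Cherri evaluation pipeline):

```python
from sklearn.metrics import f1_score

def cross_model_f1(model, X_other, y_other):
    """F1 of a model trained on one dataset, evaluated on another dataset."""
    # Assumption: 'model' is an already-fitted scikit-learn classifier and
    # X_other/y_other are the feature matrix and labels of the other dataset.
    y_pred = model.predict(X_other)
    return f1_score(y_other, y_pred)
```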
teresa-m commented 2 years ago

-> Test of old data with Eden features and my evaluation pipeline:

- PARIS_human$PARIS_human_rbps -> F1 score = 0.546
- PARIS_human_rbps$PARIS_human -> F1 score = 0.504

teresa-m commented 2 years ago

Stefan's original model runs:

```
denbiubuntu~/r/biofilm$[84] shellpy showmodels.spy CHERRY/optimized (biofilm72)
################################################################################
CHERRY/optimized/full_.model
{'loss': 'auto', 'learning_rate': 0.27825205658418045, 'max_iter': 512, 'min_samples_leaf': 125, 'max_depth': None, 'max_leaf_nodes': 234, 'max_bins': 255, 'l2_regularization': 1.5724450377145835e-10, 'early_stop': 'train', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 11, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=True, l2_regularization=1.5724450377145835e-10, learning_rate=0.27825205658418045, max_iter=512, max_leaf_nodes=234, min_samples_leaf=125, n_iter_no_change=11, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': True}
################################################################################
CHERRY/optimized/fullhuman.model
{'loss': 'auto', 'learning_rate': 0.18214912973602268, 'max_iter': 512, 'min_samples_leaf': 70, 'max_depth': None, 'max_leaf_nodes': 287, 'max_bins': 255, 'l2_regularization': 2.444124445454329e-05, 'early_stop': 'train', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 18, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=True, l2_regularization=2.444124445454329e-05, learning_rate=0.18214912973602268, max_iter=512, max_leaf_nodes=287, min_samples_leaf=70, n_iter_no_change=18, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': True}
################################################################################
CHERRY/optimized/paris_humanRBPs.model
{'loss': 'auto', 'learning_rate': 0.10694317220519729, 'max_iter': 512, 'min_samples_leaf': 42, 'max_depth': None, 'max_leaf_nodes': 1474, 'max_bins': 255, 'l2_regularization': 0.006245650100708052, 'early_stop': 'off', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 0, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=False, l2_regularization=0.006245650100708052, learning_rate=0.10694317220519729, max_iter=512, max_leaf_nodes=1474, min_samples_leaf=42, n_iter_no_change=0, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': False}
################################################################################
CHERRY/optimized/paris_humanRRI.model
{'loss': 'auto', 'learning_rate': 0.03860685097120409, 'max_iter': 512, 'min_samples_leaf': 44, 'max_depth': None, 'max_leaf_nodes': 1032, 'max_bins': 255, 'l2_regularization': 3.70177817485565e-10, 'early_stop': 'train', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 12, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=True, l2_regularization=3.70177817485565e-10, learning_rate=0.03860685097120409, max_iter=512, max_leaf_nodes=1032, min_samples_leaf=44, n_iter_no_change=12, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': True}
################################################################################
CHERRY/optimized/paris_mouseRRI.model
{'loss': 'auto', 'learning_rate': 0.18433753680428502, 'max_iter': 512, 'min_samples_leaf': 2, 'max_depth': None, 'max_leaf_nodes': 50, 'max_bins': 255, 'l2_regularization': 0.00040891478141833804, 'early_stop': 'off', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 0, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=False, l2_regularization=0.00040891478141833804, learning_rate=0.18433753680428502, max_iter=512, max_leaf_nodes=50, min_samples_leaf=2, n_iter_no_change=0, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': False}
################################################################################
CHERRY/optimized/paris_splash_humanRRI.model
{'loss': 'auto', 'learning_rate': 0.09011344096058371, 'max_iter': 512, 'min_samples_leaf': 21, 'max_depth': None, 'max_leaf_nodes': 481, 'max_bins': 255, 'l2_regularization': 6.439073376265142e-06, 'early_stop': 'off', 'tol': 1e-07, 'scoring': 'loss', 'n_iter_no_change': 0, 'validation_fraction': None, 'random_state': 1, 'verbose': 0, 'estimator': HistGradientBoostingClassifier(early_stopping=False, l2_regularization=6.439073376265142e-06, learning_rate=0.09011344096058371, max_iter=512, max_leaf_nodes=481, min_samples_leaf=21, n_iter_no_change=0, random_state=1, validation_fraction=None, warm_start=True), 'fullyfit': True, 'validationfraction': None, 'earlystopping': False}
################################################################################
CHERRY/optimized/splash_humanRRI.model
{'n_neighbors': 2, 'weights': 'distance', 'p': 1, 'random_state': 1, 'estimator': KNeighborsClassifier(n_neighbors=2, p=1, weights='distance')}
```
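For comparison, the paris_humanRRI.model hyperparameters above can be re-instantiated directly in scikit-learn (sketch only; this rebuilds the untrained estimator configuration, it does not load Stefan's fitted biofilm model):

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Hyperparameters copied from the CHERRY/optimized/paris_humanRRI.model dump above.
# This recreates the estimator configuration only; it is not a trained model.
clf = HistGradientBoostingClassifier(
    learning_rate=0.03860685097120409,
    max_iter=512,
    min_samples_leaf=44,
    max_leaf_nodes=1032,
    l2_regularization=3.70177817485565e-10,
    early_stopping=True,
    n_iter_no_change=12,
    validation_fraction=None,
    random_state=1,
    warm_start=True,
)
# clf.fit(X_train, y_train) would then train it on a feature matrix and labels.
```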

teresa-m commented 2 years ago

**Updated results**

Most basic evaluation: old data with no Eden features!

| Column: model / Row: data | PARIS_human | PARIS_human_RBPs | PARIS_mouse |
| --- | --- | --- | --- |
| estimator | KNeighborsClassifier(n_neighbors=1) | HistGradientBoostingClassifier | KNeighborsClassifier(n_neighbors=4) |
| PARIS_human | 0.843 | 0.863 | 0.538 |
| PARIS_human_RBPs | 0.906 | 0.847 | 0.538 |
| PARIS_mouse | 0.563 | 0.624 | 0.819 |
teresa-m commented 2 years ago

Stefan's results:

- human model -> mouse data: F1 = 0.700
- mouse model -> human data: F1 = 0.623

The trusted RRIs were recomputed here. This does not look very different from the 'old' F1 scores!

teresa-m commented 2 years ago

**Updated values**

Results with a fixed PYTHONHASHSEED, using Eden features:

My F1 scores from Cherri eval:

| Column: model / Row: data | PARIS_human | PARIS_human_RBPs | PARIS_mouse |
| --- | --- | --- | --- |
| estimator | KNeighborsClassifier(n_neighbors=2) | HistGradientBoostingClassifier | KNeighborsClassifier(n_neighbors=1, …) |
| PARIS_human | 0.917 | 0.882 | 0.475 |
| PARIS_human_RBPs | 0.936 | 0.902 | 0.470 |
| PARIS_mouse | 0.575 | 0.662 | 0.931 |
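Since the Eden feature hashing apparently depends on Python's built-in hash(), fixing PYTHONHASHSEED is what makes the feature vectors, and therefore these scores, reproducible across runs. A small sanity-check sketch (the exact mechanism inside Cherri is an assumption here):

```python
import os
import sys

# Sanity check: PYTHONHASHSEED must be set in the environment *before* the
# interpreter starts, otherwise str hashing (and any hash()-based feature
# vectorization) is randomized per run.
seed = os.environ.get("PYTHONHASHSEED")
if seed is None or seed == "random":
    sys.exit("PYTHONHASHSEED is not fixed; feature hashing will not be reproducible.")

print(f"PYTHONHASHSEED={seed}, hash('test')={hash('test')}")  # stable across runs now
```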
teresa-m commented 2 years ago

Test: 'init_dgesdd failed init' (on mouse data). It appears when calling Cherri on my and on Stefan's de.NBI cloud instance, but also when just calling biofilm for the optimization. Call:

```
nohup python -W ignore -m biofilm.biofilm-optimize6  --infile //home/uhlm/Dokumente/Teresa/test_Cherri_old_data//PARIS_mouse/feature_files//training_data_PARIS_mouse_context_150  --featurefile //home/uhlm/Dokumente/Teresa/test_Cherri_old_data//PARIS_mouse//model//features/PARIS_mouse_context_150 --memoryMBthread 10000 --folds 0 --out //home/uhlm/Dokumente/Teresa/test_Cherri_old_data//PARIS_mouse//model//optimized/PARIS_mouse_context_150 --preprocess True --n_jobs 6 --time 50000 > test_only_mouse_model &
```

output:

```
adding .npz to filename
optimization datatype: <class 'numpy.ndarray'>
[WARNING] [2022-04-01 18:18:12,557:Client-AutoML(1):520fb4fc-b1d7-11ec-b0e5-901b0eb924fa] Capping the per_run_time_limit to 24999.0 to have time for a least 2 models in each process.
init_dgesdd failed init
init_dgesdd failed init
Traceback (most recent call last):
  File "/home/uhlm/Progs/anaconda3/envs/cherri/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/uhlm/Progs/anaconda3/envs/cherri/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/uhlm/Progs/anaconda3/envs/cherri/lib/python3.8/site-packages/biofilm/biofilm-optimize6.py", line 82, in <module>
    main()
  File "/home/uhlm/Progs/anaconda3/envs/cherri/lib/python3.8/site-packages/biofilm/biofilm-optimize6.py", line 78, in main
    print('\n',pipeline.steps[2][1].choice.preprocessor.get_support())
AttributeError: 'FastICA' object has no attribute 'get_support'
adding .npz to filename

########## CSV WRITTEN ##########

TEST score=0.9394774845739793
("{'n_neighbors': 1, 'weights': 'distance', 'p': 1, 'random_state': 1, "
 "'estimator': KNeighborsClassifier(n_neighbors=1, p=1, weights='distance')}")

########## MODEL WRITTEN ##########
```
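The AttributeError at the end comes from unconditionally calling get_support() on whatever preprocessor auto-sklearn selected; FastICA does not implement it. A defensive variant of that debug print could look like the following sketch (not the actual biofilm fix):

```python
def describe_preprocessor(pipeline):
    """Print the selected features if the preprocessor supports it (sketch only)."""
    # 'pipeline' is assumed to be the fitted auto-sklearn pipeline from biofilm-optimize6.
    preprocessor = pipeline.steps[2][1].choice.preprocessor
    if hasattr(preprocessor, "get_support"):  # feature selectors expose this
        print(preprocessor.get_support())
    else:  # e.g. FastICA, which triggered the AttributeError above
        print(f"{type(preprocessor).__name__} has no get_support()")
```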
pavanvidem commented 2 years ago

Results with a fixed PYTHONHASHSEED, using Eden features:

my F1 scores from Cherri eval

| Column: model / Row: data | PARIS_human | PARIS_human_RBPs | PARIS_mouse |
| --- | --- | --- | --- |
| estimator | KNeighborsClassifier(n_neighbors=2) | HistGradientBoostingClassifier | KNeighborsClassifier(n_neighbors=1, …) |
| PARIS_human | 0.917 | 0.949 | |
| PARIS_human_RBPs | 0.907 | 0.902 | |
| PARIS_mouse | | | 0.931 |

These numbers look closer to the numbers in the supplementary file.

teresa-m commented 2 years ago

Hopefully, it is now working! I am currently evaluating the mouse cross-model data and hope this will also be similar. If this is the case, we may have to switch back to the old way of data generation, taking only RRIs for which ChiRA detected a hybrid.

teresa-m commented 1 year ago

After fixing several bugs, the model building is working now.