bzhanglab / AutoRT

AutoRT: Peptide retention time prediction using deep learning
GNU General Public License v3.0

Training From Scratch Crashing #17

Closed bertauchekurtis closed 2 years ago

bertauchekurtis commented 2 years ago

Hello. I'm trying to train a model from scratch, but I'm unable to run AutoRT. Every time I try to train the model, I receive the following error, and I'm not sure what to do about it. Thanks. :)

C:\Users\Kurtis\Desktop\Research\AutoRT-master>py autort.py train -i C:\Users\Kurtis\Desktop\Research\data\autoRTTrainingSet.tsv -o C:\Users\Kurtis\Desktop\Research\data
Scaling method: min_max
Step 1:

Traceback (most recent call last):
  File "C:\Users\Kurtis\Desktop\Research\AutoRT-master\autort.py", line 158, in <module>
    main()
  File "C:\Users\Kurtis\Desktop\Research\AutoRT-master\autort.py", line 109, in main
    two_step_ensemble_models(input_data=input_file, nb_epoch=epochs, batch_size=batch_size,
  File "C:\Users\Kurtis\Desktop\Research\AutoRT-master\autort\RTModels.py", line 675, in two_step_ensemble_models
    ensemble_models(models_file=models_file, input_data=input_data, ensemble_method=ensemble_method,
  File "C:\Users\Kurtis\Desktop\Research\AutoRT-master\autort\RTModels.py", line 735, in ensemble_models
    with open(models_file, "r") as read_file:
TypeError: expected str, bytes or os.PathLike object, not NoneType

wenbostar commented 2 years ago

You have to specify the model file path:

py autort.py train -e 100 -b 64 -m models/base_model/model.json -u m -i C:\Users\Kurtis\Desktop\Research\data\autoRTTrainingSet.tsv -sm min_max -rlr -n 20 -o C:\Users\Kurtis\Desktop\Research\data
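The traceback ends at open(models_file, "r") with models_file set to None, which is the error Python raises when no path reaches that call. A minimal sketch of a guard that would report the problem more clearly is shown here; load_models_config is a hypothetical helper written for illustration and is not part of AutoRT.

```python
import json
import os


def load_models_config(models_file):
    """Hypothetical helper (not AutoRT code): load a model.json with clear errors."""
    if models_file is None:
        # open(None, "r") is what raises the TypeError seen in the traceback.
        raise ValueError("no model file given; pass the path to model.json with -m")
    if not os.path.isfile(models_file):
        raise FileNotFoundError(f"model file not found: {models_file}")
    with open(models_file, "r") as read_file:
        return json.load(read_file)
```

Checking for None before opening the file turns an opaque TypeError into a message that names the missing option.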

What types of variable modifications does your training data contain? If it contains only variable modifications from the list below, I highly recommend using our pre-trained models with transfer learning. This can significantly improve RT prediction performance compared with training a model from scratch.

Oxidation (M), Phosphorylation (STY), Acetylation (K), Ubiquitination (K).

bertauchekurtis commented 2 years ago

Ah, I see. Thank you so much for the very fast explanation! I am only using Oxidation (M) and Phosphorylation (STY), so I will look into the transfer learning. Thank you.

bertauchekurtis commented 2 years ago

Actually, I am still having the same issue even when specifying the model file path:

C:\Users\Kurtis\Desktop\Research\AutoRT-master>py autort.py train -e 100 -b 64 -g models/base_model/model.json -u m -i C:\Users\Kurtis\Desktop\Research\data\autoRTTrainingSet.tsv -sm min_max -rlr -n 20 -o C:\Users\Kurtis\Desktop\Research\data
Scaling method: min_max
Step 1:

Traceback (most recent call last):
  File "C:\Users\Kurtis\Desktop\Research\AutoRT-master\autort.py", line 158, in <module>
    main()
  File "C:\Users\Kurtis\Desktop\Research\AutoRT-master\autort.py", line 109, in main
    two_step_ensemble_models(input_data=input_file, nb_epoch=epochs, batch_size=batch_size,
  File "C:\Users\Kurtis\Desktop\Research\AutoRT-master\autort\RTModels.py", line 675, in two_step_ensemble_models
    ensemble_models(models_file=models_file, input_data=input_data, ensemble_method=ensemble_method,
  File "C:\Users\Kurtis\Desktop\Research\AutoRT-master\autort\RTModels.py", line 735, in ensemble_models
    with open(models_file, "r") as read_file:
TypeError: expected str, bytes or os.PathLike object, not NoneType

wenbostar commented 2 years ago

We have a pre-trained phosphorylation base model, so you can use the following command to train your model with a transfer learning strategy:

py autort.py train -i C:\Users\Kurtis\Desktop\Research\data\autoRTTrainingSet.tsv -o C:\Users\Kurtis\Desktop\Research\data\tf_model/ -e 40 -b 64 -u m -m models/ptm_base_model/phosphorylation_sty/model.json -rlr -n 10

Please note that in your training data (C:\Users\Kurtis\Desktop\Research\data\autoRTTrainingSet.tsv) you have to use "1" to replace oxidized M, "2" to replace phosphorylated S, "3" to replace phosphorylated T and "4" to replace phosphorylated Y, as described at https://github.com/bzhanglab/AutoRT#usage. You don't need to change unmodified M, S, T and Y.
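The replacement rule above can be sketched as follows. The input notation used here (a tag in parentheses after the residue, such as M(ox) or S(ph)) is only an assumption about how the training data might be annotated, so adjust the mapping to whatever your search engine writes. encode_peptide is a hypothetical helper, not part of AutoRT.

```python
# Assumed input notation: modified residue followed by a tag, e.g. "M(ox)", "S(ph)".
# The digit codes are the ones listed in the comment above.
MOD_TO_CODE = {
    "M(ox)": "1",  # oxidized M
    "S(ph)": "2",  # phosphorylated S
    "T(ph)": "3",  # phosphorylated T
    "Y(ph)": "4",  # phosphorylated Y
}


def encode_peptide(annotated):
    """Replace each modified residue with its one-character code."""
    encoded = annotated
    for token, code in MOD_TO_CODE.items():
        encoded = encoded.replace(token, code)
    # Anything still in parentheses is a modification this sketch does not cover.
    if "(" in encoded or ")" in encoded:
        raise ValueError(f"unsupported modification in {annotated!r}")
    return encoded


print(encode_peptide("AM(ox)S(ph)PEPTIDEK"))  # prints A12PEPTIDEK
```

Unmodified residues pass through unchanged, so a sequence such as MSTYK stays as it is.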

wenbostar commented 2 years ago

Sorry, there was a typo in the command line. Please use the new command line as shown above.

bertauchekurtis commented 2 years ago

It is working now. Thank you very much for your help and showing me what I was doing wrong. :)

wenbostar commented 2 years ago

You're welcome. AutoRT can predict RT for phosphopeptides very accurately. If the prediction performance is not as good as shown at https://github.com/bzhanglab/AutoRT#performance-of-autort, please let me know. I'm happy to help.

bertauchekurtis commented 2 years ago

One more question if you don't mind. Once I've finished the training, which model file should I be using when I run the predictions?

wenbostar commented 2 years ago

The command line for prediction is shown below. The output folder C:\Users\Kurtis\Desktop\Research\data\tf_model contains the trained model files. You only need to pass C:\Users\Kurtis\Desktop\Research\data\tf_model\model.json to -s; that file contains all the model information needed for prediction.

py autort.py predict -t test.tsv -s C:\Users\Kurtis\Desktop\Research\data\tf_model\model.json -o tf_prediction/ -p test

bertauchekurtis commented 2 years ago

Okay, thank you!

bertauchekurtis commented 2 years ago

What effect do the batch size and number of epochs have on the training of the model?

wenbostar commented 2 years ago

These two parameters can have a big impact on performance. You could use "-e 40 -b 64". I have used these settings on many datasets, from small (~1000 peptides) to very large, and they work very well.
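As a rough illustration of how the two parameters interact: the batch size sets how many peptides contribute to each weight update, and the number of epochs sets how many passes are made over the training data, so together they determine the total number of updates. The sketch below is plain arithmetic and assumes every peptide is used for training (some may in practice be held out for validation); total_updates is a hypothetical helper, not an AutoRT function.

```python
import math


def total_updates(n_peptides, batch_size, epochs):
    """Number of weight updates: batches per epoch times number of epochs."""
    batches_per_epoch = math.ceil(n_peptides / batch_size)
    return batches_per_epoch * epochs


# About 1000 peptides with "-e 40 -b 64": 16 batches per epoch, 640 updates in total.
print(total_updates(1000, 64, 40))  # prints 640
```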

bertauchekurtis commented 2 years ago

Okay, I will try that, thank you. My dataset has about 70,000 peptides.

wenbostar commented 2 years ago

That is large enough to train a very good model. If the dataset was generated from multiple MS/MS runs/fractions, make sure that the RTs of peptides from different runs/fractions are well aligned. In addition, keep only one entry in the training dataset for each peptide form (peptide sequence plus modifications).
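One way to enforce the one-entry-per-peptide-form rule is sketched below. It assumes a tab-separated file with a peptide column named x and a retention time column named y (check the AutoRT usage notes for the expected layout). It collapses duplicates to their median RT, which is a choice made for this sketch and not something prescribed in this thread, and it presumes the RTs have already been aligned across runs.

```python
import csv
import io
import statistics
from collections import defaultdict


def collapse_duplicates(rows, peptide_col="x", rt_col="y"):
    """Keep one entry per peptide form, using the median RT of its duplicates."""
    rts = defaultdict(list)
    for row in rows:
        rts[row[peptide_col]].append(float(row[rt_col]))
    return [{peptide_col: pep, rt_col: statistics.median(vals)}
            for pep, vals in rts.items()]


# Tiny in-memory example standing in for a training TSV (column names assumed).
example = "x\ty\nA1PEPTIDEK\t10.0\nA1PEPTIDEK\t12.0\nPEPTIDER\t30.5\n"
rows = csv.DictReader(io.StringIO(example), delimiter="\t")
for entry in collapse_duplicates(rows):
    print(entry["x"], entry["y"])
```

Because the modification codes are part of the sequence string, A1PEPTIDEK and APEPTIDEK count as different peptide forms and are kept separately.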