DeepMicrobes: taxonomic classification for metagenomics with deep learning
Prediction using the seq2species model #3

Bartvelp commented 4 years ago

First off, thank you for your very nice paper, really interesting results! I am trying to reproduce your results, but I am getting stuck at the prediction step using the Seq2Species model. I have installed DeepMicrobes and am trying to predict the species of a 100 bp FASTA file (which contains 16S rRNA of E. coli). I do not have paired-end reads. These are the commands I currently run:

(DeepMicrobes) bart@Bart-HP-PAV14:~/DeepMicrobes/pipelines$ seq2tfrec_onehot.py --input_seq=../test_fasta_100bp.fa --output_tfrec=../temp.onehot.tfrec --is_train=False --seq_type=fasta

This seems to run fine. Then I run:

(DeepMicrobes) bart@Bart-HP-PAV14:~/DeepMicrobes/pipelines$ ./predict_seq2species.sh -i ../temp.onehot.tfrec -p 1 -m ../weights_seq2species/ -o test_output

But this gives the following error:

(DeepMicrobes) bart@Bart-HP-PAV14:~/DeepMicrobes/pipelines$ ./predict_seq2species.sh -i ../temp.onehot.tfrec -p 1 -m ../weights_seq2species/ -o test_output
Prediction started ...
2020-05-28 12:32:42.030930: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
RUNNING MODE:  predict_paired_class
I0528 12:32:42.035207 139629123098432 tf_logging.py:115] Using default config.
I0528 12:32:42.035573 139629123098432 tf_logging.py:115] Using config: {'_model_dir': '../weights_seq2species/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7efdcec60438>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
I0528 12:32:42.096365 139629123098432 tf_logging.py:115] Calling model_fn.
I0528 12:32:42.222223 139629123098432 tf_logging.py:115] Done calling model_fn.
I0528 12:32:42.312916 139629123098432 tf_logging.py:115] Graph was finalized.
I0528 12:32:42.314475 139629123098432 tf_logging.py:115] Restoring parameters from ../weights_seq2species/model.ckpt-0
I0528 12:32:42.752866 139629123098432 tf_logging.py:115] Running local_init_op.
I0528 12:32:42.756458 139629123098432 tf_logging.py:115] Done running local_init_op.
Traceback (most recent call last):
  File "/home/bart/DeepMicrobes/DeepMicrobes.py", line 370, in <module>
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/absl/app.py", line 278, in run
    _run_main(main, args)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/absl/app.py", line 239, in _run_main
  File "/home/bart/DeepMicrobes/DeepMicrobes.py", line 353, in main
  File "/home/bart/DeepMicrobes/models/format_prediction.py", line 93, in paired_report
    batch_prob = average_paired_end(batch_prob, num_classes)
  File "/home/bart/DeepMicrobes/models/format_prediction.py", line 18, in average_paired_end
    prob_matrix = np.mean(np.reshape(prob_matrix, (-1, 4, num_classes)), axis=1)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 232, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 2505 into shape (4,2505)
paste: test_output.category_paired.txt: No such file or directory
rm: cannot remove 'test_output.category_paired.txt': No such file or directory
rm: cannot remove 'test_output.prob_paired.txt': No such file or directory
Result: test_output.result.txt
 

This reshape also does not make sense to me. I would just like the probability for each class. It seems the probabilities are held in prob_matrix[0], but I don't know which index corresponds to which class (species). Any help would be greatly appreciated.

MicrobeLab commented 4 years ago

Hello Bart,

The mapping from the index in the prob matrix to the class is provided in a tab-delimited file. Here is the mapping file for the pre-trained species model: (the 1st column holds the species names and the 2nd column holds the corresponding indexes).
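
As a rough illustration of how such a mapping file could be used, here is a minimal Python sketch. It assumes a tab-delimited file with the species name in the 1st column and the integer class index in the 2nd column, as described above. The function names (load_index_to_species, top_species) and the example species and probabilities are hypothetical and are not part of DeepMicrobes.

```python
import csv
import io


def load_index_to_species(handle):
    """Build {class index: species name} from a tab-delimited mapping.

    Assumed layout: column 1 = species name, column 2 = class index.
    Rows whose 2nd column is not an integer (e.g. a header) are skipped.
    """
    index_to_species = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        try:
            index_to_species[int(row[1])] = row[0]
        except ValueError:
            continue
    return index_to_species


def top_species(probabilities, index_to_species, k=3):
    """Return the k most probable (species name, probability) pairs."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return [(index_to_species.get(i, "unknown"), probabilities[i])
            for i in ranked[:k]]


# Made-up mapping and probabilities, for illustration only.
example_mapping = io.StringIO("Species_A\t0\nSpecies_B\t1\nSpecies_C\t2\n")
index_to_species = load_index_to_species(example_mapping)
example_probs = [0.1, 0.7, 0.2]  # stands in for one row of the prob matrix
best = top_species(example_probs, index_to_species, k=2)
print(best)
```

With a real mapping file, the io.StringIO object would be replaced by an open file handle, and example_probs by one row of the prob matrix.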


Note that you might have to train your own model according to your needs. This pre-trained model is specific to the context of the paper.

For single-end reads, you could first convert them to interleaved reverse-complement form using seqtk, but this is optional. If you do this before TFRecord conversion, edit the line containing (-1, 4, num_classes) in format_prediction.py, changing 4 to 2. If you do not, change 4 to 1.
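
To make the effect of that number concrete, here is a small NumPy sketch of the same reshape-and-average pattern that appears in the traceback above, with the group size exposed as a parameter. The function name average_read_groups and the parameter rows_per_group are hypothetical, and the probabilities are made up. This is an illustration of the idea, not the code in format_prediction.py.

```python
import numpy as np


def average_read_groups(prob_matrix, num_classes, rows_per_group):
    """Average class probabilities over consecutive groups of rows.

    rows_per_group plays the role of the 4 in (-1, 4, num_classes):
    it is the number of consecutive prediction rows that belong to
    the same original read (or read pair).
    """
    prob_matrix = np.asarray(prob_matrix, dtype=float)
    if prob_matrix.size % (rows_per_group * num_classes) != 0:
        raise ValueError(
            "cannot split %d values into groups of %d rows x %d classes"
            % (prob_matrix.size, rows_per_group, num_classes)
        )
    grouped = np.reshape(prob_matrix, (-1, rows_per_group, num_classes))
    return np.mean(grouped, axis=1)


num_classes = 3  # tiny stand-in for the 2505 classes seen in the error

# One single-end read, no reverse complement: one row per read.
single_read = np.array([[0.2, 0.5, 0.3]])
averaged_1 = average_read_groups(single_read, num_classes, rows_per_group=1)

# One read plus its reverse complement: two rows per read.
with_revcomp = np.array([[0.2, 0.5, 0.3],
                         [0.4, 0.1, 0.5]])
averaged_2 = average_read_groups(with_revcomp, num_classes, rows_per_group=2)

# A single row cannot be split into groups of 4, which is the same
# kind of failure as in the reported ValueError.
try:
    average_read_groups(single_read, num_classes, rows_per_group=4)
    failed = False
except ValueError:
    failed = True

print(averaged_1, averaged_2, failed)
```

With a group size of 1 the averaging is a no-op, so each output row is simply the prediction for one input read.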

Bartvelp commented 4 years ago

Thanks a lot for your swift response! I understand now, and it is working. As you suggested, I should retrain the model. I am again running into some issues with that, but I will open a new issue for it.