MicrobeLab / DeepMicrobes

DeepMicrobes: taxonomic classification for metagenomics with deep learning
https://doi.org/10.1093/nargab/lqaa009
Apache License 2.0

Question about training a DeepMicrobes model to predict the taxonomy from phylum to species #26

Open Steven-GUHK opened 1 year ago

Steven-GUHK commented 1 year ago

Hi! I have read the DeepMicrobes paper and it is great work! I'm planning to train the DeepMicrobes model using my data and I want it to predict the taxonomy of a given DNA sequence from phylum to species (six ranks in total).

My data looks like this:

TaxID_1.fasta
>sequence_0
ATCG...
>sequence_1
ATCG...
...
>sequence_n
ATCG...

...

TaxID_n.fasta
>sequence_0
ATCG...
>sequence_1
ATCG...
...
>sequence_n
ATCG...

Where each fasta file represents one species and the TaxID is the taxonomy ID in the NCBI database. Each fasta file may contain many DNA sequences. I have read the instructions on how to convert fasta sequences to TFRecord. However, I'm confused about 'The script parses category labels from sequence IDs starting with prefix|label (e.g., >this_is_prefix|0).' I wonder what the prefix is, why the number 0 can be a label, and what they correspond to in my data.

Another question: do I need to train six different models to predict the six different ranks of taxonomy? For example, one model for phylum, one for class, one for order, etc.

Thank you very much if you could give me some suggestions!

MicrobeLab commented 1 year ago

Hi!

The script parses the number placed in the header after the first "|". The prefix can be any word; it is only there to help humans distinguish the sequences (the script ignores it). The number is the category index (Python counts from zero). If you have, let's say, 5 phyla, the numbers should be assigned from 0 to 4, and the order of the phyla can be arbitrary.
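For example, with 5 phyla the labeled headers could look like this (the prefix text and read names here are hypothetical; only the number after "|" matters):

>genomeA_read_0|0
ATCG...
>genomeB_read_0|4
ATCG...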

Yes, you should train a model for each rank.

Steven-GUHK commented 1 year ago

Thanks for your reply! I followed the instructions in How to train the DNN model of DeepMicrobes. I have labeled my data with fna_label.py and I'm moving on to Read Simulation and TFRecord conversion.

I think it is OK to have some repeated labels, since different species may belong to the same phylum. For example:

(screenshot: list of labeled fasta files, several of which share the same label index)

For Read Simulation, how big an effect does it have on the result? My data consists of complete genomes, but it seems that I need the corresponding read-type data.

For TFRecord conversion, I have successfully run the command. Since I have many fasta files, do I need to concatenate them into one file before the TFRecord conversion? It seems that the training command DeepMicrobes.py --input_tfrec=train.tfrec --model_name=attention --model_dir=/path/to/weights only accepts a single train.tfrec file.

Thank you!

MicrobeLab commented 1 year ago

You will need to perform read simulation to break the complete genomes into short fragments, since the model takes short reads as input during prediction.

The TFRecord files should be generated from a shuffled fasta, which means that, let's say, the first 2048 reads (the batch size) should include reads that belong to different categories.
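For illustration (the read names and label indices here are hypothetical), a well-shuffled training fasta mixes categories from the very first reads:

>read_000|17
ATCG...
>read_001|3
ATCG...
>read_002|24
ATCG...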

Steven-GUHK commented 1 year ago

I'm a little confused. Here is the fasta file I got after running fna_label.py. I have 1k files like it.
label_166.SAMN16347101__specI_v4_01407.fasta.zip

Is the Read simulation step compulsory in preparing training data? If yes, what should be the input file for random_trim.py?

What should be the input for tfrec_train_kmer.sh? The command tfrec_train_kmer.sh -i train.fa -v /path/to/vocab/tokens_merged_12mers.txt -o train.tfrec -s 20480000 -k 12 only accepts a single fasta file as input, but I have 1k files. The same goes for random_trim.py -i input_fastq -o output_fasta -f fastq -l 150 -min 0 -max 75, which accepts one input.

Therefore, I think I should concatenate the 1k files into one fasta file before doing these two steps.

Is my understanding correct?

MicrobeLab commented 1 year ago

The read simulation step is compulsory. Otherwise, the read-length ranges in training and prediction will not match. Yes, you should concatenate all the simulated reads.
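For example, assuming the simulated reads for all genomes end up in files named like sim_*.fa (a hypothetical naming scheme), the concatenation can be a single command:

cat sim_*.fa > train.fa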

Steven-GUHK commented 1 year ago

Thanks for your patience. It was my mistake: I only looked at the GitHub steps and forgot that you mentioned using ART to simulate the reads.

According to the paper, I'm trying to use this command to simulate reads:

art_illumina -ss HS25 -i input.fasta -p -l 150 -f 1 -m 400 -s 50 -o paired_data

I'm not sure about the -f. I saw your answer in https://github.com/MicrobeLab/DeepMicrobes/issues/23#issuecomment-1430689531, but I'm not sure about the total number of reads I need, so I just set -f 1 for every genome.

After this, I obtained read1 and read2 files for every genome. I wonder whether I need to shuffle the data before running tfrec_train_kmer.sh. Since I see that there is a shuffle function in tfrec_train_kmer.sh, I guess I can simply concatenate all the reads together.

MicrobeLab commented 1 year ago

It is recommended to first decide on a total number of reads for each class, for class-balance purposes, and then calculate -f from it. The exact number of reads depends on your dataset.
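As a sketch of that calculation (the target of 1,000,000 read pairs per genome and the file names are hypothetical): -f in art_illumina is fold coverage, and for 150 bp paired-end reads the number of pairs is roughly coverage * genome_length / (2 * 150), so the coverage for a fixed read target can be computed per genome:

# total bases in the genome
GENOME_LEN=$(grep -v '^>' genome.fasta | tr -d '\n' | wc -c)
# fold coverage that yields ~1,000,000 read pairs of 2 x 150 bp
FOLD=$(echo "1000000 * 2 * 150 / $GENOME_LEN" | bc -l)
art_illumina -ss HS25 -i genome.fasta -p -l 150 -f $FOLD -m 400 -s 50 -o sim_genome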

Yes, the script will do the shuffle work.

Steven-GUHK commented 1 year ago

Thanks for your advice. I have successfully run the training step using the command:

nohup python DeepMicrobes.py --input_tfrec=train.tfrec --model_name=attention --model_dir=weights_phylum --num_classes=36 --train_epochs=1 > train.log 2> train-error.log &

I set --num_classes=36 because I have 36 classes for the phylum classification. However, I'm curious about --train_epochs. The default value is 1 (is that too small?). I would also like to know when the training will stop, because it doesn't show how many steps are left. My current training output looks like this:

(screenshot: training log output, including '_save_checkpoints_steps': 100000)

I see '_save_checkpoints_steps': 100000 in the log. Does that mean every epoch will have 100000 steps?

Also, for testing, I have separate testing data. The data preprocessing steps for testing are similar to training: label assignment, read simulation, and conversion to TFRecord. The difference is that the simulated reads do not need to be trimmed. Also, I need to concatenate all the forward reads into one file and all the reverse reads into another, and then convert these two files to TFRecord.

MicrobeLab commented 1 year ago

Our training scheme is that if more training steps are needed, we simulate more reads (rather than repeatedly training on the same data), so the number of epochs is 1. The number of steps for a specific dataset is calculated as total_number_of_reads / batch_size. The total number of training steps required varies widely between circumstances.
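For example, with 20,480,000 simulated reads and a batch size of 2048, one epoch corresponds to 20,480,000 / 2048 = 10,000 training steps.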

Yes, you are correct, except that the forward and reverse reads are interleaved rather than concatenated.
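For illustration (hypothetical read names), an interleaved file alternates the two mates of each pair:

>read_0/1
ATCG...
>read_0/2
ATCG...
>read_1/1
ATCG...
>read_1/2
ATCG...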

Steven-GUHK commented 1 year ago

Therefore, if I want to reduce the training time, I can set a larger batch_size, making sure that the total number of steps exceeds the value of _save_checkpoints_steps. The model's weights won't be saved unless the step count reaches an integer multiple of _save_checkpoints_steps, right?

For the test data, what I see is:

(screenshot: simulated forward and reverse read files for each test genome)

So I concatenate all the forward reads into one file (sample_R1) and all the reverse reads into another (sample_R2). I guess tfrec_predict_kmer.sh will convert them to the interleaved format?

MicrobeLab commented 1 year ago

You may modify the code yourself to set the frequency of checkpoint saving. The batch size is a hyperparameter that can affect model performance.

Yes, the script does that work.

Steven-GUHK commented 11 months ago

Another question: I saw you mention that the recommended batch size for training on thousands of species is 2048 or 4096, and that a lower value should be tried when training on fewer classes. I have 15000+ complete genomes covering 30+ phyla, 130+ classes, 330+ orders, 870+ families, 4200+ genera, and 15000+ species. I wonder what the proper batch size is for training each of these six ranks. Thank you!

MicrobeLab commented 11 months ago

GPU memory is a major restriction on the batch size. In theory, using a large batch size should help when training on many classes, but a batch size that is too large will run out of memory and make each batch very slow. My intuition (you will need to experiment) is that for the 4200+ genera and the higher taxonomic levels, 4096 should be okay, and for the 15000+ species, you may start with the largest batch size that fits into memory.