MicrobeLab / DeepMicrobes

DeepMicrobes: taxonomic classification for metagenomics with deep learning
https://doi.org/10.1093/nargab/lqaa009
Apache License 2.0

Training on multiple genomes #35

Open Gayathri142 opened 3 months ago

Gayathri142 commented 3 months ago

Hello, I have my own dataset of about ~1500 references on which I want to train the genus model. I have simulated reads for each of them individually and am now at the .tfrec creation step. My question is: do I have to generate a .tfrec for each fasta reference I have (i.e. around ~1500 .tfrec files), or do I generate one .tfrec file by merging the simulated reads from all references into one multi-fasta file?

If I have to create one multi-fasta file covering all ~1500 references, it will become expensive in both computation and storage. If I instead create individual .tfrec files for each reference, how should the training be done? From the command mentioned below, my understanding is that only one .tfrec file is given as input for training. Should I train the model on one reference's .tfrec and, once that is done, train on the next reference's .tfrec, and so on until all 1500 are done?

Thanks.

The command mentioned on GitHub: `DeepMicrobes.py --input_tfrec=train.tfrec --model_name=attention --model_dir=/path/to/weights`

MicrobeLab commented 3 months ago

Hi, the simulated reads should be put into one fasta file and shuffled at the read level. The large fasta file can then be split into many small fasta files, and each of those converted into a small tfrec file. These tfrec files can then be provided to the model sequentially (we always set the number of epochs to one). When the next tfrec is given, the model takes in the new reads and continues training from the previous checkpoint.
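The concatenate/shuffle/split part of this recipe can be sketched with standard GNU coreutils. This is only an illustration, not a DeepMicrobes script: the file names (`reads_ref_*.fa`, `all_reads.fa`, `chunk_`) and the chunk size are hypothetical, and it assumes single-line fasta records (one header line, one sequence line per read), which is how most read simulators write output.

```shell
# 1. Concatenate the per-reference simulated reads into one file
#    (illustrative file names; adjust to your own layout).
cat reads_ref_*.fa > all_reads.fa

# 2. Shuffle at the read level: join each header with its sequence on one
#    tab-separated line, shuffle the pairs, then restore the fasta layout.
paste - - < all_reads.fa | shuf | tr '\t' '\n' > all_reads_shuffled.fa

# 3. Split into small fasta files, e.g. 4,000,000 lines (= 2,000,000 reads)
#    per chunk; each chunk is then converted to its own tfrec file.
split -l 4000000 -d --additional-suffix=.fa all_reads_shuffled.fa chunk_
```

Because the shuffle happens once over the whole read set, the later `split` does not need to be random (as confirmed below in the thread).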

Gayathri142 commented 3 months ago

Okay, please correct me if I'm wrong and provide more guidance. So, for the simulated reads from the 1500 references,

  1. they must be concatenated into one big fasta file and sequences should be shuffled?
  2. the large fasta file should be split randomly into smaller ones?
  3. split fasta files must be used for tfrec generation. So if there are 10 fasta files, 10 tfrec files must be generated?
  4. these 10 files must be provided sequentially and the model will pick up from last checkpoint?

This seems fine, but creating a multi-fasta file with reads from 1500 references will become extremely heavy in terms of storage and processing. Is there an alternative or am I missing something?

Thanks.

MicrobeLab commented 3 months ago
  1. Yes.
  2. Since the reads were already shuffled at the read level (step 1), the large fasta file does not have to be split randomly.
  3. Yes.
  4. Yes.

Sorry, I don't see an alternative.
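Putting the sequential-training step into a concrete form: a plain shell loop over the chunked tfrec files works, since each invocation resumes from the latest checkpoint in `--model_dir`. This is a sketch under that assumption; the `chunk_*.tfrec` naming is hypothetical, while the `DeepMicrobes.py` flags are the ones from the command quoted above.

```shell
# Feed the tfrec chunks to the model one after another. Each run trains for
# one epoch on its chunk and leaves a checkpoint in --model_dir, which the
# next run picks up and continues from.
for tfrec in chunk_*.tfrec; do
    DeepMicrobes.py --input_tfrec="$tfrec" \
                    --model_name=attention \
                    --model_dir=/path/to/weights
done
```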

Gayathri142 commented 3 months ago

Thanks. Also, how long did the training take for you on the 120 genus genomes, and what was the computational environment? I would like to estimate the time it might take for the references I have.

MicrobeLab commented 3 months ago

The computational environment was mentioned in the paper (a 40 GB GPU and 8 CPU cores). Training the genus model took roughly 1-2 days.