Gayathri142 opened 3 months ago
Hi, the simulated reads should be put in one FASTA file and shuffled. The large FASTA file can then be split into many small FASTA files, and thereby many small tfrec files. These tfrec files can be provided to the model sequentially (the number of epochs is always set to one). When the next tfrec is given, the model takes in the new reads and continues training from the previous checkpoint.
Okay, please correct me if I'm wrong and provide more guidance. For the reads simulated from the 1500 references, this approach seems fine, but creating a multi-FASTA file with reads from 1500 references will become extremely heavy in terms of storage and processing. Is there an alternative, or am I missing something?
Thanks.
Sorry, I don't know of an alternative.
Thanks. Also, how long did training take for you on the 120 genus genomes, and what was the computational environment? I would like to estimate the time it might take for the references I have.
The computational environment was mentioned in the paper (40 GB GPU, 8 CPU cores). Training the genus model took roughly 1-2 days.
Hello, I have my own dataset of roughly 1500 references that I want to train the genus model on. I have simulated reads for each of them individually and am now at the .tfrec creation step. My question is: do I have to generate a .tfrec for each FASTA reference (i.e. around 1500 .tfrec files), or generate one .tfrec file by merging the simulated reads from all references into one multi-FASTA file?
If I have to create a multi-FASTA file covering all ~1500 references, it will become expensive in computation and storage. If I instead create an individual .tfrec file for each reference, how should the training be done? From this command, my understanding is that only one .tfrec file should be given as input for training. So should I train the model on one reference's .tfrec and, once that is done, train on the next reference's .tfrec, and so on until all 1500 are done?
Thanks.
The command mentioned on GitHub: `DeepMicrobes.py --input_tfrec=train.tfrec --model_name=attention --model_dir=/path/to/weights`