Error in tfrec_train_kmer.sh with SILVA 138.1 SSU database as training set

oschakoory commented 2 years ago

Hi, I would like to train the network with the SILVA 138.1 SSU database using:

tfrec_train_kmer.sh -i SILVA_138.1_SSURef_NR99_tax_silva.fasta -v /vocabulary/tokens_merged_12mers.txt -o train.tfrec -s 20480000 -k 12

However, I am getting the following error:

parallel successfully detected...
seq-shuf successfully detected...
Starting converting SILVA_138.1_SSURef_NR99_tax_silva.fasta to TFRecord (mode=training), output will be saved in train.tfrec
Parameters: kmer=12, vocab_file=/vocabulary/tokens_merged_12mers.txt, split_size=20480000

1. Shuffling sequences for training...
(echo -n ">"; cat <&0) | sed "s/^>/\x0>/" 

2. Splitting input to 20480000 sequences per file...

3. Converting to TFRecord...
INFO:tensorflow:Processing training/eval set
INFO:tensorflow:Parsing vocabulary
Traceback (most recent call last):
  File "/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 243, in <module>
    main()
  File "/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 230, in main
    training_set_convert2tfrecord(input_seq, output_tfrec, kmer, vocab, seq_type)
  File "/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 120, in training_set_convert2tfrecord
    seq, label_id = training_set_read_parser(rec)
  File "/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 56, in training_set_read_parser
    label_id = int(identifier.split('|')[1])
IndexError: list index out of range
Finished.

Can you help me please?

Thank you.

Originally posted by @oschakoory in https://github.com/MicrobeLab/DeepMicrobes/issues/17#issuecomment-1143659382

MicrobeLab commented 2 years ago

Hi, you did not provide a label file.

oschakoory commented 1 year ago

Hi, there is no option to provide a label file to tfrec_train_kmer.sh.

Is there another way to train the NN with this database?

Thank you for your help.

MicrobeLab commented 1 year ago

Hi,

The training_set_read_parser function in seq2tfrec_kmer.py parses each training/eval read after it has been loaded with Biopython. It assumes the taxon id is present in the read name, as the second field when the name is split on '|'. For example, for a read named >NC_018018.1|999|GCF_000265505.1-200000, 999 is parsed as its species taxon id.

Please check whether the taxon ids have been included in read names.
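
To make the failure concrete, here is a minimal sketch that mimics the line shown in the traceback (`int(identifier.split('|')[1])`). The function name `parse_label_id` is made up for illustration, and taking the identifier as the first whitespace-delimited token of the header is an assumption about how the read name reaches the parser; the real script may differ in detail.

```python
def parse_label_id(header):
    """Return the integer in the second '|'-separated field of a FASTA header.

    Hypothetical helper: it only imitates the line from the traceback,
    it is not the code from seq2tfrec_kmer.py.
    """
    # Assumption: the identifier is the header text up to the first space.
    identifier = header.lstrip(">").split()[0]
    return int(identifier.split("|")[1])


# Header in the format the parser expects: the label is the second field.
print(parse_label_id(">NC_018018.1|999|GCF_000265505.1-200000"))

# A SILVA-style header has no '|' in its identifier, so the split yields a
# single-element list and indexing [1] fails, as in the reported error.
silva_header = (
    ">MF461073.1.1202 Bacteria;Proteobacteria;Gammaproteobacteria;"
    "Enterobacterales;Aeromonadaceae;Aeromonas;Aeromonas sp."
)
try:
    parse_label_id(silva_header)
except IndexError as err:
    print("IndexError:", err)
```

In other words, the error most likely comes from the header format of the input FASTA rather than from the command-line options.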

oschakoory commented 1 year ago

The SILVA database looks like this:

>MF461073.1.1202 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Aeromonadaceae;Aeromonas;Aeromonas sp.
GUGCCAUGCGGCAGCUACACAUGCAGUCGAGCGGCAGCGGGAAAGUAGCUUGCUACUUUUGCCGGCGAGCGGCGGACGGG
>JQ063432.1.1464 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Pectobacteriaceae;Sodalis;endosymbiont of Columbicola koopae
AUUGAACGCUGGCGGCAGGCCUAACACAUGCAAGUUGAGCGGCAGCGGGAAGAGGCUUGCUUCUUUGCCGGCGAGCGGCG
>EF216903.1.2125 Eukaryota;Amorphea;Amoebozoa;Discosea;Flabellinia;Dactylopodida;Neoparamoeba;Paramoeba perurans
ACCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUUAAAGACUAAGCCAUGCACGUCUAAGUAUAAACACUUUGUACUU

In this case, how should I generate the label file?

Thank you for such a quick response.

MicrobeLab commented 1 year ago

Please refer to: https://github.com/MicrobeLab/DeepMicrobes/blob/master/document/train.md
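
For readers with the same SILVA input, the following is a rough sketch of one way to rewrite the headers so that an integer label sits in the second '|'-separated field. It is not the procedure from train.md. The function `relabel_silva`, the numbering scheme (one integer per distinct taxonomy string, in order of first appearance), and the U-to-T conversion are all assumptions made for illustration; check train.md for the label values the model actually expects.

```python
def relabel_silva(lines, convert_rna=True):
    """Rewrite SILVA-style FASTA lines so each header reads >accession|label_id.

    Hypothetical helper. Returns (new_lines, label_map), where label_map
    maps each taxonomy string to the integer assigned to it here.
    """
    label_map = {}
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # SILVA headers: accession, one space, then the taxonomy path.
            accession, _, taxonomy = line[1:].partition(" ")
            # Assumed scheme: a new integer for each taxonomy not seen before.
            label_id = label_map.setdefault(taxonomy, len(label_map))
            out.append(f">{accession}|{label_id}")
        elif line:
            # Assumption: the k-mer vocabulary is DNA, so RNA 'U' becomes 'T'.
            out.append(line.upper().replace("U", "T") if convert_rna else line)
    return out, label_map


sample = [
    ">MF461073.1.1202 Bacteria;Proteobacteria;Gammaproteobacteria;"
    "Enterobacterales;Aeromonadaceae;Aeromonas;Aeromonas sp.",
    "GUGCCAUGCGGCAGC",
]
new_lines, label_map = relabel_silva(sample)
print("\n".join(new_lines))
```

Keeping `label_map` around makes it possible to translate predicted labels back to taxonomy strings later. Whether labels should be assigned per full taxonomy path, per genus, or some other rank is a modeling choice that this sketch does not settle.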

Error in tfrec_train_kmer.sh with SILVA 138.1 SSU database as training set #19