MicrobeLab / DeepMicrobes

DeepMicrobes: taxonomic classification for metagenomics with deep learning
https://doi.org/10.1093/nargab/lqaa009
Apache License 2.0

Error in tfrec_train_kmer.sh with a customized training set #17

Open shawnnn3di opened 2 years ago

shawnnn3di commented 2 years ago

Hi, I'm trying to run tfrec_train_kmer.sh on a training dataset I built myself, but it fails with the errors below.

(DeepMicrobes) root@bbf7145cde62:/workspace/vamb-data/airways# tfrec_train_kmer.sh -i dmtrain.fa -v /workspace/czj/tokens_merged_12mers.txt -o dmtrain.tfrec -s 2048 -k 12
parallel successfully detected...
seq-shuf successfully detected...
Starting converting dmtrain.fa to TFRecord (mode=training), output will be saved in dmtrain.tfrec
Parameters: kmer=12, vocab_file=/workspace/czj/tokens_merged_12mers.txt, split_size=2048
======================================
1. Shuffling sequences for training...
(echo -n ">"; cat <&0) | sed "s/^>/\x0>/"
======================================
2. Splitting input to 2048 sequences per file...

======================================
3. Converting to TFRecord...
Can't use 'defined(@array)' (Maybe you should just omit the defined()?) at /workspace/czj/DeepMicrobes/DeepMicrobes/bin/parallel line 119.
cat: 'subset*.tfrec': No such file or directory
rm: cannot remove 'subset*.tfrec': No such file or directory
Finished.

The first two sequences in dmtrain.fa look like this:

>S4C8|22
GTTATAATTTCCCGGCTGGATCTCCTTGAAATCATCAGACAAAATACCTCTTCTTAAAAGTTCTGCCGTGCCTGCAAAGCGAAAATCACGAAGCCCCGGATTGTCTTTTAACTGATAAATCCCATATTTATCAGTATCGCCATACAAAAGCTGTGCTTCCTTGTTTGCGCCGCTTTTTTCAAGTTCCTCCTGCATCGTCCGGAGACTTCTTTCATTTTCCCAATCGCCCTTTTCTATGCCAAAGATACCCTCATGCTCTGTGATCTGCCTCCTGTTCTTCACCGTTGTTTCCGAACCGTCCTCATGGAGCAGGTACACAGGCAGGTCACGGTCAAAAAGCTCCAATGCCCTCCCCTGTGTCAATGGCAGCATTTCATTCCATGTATAGCCGTATTCCTCCATTTCCGATAAGCCGATCATAGGGTCAGGGAGTGCGTCAATCTCTGCCTGTGCGTCAATGACCGCAAGGGCAGCCCCCTGACTGCCTTCTGTTTCCTCATAATAAATATGCTCCGCAAGCTCCCTTGTCTTTTCCATGTCGCCCAGCTTAAAGGCATAATTTACAATCAGGCTGCGCTCATCATCAGAAAAAGCGGTCCTGGCAAATTCCAGACGGTTAATAATGCCCTCCGCCTGCTTTACCGGGGAAAGGTGGCTCGTCATTACCACTTCTTTTTTATCCAGGTCAATCTCAAAATCATTAAATTGCGGCATACCGGAATTTCTGCCGCCGCTGACAATAATCGGGATATAATCAGCCCCGTAGCTTTCCAGACAGTCATGGATATTTTTTCTGTCCATACCCTCCAACTTTGTGAGCGCACCCATGATCTCCTGAACTTCCTCCGGTGTCTTATTGATGATACGTTCCACCTCAAAACCGCCTGAATGTATGATCTTCAGAATCACATCATCTGCCCCGACAGGCTCCTTTGCCCTTTCATTCCACTCCATCTCTGCCTTCAGGTCAATGATCTCATTCTTATCGGTGATATACCGGACATCAAGGTGGTATTCTCCAAATTCCCTTGTTTCCTCATTGGCAATCTCTATAGTCCATGCGCCTTTGCTTTCCAGATATGCGGCAACGCTCAGTCTGTCATCGTCATTCATGGCATAGAGTGCTTCTATGATCTCCGCCGCATTCATTCCCCTGACTTCTACAAGGCTGTACTCAGACAGATCGCTGTTCTGTATCAGCAAAAGGCTTTCTTTTTCCTGCCCCTGCATGATCTCCCGGTCACGCTGTATTTCTCTAAGCTGTTCCTCTATACCTGTGATAAGCTCCGATGCAGTCCTGCGGATCGTATCAAGGGAAGATTTCAGTTCTTTCATATCCTTTCCGCTGCTCCACCCGGCAATATACCCAAAGGAATAACCAGAAGTATCTATGCCGAAGTGCTTACAGACTGTAAACGCTATACTCTCTGCTTCAAATGTTAATATATCCATTTGAGGCCTTATACCCATGTTTTTATAATGTTGATTTTTTATATTATTTCCCTTAAATCCTCACTTTCTTGTATAAAAGATAAGCATTATCAGATACGCTATTCTCAGGTTTTCCCCTCGAATGGGAAGCCGGAAAGGAGCGCATTTGATATGCAAATAAACTATTTAGATGCTGTTTCATCAGTCCTCAATATGATGAAGCAGCCAGACAGCGCATGTAAAAATATAGACATGCACAGAACCTGTTACACCATGTTCTTCAAATACCTGATGGATAAGGGCATTCCTTTTTCAATGGATGCCGCGCTGGACTGGCTTGAGATTAAGAAACAGGAAATTTCCTATGAGACGTGTTCCCAATATAGAAATGCCCTGTTCCGACTTGAGCATTACCTGCTCTTTGGAGATATCGAAAGCCCTTTCTGCCGCTCAGAAGACAGTTTTTTCTGCCGGAGCGGGATGTCGGAATCTTTTTTCCGCCTGACATATGAGCTGGAGGAATACTATGCGGCCAGCCAGAACCCCAGCTATTACCATACGTATTCCGT
TGCCACAAAAGAGTTTTTCAAACTTGCGACTTCCCTTGGAATTACAGAGCCGGAAGCAGTCACCATAGATACTCTTATCGAATACTGGAATACTTACTGCAAATCCTGCGGCTCTCCCGTCAGACGCCAGAACGCCGTATGCGCTATGACGGCTCTTATGAAATACCTTCACCTTCGGGGTGATGTGCCGGAGTGTTATCAGCTGGTTCTTTTTGGCTGGAACGCTGAAATACTGTCTGGCATGAGGCTTTCCAAAACAGGCGCCGCATTCCATCCCAGTGTATCTCTTGAACATAAAGCTGAAGGGTATCTTGACGCCTTGGACGATTGGAAATACATGGAATCATCAAAAGCTGTTTACCGCAATGATTTCACCTGGTACTTTATGTTTTTGGAACTTAACCGCCTGGAGCATTCGGCAGAAACTGTAACTCTATTTACAGACATACTTCCGGATTGTCCGAATCAGGCCAAAGGCAGCAATCCTGTATCGGCCCGCCGTTCACACACGATCAGAATGTTTGAAAAGTATCTCCAGGGCACAATGGAATCTAATATGGCGGCTGATCCAAAGCGTGCGTCCGATCATCTTCCGTCATGGAGCAAAAGCATCCTTGATGGTTTTATAGAGAGCCGCAGGCGGGATGGTATGACGAATAATACACTTACTATGTGCAGGGCTGCCGGATGCAGTTTCTTCAAATATCTTGAAGATAATGGAATAGATTATCCGGCATACATAACACCTGATGCAGTGAAAGCATTCCATAACCATGATGTCCACTCGACCCCGGAAAGCAAAAATGCATATGGGACAAAGCTCCGTCAGCTTCTGCGTTACATGGCTGACCAGGATCTGGTCCCGCCAACCCTTGTTTTTGCAGTATCTGCAAGCTGCGCTCCCCGTCGCAGCATCGTTGATGTCCTGAGCGATGATATGGTTGGGAAAATATATGAATACCGCGACAAAGCCTCCACTCCCATAGAACTCAGAGACACAGCTATGGTTATGCTCGGGCTTCGGATGGGTATCAGGGGAGCGGACATCCTGAAGCTTCAGGTAAATGATTTTGACTGGAAAAACAAAACGGTTTCCTTCATCCAGCAGAAAACAGGAAAAGCAATCACGCTTCCAGTCCCAACAGATGTAGGTAATTCTATATATAAATACATCATGAATGGACGTCCGGAATCGGCTGCCACAGGCAGCGGATATATATTTATCCGCCATCAGGCGCCATATATTCCGCTTAAAGTCACAACGGCGTGCCGTGGGGCTTTAAAAAGAATACTTGCTGAATATGGATTTGAACTATCCGCCGGCCAGGGCTTCCATATGACACGGAAGACATTTGCCACAAGAATGCTTCGGGCAGGCAGCAAACTTGATGATATTTCCATCGCCCTCGGGCATGCACGTCCGGAAACTGCCGAGGTATATCTTGAACGTGACGAAGATAAAATGAGGCTCTGCCCTCTGGAATTTGGAGGTGTTTTGTCATGACATACATTTTTGAGAGCGGCCTGGCACATCATATCGAAGGACTCATACAGCAAAAACGGGCGGATGGATATGCCTATAATTGCGAAGAAAAGC
>S4C16|245
CGAGCAAACGAAGGCCGTACTAGAGATTCAGGCCAAGTGGAAGACTATAGGCTATGCTCGCAGAAGCGACAATGAGAAGATCTACGAGCGTTTCCGCGCAGCATGTGACGATTATTTCAATAAGAAAACAGCTTTCTTCAAAGGCAAACGTGAAGAGCTGACCGATAACTACAAGAAGAAGCTGGCCATGGTAGAAGAAGCGGAGAGCCTTCAGGAGAGTTCCGACTGGAAAGAAACCTCTACTCGCTTGGCCGAACTCCAAAAGAAATGGAAAACCATCGGAGCCGTTCCTCATCGGTATAGTGATGAGATATGGAAGCGTTTTACGACTGCATGCGATGCATTCTTCAAACGTAAAAAAGCCGAACAGGGAGATATGCGCTCCGAAGAATGCGAAAACCTGAAGAGCAAGAAAGCAATCATTGCAGAGCTTGAGACTTTGGATTCGGAAGAAGCAAGCGAGGGTATCATCGACAGGCTCAATGCTCTGGCCGGACGTTGGAATTCCATAGGCTTTGTACCGTTCAGAGAGAAGGATACTATCAACAAAGCTTACCGAAAATTGATCGATGGTCTGTACGACAAGCTGAATATCGAACGAAGCAACCGGCGCCTCGAAGGATACAATGCCTCCTTGGAACAACTGGAGGGTGGCGGCAAAGGACAGCTCTATGATGAACGTGATCGTATGACACGTATCCTCGACCGTATGCGCAACGAATTGCAGACCTATACGAACAATCTGGGTTTCCTCAATATATCCAGTAAAAGTGGGAATAGCCTGATGCGCGAAATAGAGCGCAAGAAGGAAAAGCTGGAAGAAGACATCCGTCTGATGATCGAAAAGATCAAGCTGATCGACAAGAAGGTGGAAGAGCTGAACTCTAAAGAGTAGGCTATCCCCCACTCCATCGGCAAAATAAAACCGAAGGAGAAAATAGCATTCAAGAATTGAGGTGAGCCACGAAAGTTTTATATCAGACTTTCGTGGCTCACTTCTTTTCTACTCGCTACTCATTGACAGAGTAAGAAACGCAAGGCCAAGAGATGAAAGACAGATACAAGGCTGTTTTTTATCTCGATAGCGCAACAACCAAAAGGGCTATGCTGTTTCATTTCTAAAAGGATATACCGATGAAGATAGTAATAGCGGACAGCTATGCAGCTCTACCCGGCGATTTGGACTGGAGCGGTATCGAAGAAATGGGCGAATGCGTGTTCTACGAATATACCCGTCCGGAGGATTTGACTCTGCGTGCTGTCGATGCTGAAATAGTGCTTACCAACAAGACTCCTGTGACTGCGGCCGACATGGAAAAGATGCCCCACCTACGTTACATCGGACTGATGATTACAGGCCTTAATCTTATAGATATGGATGCTGCTCGTCAGCGTGGTATCACCATAACGAACATCCCCCACTATAGCACAGAATCAGTAGCCCAAATGGCAATCTCGCATCTACTGCACATAACCATGCCGATCGGAGAACTTTCCCGGCAGGTGAAAGATGGTTGCTGGCAGAGCAATTACGAACAAATCTCTCGCAATACTTATCAGATAGAACTGAGCGGACTGACGATGGCTATCGTGGGACTTGGGGCAATAGGTACACGTGTAGCGGAAATGGCACGTGGATTCGGCATGAAGATTTTGGCACATACATCCAAATCTCCAATCGAGTTGCCTTCTTATATAGAAAAGTCCGATAGCCTGGAGAAGCTTTTCTCTCGGGCTGATGTGCTGAGTCTGCATTGCCCGCTCACAGCGCAAACCCAAAGGATGGTATCGGCTGATAGGCTGGCACTGATGAAACCGACAGCTATCCTGCTGAACATGTCCCGAGGAAGTCTGATCGATGAAAAAGCATTAGCCTCTGCCCTAAATGAAGGACGGCTCTATGCTGCAGGCTTGGACGTACTTGCGGAAGAACCTCCATGCATGGATCACCCTTTGCTTAAGGCGCGTAATTGTCACATCACGCCACATATGGGCTGGAA
TACGGATGCAGCGCGCTTGCGCCTTTCTCGGACGATCAAGGAGAATCTTCGGGCTTTCATTTCCGGTCACCCTGTCAATGTCGTTTAAGAACAGAATCCATCAAAACGATTATTTTCCGACCAATACCTTTCGAAGAATTTGACGGATTTATCCTCGATAAATCTACGTGTGTTCGA

Could you have a look and see if I've done anything wrong? Thanks!

MicrobeLab commented 2 years ago

Hi, it looks like GNU parallel is not installed correctly. Try installing it with this command: (wget -O - pi.dk/3 || curl pi.dk/3/) | bash

shawnnn3di commented 2 years ago

Thanks for your reply!

I ran wget -O - pi.dk/3 | bash since I'm on Linux. The same error still occurs after the installation completes. I'd like to solve the problem myself, but I don't really know where to start.

Many thanks in advance!

MicrobeLab commented 2 years ago

Please first check that parallel itself is installed correctly before running our scripts. Thanks.
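
One possible way to run that check is sketched below. It lists every parallel found on PATH in search order, shows which one the shell actually picks, and tests whether it starts at all. The error message above points at a copy under DeepMicrobes/bin, so an older copy earlier on PATH may be shadowing the fresh install. That is an assumption drawn from the path in the error, not something confirmed in this thread. The helper names list_on_path and check_tool are hypothetical and not part of DeepMicrobes or GNU parallel.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of DeepMicrobes or GNU parallel.

# Print every executable called "$1" on PATH, in the order the shell searches.
list_on_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            echo "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

# Return 0 if "$1" is found and "$1 --version" exits cleanly, 1 otherwise.
check_tool() {
    tool_path=$(command -v "$1" 2>/dev/null) || tool_path=
    if [ -z "$tool_path" ]; then
        echo "$1: not found on PATH"
        return 1
    fi
    echo "$1: shell resolves to $tool_path"
    if "$1" --version >/dev/null 2>&1; then
        echo "$1: --version ran cleanly"
    else
        echo "$1: found, but --version failed (broken or incompatible copy?)"
        return 1
    fi
}

# More than one line of output here means several copies compete on PATH.
list_on_path parallel

# The copy the shell picks must at least start before the script can use it.
check_tool parallel || echo "fix the parallel install before rerunning tfrec_train_kmer.sh"
```

If the first path listed is not the newly installed copy, putting the new install's directory ahead of it in PATH (or calling the working copy by its full path) is one way to test whether shadowing is the cause.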