MicrobeLab / DeepMicrobes

DeepMicrobes: taxonomic classification for metagenomics with deep learning
Apache License 2.0

0 byte .tfrec #10

Status: Closed (animesh closed this issue 3 years ago)

animesh commented 3 years ago

It looks like tfrec_predict_kmer.sh is unable to create a proper .tfrec; the output file is 0 bytes:

(base) animeshs@DMED7596:~/ayu$ ls -ltrh
-rwxrwxrwx 1 animeshs animeshs    0 Feb 23 17:37 s13dm.tfrec

Any ideas on how to proceed?

The setup is WSL2 with Ubuntu 18.04:

(base) animeshs@DMED7596:~/ayu$ uname -a
Linux DMED7596 5.4.91-microsoft-standard-WSL2 #1 SMP Mon Jan 25 18:39:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Prerequisites I had to install (I hope these are the right ones):

git clone https://github.com/MicrobeLab/DeepMicrobes-data
sudo apt install parallel
sudo apt install seqtk


(base) animeshs@DMED7596:~/ayu$ bash DeepMicrobes/pipelines/tfrec_predict_kmer.sh  -f fastq/s13._1.fastq -r fastq/s13._2.fastq  -o s13dm -v ./DeepMicrobes-data/vocabulary/tokens_merged_12mers.txt.gz
parallel successfully detected...
seqtk successfully detected...
Starting converting fastq/s13._1.fastq and fastq/s13._2.fastq to TFRecord (mode=prediction), output will be saved in s13dm.tfrec
Parameters: kmer=12, vocab_file=./DeepMicrobes-data/vocabulary/tokens_merged_12mers.txt.gz, split_size=4000000, sequence_type=fastq
1. Interleaving R1 and R2...
2. Splitting the merged file to 4000000 sequences per file...

3. Converting to TFRecord...
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.

/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
cat: 'subset*.tfrec': No such file or directory
rm: cannot remove 'subset*.tfrec': No such file or directory
MicrobeLab commented 3 years ago

In the scripts I used #!/usr/bin/env python, not /usr/bin/env: ‘python\r’. The "\r" looks like a stray delimiter at the end of the shebang line. Make sure that your scripts are identical to those in this repository.
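
A trailing "\r" on the shebang line is what CRLF (Windows-style) line endings produce: env is asked to run an interpreter literally named "python\r", which does not exist. The sketch below is one way to find affected files before running the pipeline. The function names are hypothetical and not part of DeepMicrobes, and the "DeepMicrobes" directory is assumed to be the local checkout used in the commands above.

```python
from pathlib import Path


def has_crlf_shebang(path):
    """Return True if the file's first line is a shebang ending in CRLF."""
    with open(path, "rb") as handle:
        first_line = handle.readline()
    return first_line.startswith(b"#!") and first_line.endswith(b"\r\n")


def find_crlf_scripts(root, pattern="*.py"):
    """List files under `root` matching `pattern` whose shebang ends in CRLF."""
    return sorted(p for p in Path(root).rglob(pattern) if has_crlf_shebang(p))


if __name__ == "__main__":
    # Assumption: "DeepMicrobes" is the local clone of the repository.
    for script in find_crlf_scripts("DeepMicrobes"):
        print(f"CRLF shebang: {script}")
```

Any file this reports would fail with the same "No such file or directory" error when executed directly. The search is recursive, so it also covers scripts nested more than one directory deep.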

animesh commented 3 years ago

Looks like the issue was the line endings used by a different OS, @MicrobeLab. Installing a converter and running it on the scripts:

sudo apt install dos2unix
dos2unix DeepMicrobes/*py
dos2unix DeepMicrobes/*/*py

solved the problem! At least I have the .tfrec now 👍🏼 Is there a way to check whether the .tfrec is correct? Is it some sort of one-hot encoding of the interleaved fastq?

MicrobeLab commented 3 years ago

Try using the functions in https://github.com/MicrobeLab/DeepMicrobes/blob/master/models/input_pipeline.py to parse the content of the .tfrec. You can tell whether a script was designed for one-hot or k-mer encoding from its name.
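
Before parsing features, a quick sanity check is to confirm that the file contains well-formed records at all. The sketch below is not taken from DeepMicrobes. It relies only on the standard TFRecord framing (a little-endian uint64 length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload) and counts records without needing TensorFlow. The function name is hypothetical, the CRCs are skipped rather than verified, and the payloads are not decoded, so this says nothing about which encoding was used.

```python
import os
import struct


def count_tfrecord_records(path):
    """Count records in a TFRecord file by walking its length-prefixed framing.

    Raises ValueError if the file ends in the middle of a record.
    CRC fields are skipped, not checked.
    """
    size = os.path.getsize(path)
    offset = 0
    count = 0
    with open(path, "rb") as handle:
        while offset < size:
            header = handle.read(8)
            if len(header) < 8:
                raise ValueError(f"truncated length header after {count} records")
            (length,) = struct.unpack("<Q", header)
            # 8-byte length + 4-byte length CRC + payload + 4-byte payload CRC
            end = offset + 8 + 4 + length + 4
            if end > size:
                raise ValueError(f"truncated record after {count} records")
            handle.seek(end)
            offset = end
            count += 1
    return count
```

Calling count_tfrecord_records("s13dm.tfrec") on the 0-byte file from the original report would return 0, while a ValueError would point to a conversion that was interrupted partway through.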