MicrobeLab / DeepMicrobes

DeepMicrobes: taxonomic classification for metagenomics with deep learning
https://doi.org/10.1093/nargab/lqaa009
Apache License 2.0

0 byte .tfrec #10

Closed animesh closed 3 years ago

animesh commented 3 years ago

Looks like tfrec_predict_kmer.sh is unable to create a proper .tfrec file; the output is 0 bytes:

(base) animeshs@DMED7596:~/ayu$ ls -ltrh
-rwxrwxrwx 1 animeshs animeshs    0 Feb 23 17:37 s13dm.tfrec

Any ideas how to proceed further?

Setup is WSL2 with Ubuntu 18.04:

(base) animeshs@DMED7596:~/ayu$ uname -a
Linux DMED7596 5.4.91-microsoft-standard-WSL2 #1 SMP Mon Jan 25 18:39:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Prerequisites I had to install (I hope they are the right ones):

git clone https://github.com/MicrobeLab/DeepMicrobes-data
sudo apt install parallel
sudo apt install seqtk

CLI

(base) animeshs@DMED7596:~/ayu$ bash DeepMicrobes/pipelines/tfrec_predict_kmer.sh  -f fastq/s13._1.fastq -r fastq/s13._2.fastq  -o s13dm -v ./DeepMicrobes-data/vocabulary/tokens_merged_12mers.txt.gz
parallel successfully detected...
seqtk successfully detected...
Starting converting fastq/s13._1.fastq and fastq/s13._2.fastq to TFRecord (mode=prediction), output will be saved in s13dm.tfrec
Parameters: kmer=12, vocab_file=./DeepMicrobes-data/vocabulary/tokens_merged_12mers.txt.gz, split_size=4000000, sequence_type=fastq
======================================
1. Interleaving R1 and R2...
======================================
2. Splitting the merged file to 4000000 sequences per file...

======================================
3. Converting to TFRecord...
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.

/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
cat: 'subset*.tfrec': No such file or directory
rm: cannot remove 'subset*.tfrec': No such file or directory
Finished.

MicrobeLab commented 3 years ago

In the manuscript I used #!/usr/bin/env python; the ‘python\r’ in your error message is not what the scripts contain. The trailing "\r" looks like a stray delimiter. Make sure that your scripts are identical to those in this repository.
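
For reference, "\r" is a carriage return. When a shebang line ends in "\r\n" instead of "\n", env is asked to run an interpreter literally named "python\r", which produces the error above. A minimal sketch for checking a local copy, assuming it is acceptable to scan every .py file under the clone directory; the function names are hypothetical and not part of DeepMicrobes:

```python
from pathlib import Path


def has_cr_shebang(path):
    """Return True if the file starts with a shebang line that ends in CRLF."""
    with open(path, "rb") as handle:
        first_line = handle.readline()
    return first_line.startswith(b"#!") and first_line.endswith(b"\r\n")


def find_cr_shebangs(root):
    """List .py files under root whose shebang line ends in a carriage return."""
    return sorted(str(p) for p in Path(root).rglob("*.py") if has_cr_shebang(p))
```

Calling find_cr_shebangs("DeepMicrobes") on the clone used in this thread would list the affected scripts; an empty list means no shebang line carries a trailing carriage return.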

animesh commented 3 years ago

Looks like the issue was the line endings used by a different OS, @MicrobeLab. Installing a converter and running it on the scripts

sudo apt install dos2unix
dos2unix DeepMicrobes/*py
dos2unix DeepMicrobes/*/*py

solved the problem! At least I have the .tfrec now 👍🏼 Is there a way to check whether the .tfrec is correct? Is it some sort of one-hot encoding of the interleaved fastq?

MicrobeLab commented 3 years ago

Try using the functions in https://github.com/MicrobeLab/DeepMicrobes/blob/master/models/input_pipeline.py to parse the contents of the .tfrec. You can tell from a script's name whether it was designed for one-hot or k-mer encoding; tfrec_predict_kmer.sh, for example, uses k-mer encoding.
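
Before decoding features, a quick structural check can confirm that the file is non-empty and not truncated. The sketch below is not the DeepMicrobes parser: it only walks the standard TFRecord framing (a little-endian uint64 length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload) and counts records without verifying the CRCs. The function name count_tfrecords is hypothetical. Decoding the actual features still needs TensorFlow and the feature definitions in input_pipeline.py, which are not assumed here.

```python
import struct


def count_tfrecords(path):
    """Count records in a TFRecord file by walking its length-prefixed framing.

    CRC fields are skipped, not verified, so this detects empty or truncated
    files but not corrupted payloads.
    """
    count = 0
    with open(path, "rb") as handle:
        while True:
            header = handle.read(12)  # 8-byte length + 4-byte CRC of the length
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header after %d records" % count)
            (length,) = struct.unpack("<Q", header[:8])
            body = handle.read(length + 4)  # payload + 4-byte CRC of the payload
            if len(body) < length + 4:
                raise ValueError("truncated record payload after %d records" % count)
            count += 1
```

For example, count_tfrecords("s13dm.tfrec") returns 0 for the 0-byte file reported at the top of this issue, and should return a positive count once the conversion succeeds.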