bede / hostile

Precise host read removal

Mode: paired short read (Bowtie2) fails when index is provided as reference fasta file #38

Closed Rohit-Satyam closed 5 months ago

Rohit-Satyam commented 5 months ago

Thanks for providing us with a hassle-free and fast dehosting tool.

I am, however, running into an issue when using paired-end (PE) fastq files with a custom reference fasta file for Bos taurus. For ONT fastq files, hostile automatically indexes the fasta file, but the same is not true in PE Bowtie2 mode. Can this be implemented?

hostile clean --fastq1 SRR27845761_1.fastq.gz --fastq2 SRR27845761_2.fastq.gz --threads 10     --index Bos_taurus.ARS-UCD1.3.dna.toplevel.fa
10:37:37 INFO: Hostile version 1.1.0. Mode: paired short read (Bowtie2)
Traceback (most recent call last):
  File "/home/subudhak/miniconda3/envs/serotyper/bin/hostile", line 10, in <module>
    sys.exit(main())
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/cli.py", line 154, in main
    defopt.run(
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/defopt.py", line 356, in run
    return call()
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/cli.py", line 68, in clean
    stats = lib.clean_paired_fastqs(
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/lib.py", line 225, in clean_paired_fastqs
    index_path = aligner.value.check_index(index, offline=offline)
  File "/home/subudhak/miniconda3/envs/serotyper/lib/python3.10/site-packages/hostile/aligner.py", line 61, in check_index
    raise FileNotFoundError(message)
FileNotFoundError: Bos_taurus.ARS-UCD1.3.dna.toplevel.fa is neither a valid custom index path nor a valid standard index name

bede commented 5 months ago

Hi! Thanks for drawing attention to this. Custom references do work in Bowtie2 mode today, but this should be better documented.

Whereas in Minimap2 (long read) mode the index parameter accepts a path to a genome fasta file, in Bowtie2 (short read) mode it accepts a path to a precomputed Bowtie2 index, given as the shared basename of the index files without the .x.bt2 suffixes (.1.bt2, .rev.1.bt2, and so on).

So you'll need to build a Bowtie2 index first (beware: for a human genome this takes 30 minutes or so):

bowtie2-build Bos_taurus.ARS-UCD1.3.dna.toplevel.fa bostaurus

And then pass the index basename, rather than the fasta file, to hostile:

hostile clean --fastq1 SRR27845761_1.fastq.gz --fastq2 SRR27845761_2.fastq.gz --threads 10 --index bostaurus

This annoying implementation detail is normally hidden when using standard indexes.

I've updated the README to make this behaviour clearer.
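
For anyone scripting this, the two steps above can be wrapped so that the index is only built when it is missing. This is a minimal sketch, not part of hostile: the function names `bt2_index_exists` and `dehost_paired` are hypothetical, and the existence check is a heuristic that only looks for the first index file (`.1.bt2`, or `.1.bt2l` for the large-index format), which may differ from how hostile itself validates a custom index. The `hostile clean` flags are the ones used in this thread; `bowtie2-build` is assumed to be on `PATH`.

```shell
# Heuristic: treat the index as present if its first file exists.
# bowtie2-build writes <basename>.1.bt2 (or <basename>.1.bt2l for large indexes).
bt2_index_exists() {
  [ -f "$1.1.bt2" ] || [ -f "$1.1.bt2l" ]
}

# Hypothetical wrapper (not a hostile command):
#   dehost_paired <reference.fa> <index_basename> <reads_1> <reads_2> [threads]
dehost_paired() {
  fasta="$1"; prefix="$2"; r1="$3"; r2="$4"; threads="${5:-4}"

  # Build the Bowtie2 index once; later runs reuse it.
  if ! bt2_index_exists "$prefix"; then
    bowtie2-build --threads "$threads" "$fasta" "$prefix" || return 1
  fi

  # Pass the index basename, not the fasta file, to hostile.
  hostile clean --fastq1 "$r1" --fastq2 "$r2" --threads "$threads" --index "$prefix"
}
```

With the files from this issue, the call would look like `dehost_paired Bos_taurus.ARS-UCD1.3.dna.toplevel.fa bostaurus SRR27845761_1.fastq.gz SRR27845761_2.fastq.gz 10`.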