Option to bypass host filter step?

DGleason-680 commented 7 months ago

I'm encountering several issues running the pipeline on my paired-end reads. Is there an option to bypass the host filter step? My samples are environmental so no host to filter - I think this may be the reason my samples are not running successfully, but not confident that's the only issue.

Trying to troubleshoot through the sample_err.txt and sample_out.txt files, and I have several issues to address:

Illegal unprintable character in FASTQ file IS THIS DETRIMENTAL TO THE SUCCESS OF THE RUN?
Missing files/directories: _OTHER THAN THE Mousecds.fasta FILE, ARE THE REST RELATED TO NOT HAVING A HOST?
- Mouse_cds.fasta - NOT SURE WHERE THIS COMES FROM OR WHY IT'S A MISSING FILE
- host_contaminents_seq.fasta
- singletons_no_host.fastq
- pair_1_no_host.fastq
- pair_2_no_host.fastq
- vector_contaminants_seq.fasta
- singletons_no_vectors.fastq
- pair_1_no_vectors.fastq
- pair_2_no_vectors.fastq
Errors in BLAT file paths:
- The pipeline is unable to find certain BLAT output files like singletons_no_host.blatout, pair_1_no_host.blatout, singletons_no_vectors.blatout, pair_1_no_vectors.blatout, etc.
- COULD THIS ALL BE RELATED TO NO HOST ALSO?
Python script errors:
- There are Python script errors like FileNotFoundError and ModuleNotFoundError, indicating issues with file paths or missing modules.
- _ARE THE AFFECTED SCRIPTS (read_BLAT_filter_v3.py, bwa_read_sorter.py, readsplit.py) CORRECTLY ACCESSING THE REQUIRED FILES AND MODULES? IS THIS RELATED TO NOT HAVING A HOST?

My config.ini file is as follows:

[Databases]
database_path = /home/dvan/scratch/MetaPro/dbs/
UniVec_Core = %(database_path)s/univec_core/UniVec_Core.fasta
Adapter = %(database_path)s/trimmomatic_adapters/TruSeq3-PE-2.fa
Rfam = %(database_path)s/Rfam/Rfam.cm
DNA_DB = %(database_path)s/chocophlan_1/chocophlan_full.fasta
DNA_DB_Split = %(database_path)s/chocophlan_split/
Prot_DB = %(database_path)s/nr/nr
Prot_DB_reads = %(database_path)s/nr/nr
accession2taxid = %(database_path)s/accession2taxid/accession2taxid
nodes = %(database_path)s/WEVOTE_db/nodes_wevote.dmp
names = %(database_path)s/WEVOTE_db/names_wevote.dmp
Kaiju_db = %(database_path)s/kaiju_mine/kaiju_db_nr.fmi
Centrifuge_db = %(database_path)s/centrifuge_db/nt
SWISS_PROT = %(database_path)s/swiss_prot_db/
SWISS_PROT_map = %(database_path)s/swiss_prot_db/SwissProt_EC_Mapping.tsv
PriamDB = %(database_path)s/PRIAM_db/
DetectDB = %(database_path)s/DETECTv2/
WEVOTEDB = %(database_path)s/WEVOTE_db/
taxid_tree = %(database_path)s/taxid_trees/class_tree.tsv
kraken2_db = %(database_path)s/kraken2_db/
EC_pathway = %(database_path)s/EC_pathway/EC_pathway.txt
path_to_superpath = %(database_path)s/path_to_superpath/pathway_to_superpathway.csv
MetaGeneMark_model = /pipeline_tools/mgm/MetaGeneMark_v1.mod

[Tools]
Python = python3
Java = java -jar
cdhit_dup = /pipeline_tools/cdhit_dup/cd-hit-dup
Timmomatic = /pipeline_tools/Trimmomatic/trimmomatic-0.36.jar
AdapterRemoval = /pipeline_tools/adapterremoval/AdapterRemoval
vsearch = /pipeline_tools/vsearch/vsearch
Flash = /pipeline_tools/FLASH/flash
BWA = /pipeline_tools/BWA/bwa
SAMTOOLS = /pipeline_tools/samtools/samtools
BLAT = /pipeline_tools/PBLAT/pblat
DIAMOND = /pipeline_tools/DIAMOND/diamond
Blastp = /pipeline_tools/BLAST_p/blastp
Barrnap = /pipeline_tools/Barrnap/bin/barrnap
#note: needle is quite slow.  This argument can be swapped for stretcher, but stretcher is not as accurate
Needle = /pipeline_tools/EMBOSS-6.6.0/emboss/needle
Blastdbcmd = /pipeline_tools/BLAST_p/blastdbcmd
Makeblastdb = /pipeline_tools/BLAST_p/makeblastdb
Infernal = /pipeline_tools/infernal/cmscan
Kaiju = /pipeline_tools/kaiju/kaiju
Centrifuge = /pipeline_tools/centrifuge/centrifuge
Priam = /pipeline_tools/PRIAM_search/PRIAM_search.jar
Detect = /pipeline/Scripts/Detect_2.2.9.py
BLAST_dir = /pipeline_tools/BLAST_p
WEVOTE = /pipeline_tools/WEVOTE/WEVOTE
Spades = /pipeline_tools/SPAdes/bin/spades.py
MetaGeneMark = /pipeline_tools/mgm/gmhmmp

#default is 30, to coincide with cdhit_dup's default.  
[Settings]
AdapterRemoval_minlength = 30
#bypass_log_name = bypass_log_0.1.txt

Show_unclassified = No
RPKM_cutoff = 0.01
BWA_cigar_cutoff = 90
BLAT_identity_cutoff = 85
BLAT_length_cutoff = 0.65
BLAT_score_cutoff = 60
DIAMOND_identity_cutoff = 85
DIAMOND_length_cutoff = 0.65
DIAMOND_score_cutoff = 60

#determines how much memory should be free for use with each tool.  [1-99]
BWA_mem_threshold = 75
BLAT_mem_threshold = 75
DIAMOND_mem_threshold = 80
BWA_pp_mem_threshold = 30
BLAT_pp_mem_threshold = 75
DIAMOND_pp_mem_threshold = 80
DETECT_mem_threshold = 80
Infernal_mem_threshold = 75
Barrnap_mem_threshold = 75

#maximum allowable concurrent instances of the tool [1-number of cores on your machine]
BWA_job_limit = 40
BLAT_job_limit = 40
DIAMOND_job_limit = 40
BWA_pp_job_limit = 40
BLAT_pp_job_limit = 40
DIAMOND_pp_job_limit = 40
DETECT_job_limit = 40
Infernal_job_limit = 40
Barrnap_job_limit = 40

#determines how long (seconds) each job should wait before they are launched (imposes delay.  for memory-use-detection)
BWA_job_delay = 0.5
BLAT_job_delay = 5
DIAMOND_job_delay = 5
BWA_pp_job_delay = 0.01
BLAT_pp_job_delay = 0.05
DIAMOND_pp_job_delay = 5

#toggle these options to delete the interim data the steps will generate
keep_all = yes
keep_quality = no
keep_host = no
keep_vector = no
keep_rRNA = no
keep_repop = no
keep_assemble_contigs = yes
keep_GA_BWA = no
keep_GA_BLAT = no
keep_GA_DIAMOND = no
keep_GA_final = no
keep_TA = no
keep_EC = no
keep_outputs = no

#Data gets chunked in the pipeline.  Control the chunk size to reduece the number of files generated.  Increase for speed, and reduced memory load per tool-run [1-99999]
rRNA_chunk_size = 50000
GA_chunk_size = 10000
EC_chunk_size = 1000

#Decides the lossiness of host + vector filtration.  For paired-end annotation only.  [high/low] high: only if both pairs pass through the filter.  low: if either pair passes through filter, both pairs pass
filter_stringency = high

TA_mem_threshold = 80
TA_job_delay = 10

[Labels]
GA_pre_scan = order_1_GA_pre_scan
GA_BWA = order_test_GA_BWA_1
GA_BLAT = order_test_GA_BLAT_1
host_filter = host_read_filter
vector_filter = vector_read_filter
ta = order_ta_1
#GA_DIAMOND = order_test_GA_dmd_1

I would appreciate any advice on this! Really want to see what this pipeline can do.

Thanks, Danielle

billytaj commented 7 months ago

to bypass the host filter, use the option "--no_host" in your run.

billytaj commented 7 months ago

my bad: --no-host

hyphen, not underscore.

DGleason-680 commented 6 months ago

Yes this worked - thank you.

ParkinsonLab / MetaPro

Option to bypass host filter step? #23