2 - QC host filtering

kmhandley commented 3 months ago

I ran this, but when I compared the reads before and after filtering, the files had the same line count, i.e., no reads were filtered out.

Did we have an animal to filter out in the first place? If not, the exercise doesn't make a lot of sense.

I strongly recommend running the script on only a single test fastq pair that actually contains host/eukaryotic reads, and adding a wc check line.

for i in {1..4}; do wc -l sample${i}_R1.fastq; wc -l sample${i}_R1_hostFilt.fastq; done

2199968 sample1_R1.fastq
2199968 sample1_R1_hostFilt.fastq
2199964 sample2_R1.fastq
2199964 sample2_R1_hostFilt.fastq
2199968 sample3_R1.fastq
2199968 sample3_R1_hostFilt.fastq
2199988 sample4_R1.fastq
2199988 sample4_R1_hostFilt.fastq
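
As a possible form for the suggested check line, here is a minimal sketch that reports read counts rather than raw line counts and prints how many reads were removed. It assumes standard four-line FASTQ records (no wrapped sequence lines), so 2199968 lines corresponds to 549992 reads. The helper names count_reads and report_filtering are hypothetical and not part of the workshop scripts; the file names follow the pattern used in the loop above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the workshop material.

# count_reads FILE: print the number of reads in a FASTQ file.
# Assumes 4 lines per record; reads .gz files through gzip -dc.
count_reads() {
    local file="$1" lines
    if [[ "$file" == *.gz ]]; then
        lines=$(gzip -dc "$file" | wc -l)
    else
        lines=$(wc -l < "$file")
    fi
    echo $(( lines / 4 ))
}

# report_filtering BEFORE AFTER: print both read counts and the difference.
report_filtering() {
    local before after
    before=$(count_reads "$1")
    after=$(count_reads "$2")
    echo "$1: ${before} reads; $2: ${after} reads; removed: $(( before - after ))"
}

# Same sample naming as the wc loop above; pairs that are missing are skipped.
for i in {1..4}; do
    if [[ -f "sample${i}_R1.fastq" && -f "sample${i}_R1_hostFilt.fastq" ]]; then
        report_filtering "sample${i}_R1.fastq" "sample${i}_R1_hostFilt.fastq"
    fi
done
```

A "removed: 0" result on every sample would flag the situation described here, where filtering ran but had nothing to remove.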

kmhandley commented 3 months ago

PS a subset of a pair of Waiwera fastq files from the brackish zone could work well.

JSBoey commented 3 months ago

The Waiwera sequences did not have any appreciable level of human reads. (yay!) Trying a different approach: simulating HiSeq reads from a human genome and adding them to the existing mock metagenomes as a separate read set.

JSBoey commented 3 months ago

The simulated "contamination" worked. The new example consists of one paired-end library named human_microb_reads.R{1,2}.fastq.gz, with about 9k reads per file from a human genome.

GenomicsAotearoa / metagenomics_summer_school

2 - QC host filtering #61