biocore / tcga

Microbial analysis in TCGA data
BSD 3-Clause "New" or "Revised" License

Which script can remove the human sequences contained in the BAM files? #26

Closed: wanglu2014 closed this issue 2 years ago

wanglu2014 commented 2 years ago

While reproducing your analysis, I cannot figure out how to remove the human sequences contained in the BAM files, as described in your paper. [screenshot] Could you kindly suggest a method? Thank you for your time.

gregpoore commented 2 years ago

Hi there, as written in that snippet you posted, we first extracted unmapped (i.e., non-human) reads from the pre-computed bam files prior to profiling them with Kraken. For this paper, we used the samtools view -f 4 command. However, since then we have updated our process to do another host depletion step after the initial extraction, using the piped command below:

samtools view -f 4 -O BAM $in_dir/$filename | 
samtools bam2fq - |
fastp -l 45 --stdin -w $cpus --stdout --interleaved_in | 
minimap2 -ax sr -t $cpus $db - |
samtools fastq -@ $cpus -f 12 -F 256 - -1 $out_dir/$base_name.R1.trimmed.fastq.gz -2 $out_dir/$base_name.R2.trimmed.fastq.gz

where $in_dir/$filename denotes the input file, $cpus the number of cores for the job, $db a minimap2 .mmi index containing the GRCh38 human genome and the PhiX viral genome, and $out_dir/$base_name.R1.trimmed.fastq.gz the output host-depleted forward-read file (R2 is the reverse). In the final step, -f 12 keeps only read pairs in which both mates are unmapped, and -F 256 excludes secondary alignments. Note also that this piped command imposes a hard minimum read length of 45 bp (fastp -l 45), which may not be optimal for all studies.

Nonetheless, if you just want to follow the protocol in the original Nature paper, you can just use the samtools view -f 4 command to extract the unmapped reads from the pre-computed TCGA bam files.
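To make that original-paper route concrete, here is a minimal sketch. The function name, paths, and output naming are illustrative assumptions, not part of the repo; only the `samtools view -f 4` extraction itself comes from the answer above.

```shell
#!/usr/bin/env bash
# Hypothetical helper sketching the original-paper protocol: extract the
# unmapped (non-human) reads from a pre-computed TCGA BAM and convert
# them to paired FASTQ files for Kraken. Names are illustrative only.
extract_unmapped() {
  local in_bam=$1 out_prefix=$2
  # -f 4 keeps reads with the "unmapped" SAM flag set; -b emits BAM
  samtools view -b -f 4 "$in_bam" > "${out_prefix}.unmapped.bam"
  # Split the unmapped reads into forward/reverse FASTQ files
  samtools fastq "${out_prefix}.unmapped.bam" \
    -1 "${out_prefix}.R1.fastq.gz" \
    -2 "${out_prefix}.R2.fastq.gz"
}
```

The resulting FASTQ pair can then be fed directly to Kraken, as in the paper.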

wanglu2014 commented 2 years ago

Thank you for your detailed reply!

wanglu2014 commented 2 years ago

How did you remove human sequences when samples only have FASTQ files?

gregpoore commented 2 years ago

All TCGA samples had precomputed bam files based on alignments to the human genome. The only fastq files came from the validation (plasma) study, and they were aligned to the human genome using the methods described in the paper (see Bioinformatic processing for plasma microbiome samples section in Methods, partially copied below):

"A total of 21,600,141,264 reads were generated on the single NovaSeq 6000 sequencing run across all samples. Of these, 19,046,611,360 reads were assigned to human samples (that is, negative and positive controls removed), and 2.186% of the total reads were classified as non-human. Raw sequencing data were demultiplexed and adaptor-trimmed using Atropos. Additional quality filtering was done using Trimmomatic with the following settings—(ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:7, MINLEN:50, TRAILING:20, AVGQUAL:20, SLIDINGWINDOW:20:20). An additional adaptor sequence consisting of a string of only G was added to the standard TruSeq adapters to remove trailing G stretches from the 5′ ends of reads. Read pairs were discarded if either mate mapped to the human genome (major-allele-SNP reference from 1000 Genomes Project) using Bowtie2 with the fast-local parameter set. Paired-end reads were then merged using FLASH with the following parameters— (minimum overlap: 20, maximum overlap: 150, mismatch ratio: 0.01).

The filtered, merged reads were then processed either by Kraken, using the same workflow and database (n = 59,974 microbial genomes) detailed above, or with SHOGUN as detailed here."
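The host-removal step in that quoted Methods section can be sketched roughly as below. This is an assumption-laden illustration, not the authors' actual script: the function name, index path, and output names are hypothetical, and the upstream Atropos/Trimmomatic trimming is assumed to have already been applied. It keeps only pairs in which neither mate maps to the human reference, which is equivalent to discarding a pair if either mate maps.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the FASTQ host-depletion described above:
# align read pairs to a human Bowtie2 index with the fast-local preset,
# then retain only pairs where NEITHER mate aligned.
host_deplete_fastq() {
  local r1=$1 r2=$2 human_index=$3 out_prefix=$4 threads=${5:-4}
  bowtie2 --fast-local -p "$threads" -x "$human_index" \
      -1 "$r1" -2 "$r2" \
    | samtools fastq -f 12 -F 256 - \
        -1 "${out_prefix}.nonhuman.R1.fastq.gz" \
        -2 "${out_prefix}.nonhuman.R2.fastq.gz"
  # -f 12 = read unmapped (0x4) AND mate unmapped (0x8);
  # -F 256 drops secondary alignments so each read is output once.
}
```

The surviving pairs would then be merged with FLASH and profiled with Kraken or SHOGUN, per the quoted Methods.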

wanglu2014 commented 2 years ago

Thank you!