FredHutch / Galeano-Nino-Bullman-Intratumoral-Microbiota_2022

Analysis code used in Galeano Nino et al., Impact of Intratumoral Microbiota on Spatial and Cellular Heterogeneity in human cancer. 2022
MIT License
33 stars 10 forks source link

pathseq bam is missing reads in the literature #10

Closed jr-ran closed 1 year ago

jr-ran commented 1 year ago

In the CRC 16 sample results, the bam results I obtained for pathseq did not contain the reads in the literature, e.g AAACACCAATAACTGC-1+TAGCGGGATGTT

$ samtools view 01.Pathseq/CRC_16.output.pathseq.bam|egrep "VH00699:3:AAAHK77HV:1:2105:49816:8308"

hanruiw commented 1 year ago

Dear Jianrong,

Thank you for your interest in our paper! In order to be annotated as a bacteria read, a read has to contain valid UB and CB tags from Spaceranger and a YP tag from PathSeq. Would you mind share more details about your question and the result for $ samtools view 01.Pathseq/CRC_16.output.pathseq.bam|egrep "VH00699:3:AAAHK77HV:1:2105:49816:8308"?

Best regards, Hanrui

jr-ran commented 1 year ago

Dear Hanrui,

My reference genome is download from pathseqDB, and gatk command is

gatk-4.1.3.0/gatk --java-options -Xmx100G \ PathSeqPipelineSpark \ --input ${dir}/CRC_16/outs/possorted_genome_bam.bam \ --filter-bwa-image ${refdir}/host/pathseq_host.fa.img \ --kmer-file ${refdir}/host/pathseq_host.bfi \ --min-clipped-read-length 60 \ --microbe-fasta ${refdir}/micro/pathseq_microbe.fa \ --microbe-bwa-image ${refdir}/micro/pathseq_microbe.fa.img \ --taxonomy-file ${refdir}/micro/pathseq_taxonomy.db \ --output ${odir}/01.Pathseq/CRC_16.output.pathseq.bam \ --scores-output ${odir}/01.Pathseq/CRC_16.output.pathseq.csv --is-host-aligned false \ --filter-duplicates false \ --min-score-identity .7

Bam results as input were analyzed using umi_annotater.py.

each_sample=CRC_16 python UMI_annotator.py ${bam_path}/${each_sample}/outs/possorted_genome_bam.bam \ '' \ ${bam_path}/${each_sample}/outs/raw_feature_bc_matrix/barcodes.tsv.gz \ ${pathseq_path}/${each_sample}.output.pathseq.bam \ ${pathseq_path}/${each_sample}.output.pathseq.csv \ ${out_path}/${each_sample}.visium.raw_matrix.readname \ ${out_path}/${each_sample}.visium.raw_matrix.unmap_cbub.bam \ ${out_path}/${each_sample}.visium.raw_matrix.unmap_cbub.fasta \ ${out_path}/${each_sample}.visium.raw_matrix.list \ ${out_path}/${each_sample}.visium.raw.raw_matrix.readnamepath \ ${out_path}/${each_sample}.visium.raw_matrix.genus.cell \ ${out_path}/${each_sample}.visium.raw_matrix.genus.csv \ ${out_path}/${each_sample}.visium.raw_matrix.validate.csv

But the resulting document is inconsistent with the literature.The CRC_16.visium.raw_matrix.validate.csv file has 145518 lines recorded in the literature, but only 75129 lines were recorded in my results. Total UMI is 40402 in my result, and the literature is 84004. Total reads is 340883 in my result, and the literature is 795040.

example: Genus,Number_of_Cells,Number_of_UMI,Number_of_reads Fusobacterium,644,16075,100735

hanruiw commented 1 year ago

Dear Jianrong,

Thanks for your reply, would you mind also share the printed out information from UMIannotator.py?

Hanrui

jr-ran commented 1 year ago

hello hanrui, pathseq result is

$ ll -h 01.Pathseq/ -rw-r--r-- 1 2.3G Apr 10 16:37 01.Pathseq/CRC_16.output.pathseq.bam -rw-r--r-- 1 4.6M Apr 10 16:37 01.Pathseq/CRC_16.output.pathseq.bam.bai -rw-r--r-- 1 16K Apr 10 16:37 01.Pathseq/CRC_16.output.pathseq.bam.sbi -rw-r--r-- 1 4.3M Apr 10 16:35 01.Pathseq/CRC_16.output.pathseq.csv

UMIannotator.py result is

$ ll -h -rw-r--r-- 1 223K Apr 10 16:56 CRC_16.visium.raw_matrix.genus.cell -rw-r--r-- 1 1.1M Apr 10 16:56 CRC_16.visium.raw_matrix.genus.csv -rw-r--r-- 1 66K Apr 10 16:56 CRC_16.visium.raw_matrix.list -rw-r--r-- 1 259M Apr 10 16:47 CRC_16.visium.raw_matrix.readname -rw-r--r-- 1 1002M Apr 10 16:56 CRC_16.visium.raw_matrix.unmap_cbub.bam -rw-r--r-- 1 1.7G Apr 10 16:56 CRC_16.visium.raw_matrix.unmap_cbub.fasta -rw-r--r-- 1 3.3M Apr 10 16:56 CRC_16.visium.raw_matrix.validate.csv -rw-r--r-- 1 355M Apr 10 16:56 CRC_16.visium.raw.raw_matrix.readnamepath

hanruiw commented 1 year ago

Dear Jianrong,

I think this issue might caused by Pathseq. In our analysis, total number of reads in PathSeq bam file is 6,279,844

When running UMI_annotator.py, it should print out number of reads/UMIs in each step, would you mind to share it here?

Best regards, Hanrui

jr-ran commented 1 year ago

Dear Hanrui, umi_annotator.py running information is

len(dict_for_genus) = 4405 Total reads in pathseq bam = 2218034 Total reads in pathseq bam with YP tag = 427295 total cellranger bam reads = 174436315 total cellranger bam reads with UB CB tags (in-cell) = 170444671 total UNMAPPED cellranger bam reads with UB CB tags (in-cell) = 13494850 total cellranger reads with UB_CB_unmap Aligned to Pathseq reads with YP tags = (in-cell) 418772 Total unambigious UMI = 42632 total pathogen-associated gems = 2948 total filtered pathogen-associated cells = 2948 total number of cells = 4990

jr-ran commented 1 year ago

Dear Hanrui, I agree with you that this issue might caused by Pathseq. The reference genome size is as follows, and other parameters are consistent with the literature

-rw-r--r-- 1 143545701 Nov 30 2017 pathseq_microbe.dict -rw-r--r-- 1 69389179145 Oct 22 2017 pathseq_microbe.fa -rw-r--r-- 1 61318826 Oct 22 2017 pathseq_microbe.fa.fai -rw-r--r-- 1 119895871349 Nov 29 2017 pathseq_microbe.fa.img -rw-r--r-- 1 21548740 Dec 5 2017 pathseq_taxonomy.db -rw-r--r-- 1 8589934650 Nov 29 2017 pathseq_host.bfi -r--r--r-- 1 95177686 Oct 5 2017 pathseq_host.dict -r--r--r-- 1 3792941240 Oct 5 2017 pathseq_host.fa -r--r--r-- 1 52425722 Oct 5 2017 pathseq_host.fa.fai -r--r--r-- 1 6559046198 Oct 5 2017 pathseq_host.fa.img

hanruiw commented 1 year ago

Dear Jianrong,

The size of our databases are exactly the same. I reran the pipeline and get the same results as before (total number of reads in PathSeq bam file is 6,279,844), while we have the same number of reads from Spaceranger bam file (174,436,315).

Hanrui

jr-ran commented 1 year ago

Dear Hanrui, Thanks for your replay. May I ask if it is convenient to share the command for rerun pathseq?

hanruiw commented 1 year ago

Dear Jianrong,

The code I used to rerun PathSeq: gatk --java-options "-Xmx750g" PathSeqPipelineSpark \ --input ${file} \ --filter-bwa-image ${pathseqdb}/pathseq_host.fa.img \ --kmer-file ${pathseqdb}/pathseq_host.bfi \ --min-clipped-read-length 60 \ --microbe-fasta ${pathseqdb}/pathseq_microbe.fa \ --microbe-bwa-image ${pathseqdb}/pathseq_microbe.fa.img \ --taxonomy-file ${pathseqdb}/pathseq_taxonomy.db \ --output ${outpath}/${samplename}.pathseq.complete.bam \ --scores-output ${outpath}/${samplename}.pathseq.complete.csv \ --is-host-aligned false \ --filter-duplicates false \ --min-score-identity .7

When you have time, could you please double check if your PathSeq completed correctly?

Hanrui

hanruiw commented 1 year ago

Dear Jianrong,

As we discussed in the email, this issue was caused by missing parameters of --filter-duplicates false and --min-score-identity .7 when you were running PathSeq. Since it is solved, I'm going to close this issue, but please let us know if you have any other questions!

Best regards, Hanrui