Closed jr-ran closed 1 year ago
Dear Jianrong,
Thank you for your interest in our paper! In order to be annotated as a bacteria read, a read has to contain valid UB
and CB
tags from Spaceranger and a YP
tag from PathSeq. Would you mind share more details about your question and the result for $ samtools view 01.Pathseq/CRC_16.output.pathseq.bam|egrep "VH00699:3:AAAHK77HV:1:2105:49816:8308"
?
Best regards, Hanrui
Dear Hanrui,
My reference genome is download from pathseqDB, and gatk command is
gatk-4.1.3.0/gatk --java-options -Xmx100G \ PathSeqPipelineSpark \ --input ${dir}/CRC_16/outs/possorted_genome_bam.bam \ --filter-bwa-image ${refdir}/host/pathseq_host.fa.img \ --kmer-file ${refdir}/host/pathseq_host.bfi \ --min-clipped-read-length 60 \ --microbe-fasta ${refdir}/micro/pathseq_microbe.fa \ --microbe-bwa-image ${refdir}/micro/pathseq_microbe.fa.img \ --taxonomy-file ${refdir}/micro/pathseq_taxonomy.db \ --output ${odir}/01.Pathseq/CRC_16.output.pathseq.bam \ --scores-output ${odir}/01.Pathseq/CRC_16.output.pathseq.csv --is-host-aligned false \ --filter-duplicates false \ --min-score-identity .7
Bam results as input were analyzed using umi_annotater.py.
each_sample=CRC_16 python UMI_annotator.py ${bam_path}/${each_sample}/outs/possorted_genome_bam.bam \ '' \ ${bam_path}/${each_sample}/outs/raw_feature_bc_matrix/barcodes.tsv.gz \ ${pathseq_path}/${each_sample}.output.pathseq.bam \ ${pathseq_path}/${each_sample}.output.pathseq.csv \ ${out_path}/${each_sample}.visium.raw_matrix.readname \ ${out_path}/${each_sample}.visium.raw_matrix.unmap_cbub.bam \ ${out_path}/${each_sample}.visium.raw_matrix.unmap_cbub.fasta \ ${out_path}/${each_sample}.visium.raw_matrix.list \ ${out_path}/${each_sample}.visium.raw.raw_matrix.readnamepath \ ${out_path}/${each_sample}.visium.raw_matrix.genus.cell \ ${out_path}/${each_sample}.visium.raw_matrix.genus.csv \ ${out_path}/${each_sample}.visium.raw_matrix.validate.csv
But the resulting document is inconsistent with the literature.The CRC_16.visium.raw_matrix.validate.csv file has 145518 lines recorded in the literature, but only 75129 lines were recorded in my results. Total UMI is 40402 in my result, and the literature is 84004. Total reads is 340883 in my result, and the literature is 795040.
example: Genus,Number_of_Cells,Number_of_UMI,Number_of_reads Fusobacterium,644,16075,100735
Dear Jianrong,
Thanks for your reply, would you mind also share the printed out information from UMIannotator.py?
Hanrui
hello hanrui, pathseq result is
$ ll -h 01.Pathseq/ -rw-r--r-- 1 2.3G Apr 10 16:37 01.Pathseq/CRC_16.output.pathseq.bam -rw-r--r-- 1 4.6M Apr 10 16:37 01.Pathseq/CRC_16.output.pathseq.bam.bai -rw-r--r-- 1 16K Apr 10 16:37 01.Pathseq/CRC_16.output.pathseq.bam.sbi -rw-r--r-- 1 4.3M Apr 10 16:35 01.Pathseq/CRC_16.output.pathseq.csv
UMIannotator.py result is
$ ll -h -rw-r--r-- 1 223K Apr 10 16:56 CRC_16.visium.raw_matrix.genus.cell -rw-r--r-- 1 1.1M Apr 10 16:56 CRC_16.visium.raw_matrix.genus.csv -rw-r--r-- 1 66K Apr 10 16:56 CRC_16.visium.raw_matrix.list -rw-r--r-- 1 259M Apr 10 16:47 CRC_16.visium.raw_matrix.readname -rw-r--r-- 1 1002M Apr 10 16:56 CRC_16.visium.raw_matrix.unmap_cbub.bam -rw-r--r-- 1 1.7G Apr 10 16:56 CRC_16.visium.raw_matrix.unmap_cbub.fasta -rw-r--r-- 1 3.3M Apr 10 16:56 CRC_16.visium.raw_matrix.validate.csv -rw-r--r-- 1 355M Apr 10 16:56 CRC_16.visium.raw.raw_matrix.readnamepath
Dear Jianrong,
I think this issue might caused by Pathseq. In our analysis, total number of reads in PathSeq bam file is 6,279,844
When running UMI_annotator.py, it should print out number of reads/UMIs in each step, would you mind to share it here?
Best regards, Hanrui
Dear Hanrui, umi_annotator.py running information is
len(dict_for_genus) = 4405 Total reads in pathseq bam = 2218034 Total reads in pathseq bam with YP tag = 427295 total cellranger bam reads = 174436315 total cellranger bam reads with UB CB tags (in-cell) = 170444671 total UNMAPPED cellranger bam reads with UB CB tags (in-cell) = 13494850 total cellranger reads with UB_CB_unmap Aligned to Pathseq reads with YP tags = (in-cell) 418772 Total unambigious UMI = 42632 total pathogen-associated gems = 2948 total filtered pathogen-associated cells = 2948 total number of cells = 4990
Dear Hanrui, I agree with you that this issue might caused by Pathseq. The reference genome size is as follows, and other parameters are consistent with the literature
-rw-r--r-- 1 143545701 Nov 30 2017 pathseq_microbe.dict -rw-r--r-- 1 69389179145 Oct 22 2017 pathseq_microbe.fa -rw-r--r-- 1 61318826 Oct 22 2017 pathseq_microbe.fa.fai -rw-r--r-- 1 119895871349 Nov 29 2017 pathseq_microbe.fa.img -rw-r--r-- 1 21548740 Dec 5 2017 pathseq_taxonomy.db -rw-r--r-- 1 8589934650 Nov 29 2017 pathseq_host.bfi -r--r--r-- 1 95177686 Oct 5 2017 pathseq_host.dict -r--r--r-- 1 3792941240 Oct 5 2017 pathseq_host.fa -r--r--r-- 1 52425722 Oct 5 2017 pathseq_host.fa.fai -r--r--r-- 1 6559046198 Oct 5 2017 pathseq_host.fa.img
Dear Jianrong,
The size of our databases are exactly the same. I reran the pipeline and get the same results as before (total number of reads in PathSeq bam file is 6,279,844), while we have the same number of reads from Spaceranger bam file (174,436,315).
Hanrui
Dear Hanrui, Thanks for your replay. May I ask if it is convenient to share the command for rerun pathseq?
Dear Jianrong,
The code I used to rerun PathSeq: gatk --java-options "-Xmx750g" PathSeqPipelineSpark \ --input ${file} \ --filter-bwa-image ${pathseqdb}/pathseq_host.fa.img \ --kmer-file ${pathseqdb}/pathseq_host.bfi \ --min-clipped-read-length 60 \ --microbe-fasta ${pathseqdb}/pathseq_microbe.fa \ --microbe-bwa-image ${pathseqdb}/pathseq_microbe.fa.img \ --taxonomy-file ${pathseqdb}/pathseq_taxonomy.db \ --output ${outpath}/${samplename}.pathseq.complete.bam \ --scores-output ${outpath}/${samplename}.pathseq.complete.csv \ --is-host-aligned false \ --filter-duplicates false \ --min-score-identity .7
When you have time, could you please double check if your PathSeq completed correctly?
Hanrui
Dear Jianrong,
As we discussed in the email, this issue was caused by missing parameters of --filter-duplicates false and --min-score-identity .7 when you were running PathSeq. Since it is solved, I'm going to close this issue, but please let us know if you have any other questions!
Best regards, Hanrui
In the CRC 16 sample results, the bam results I obtained for pathseq did not contain the reads in the literature, e.g AAACACCAATAACTGC-1+TAGCGGGATGTT
$ samtools view 01.Pathseq/CRC_16.output.pathseq.bam|egrep "VH00699:3:AAAHK77HV:1:2105:49816:8308"