FredHutch / Galeano-Nino-Bullman-Intratumoral-Microbiota_2022

Analysis code used in Galeano Nino et al., Impact of Intratumoral Microbiota on Spatial and Cellular Heterogeneity in human cancer. 2022
MIT License

File info #12

Closed AnnaAntonatouPap closed 1 year ago

AnnaAntonatouPap commented 1 year ago

I have a rather simple question.

I want to know the column names of the "matrix.readnamepath" and "matrix.readname" files.

I guess in the latter the 3rd column is the mapping quality?

Many thanks in advance, A

hanruiw commented 1 year ago

Hi,

For the Visium part, the "readname" file contains read-level PathSeq annotation information. The first column is the read name, the second column contains the YP info from the PathSeq bam (genus-level and species-level taxa IDs), and the last column is the mapping quality.

The idea for the "matrix.readname" file is to extract read-level information from the Cellranger bam file and the PathSeq bam file. The first column is the read name, the second column is CB (corrected barcode), the third column is UB (corrected UMI), the fourth column is YP from the PathSeq bam (genus-level and species-level taxa IDs), the fifth column is the mapping quality, and the sixth is the genus-level annotation (based on YP). Good luck with your analysis and please let us know if you have any other questions!
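The six-column layout described above could be read roughly as follows. This is only a sketch: the tab delimiter, the field names, and the example rows (including the look of the YP value) are assumptions made for illustration and are not taken from the repository.

```python
import io
from typing import List, NamedTuple


class ReadRecord(NamedTuple):
    read_name: str     # column 1: read name
    cell_barcode: str  # column 2: CB (corrected barcode)
    umi: str           # column 3: UB (corrected UMI)
    yp: str            # column 4: YP tag from the PathSeq bam (taxa IDs)
    mapq: int          # column 5: mapping quality (from the PathSeq bam)
    genus: str         # column 6: genus-level annotation derived from YP


def parse_readname_table(handle) -> List[ReadRecord]:
    """Parse a six-column table; tab separation is an assumption."""
    records = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        name, cb, ub, yp, mapq, genus = fields[:6]
        records.append(ReadRecord(name, cb, ub, yp, int(mapq), genus))
    return records


# Made-up rows, only to show the column order.
example = io.StringIO(
    "read_1\tBC1\tUMI_A\t816,817\t42\tBacteroides\n"
    "read_2\tBC1\tUMI_A\t816,817\t17\tBacteroides\n"
)
records = parse_readname_table(example)
print(records[0].genus, records[0].mapq)
```

For a real file, `open(path)` would replace the `io.StringIO` object; check the delimiter against your own output first.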

Best regards, Hanrui

AnnaAntonatouPap commented 1 year ago

Thanks a lot for the quick answer! So the ones that are included in the final result are the ones with mapping quality higher than ?

Also, I have one more question about the PathSeq run. You comment that we need to change the min-clipped and identity arguments based on our samples. Could you give me a hint on how to choose these arguments? Based on which criteria?

Best, A.

hanruiw commented 1 year ago

Hi,

Thank you for your question. I’d like to clarify the use of mapping quality in our analysis. For the 10X Visium samples, the reads used in the analysis are 'unmapped human reads'. These reads have been assigned a mapping quality score by PathSeq.

In our pipeline for 10X Visium data, we did not set a minimum mapping quality threshold for including reads in the final result. Instead, we used the highest score from the 'mapping quality' to annotate the UMI. Despite the variations in mapping quality, our results are robust due to the reliability of PathSeq annotation at the genus level.

For more information on how mapping quality is computed, I recommend checking the MAPQ supplementary materials. It's important to note that mapping quality takes into account several factors, including the probability of contamination, the effect of mapping heuristics, and the error due to the repetitiveness of the reference.

Regarding min-clipped and identity in PathSeq analysis, I recommend reading some PathSeq materials to better understand PathSeq annotation:

https://gatk.broadinstitute.org/hc/en-us/articles/360035889911--How-to-Run-the-Pathseq-pipeline

https://gatk.broadinstitute.org/hc/en-us/articles/360037224472-PathSeqPipelineSpark

I would suggest adjusting those thresholds based on sequencing read length and read quality.

I hope this clarifies your question, but please don't hesitate to reach out if you have any further queries!

Best regards, Hanrui

hanruiw commented 1 year ago

Also, as the author of GATK-PathSeq mentioned in https://github.com/broadinstitute/gatk/issues/6818 : "--min-score-identity is, as you noted, a fractional value of the read length between 0 and 1 that is applied during the final scoring phase to adjust how stringently aligned reads should be categorized as either known microbial or non-host/non-microbial (unknown)." I hope this helps!
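To make the "fractional value of the read length" wording concrete, here is a small arithmetic sketch. It is not PathSeq's scoring code: the function name, the rounding with `ceil`, and the example value 0.9 are assumptions chosen for illustration. It only shows why the same fraction translates into a different absolute requirement for reads of different lengths.

```python
import math


def min_identity_bases(read_length: int, min_score_identity: float) -> int:
    """Approximate identity score a read of this length would need.

    Hypothetical helper: rounding up with ceil() is an assumption,
    not taken from the PathSeq source code.
    """
    if not 0.0 <= min_score_identity <= 1.0:
        raise ValueError("min_score_identity must be between 0 and 1")
    return math.ceil(read_length * min_score_identity)


# Same fraction, different read lengths (0.9 is an illustrative value).
for read_length in (50, 100, 150):
    print(read_length, min_identity_bases(read_length, 0.9))
```

Shorter reads need fewer matching bases in absolute terms for the same fraction, which is one reason to revisit the threshold when read length changes.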

AnnaAntonatouPap commented 1 year ago

Thanks a lot for the detailed answer. I am a bit confused, though: so the mapping quality in readname and matrix.readnamepath is the one from Cellranger, not the one from PathSeq (meaning the mapping quality to the corresponding microbial genome)? What I am trying to understand is which microbial-derived reads are in the final input file for Seurat, because I want to BLAST them and see the alignment.

hanruiw commented 1 year ago

Sorry for the confusion. The mapping quality in readname and matrix.readnamepath is the one from the PathSeq bam file, not the Cellranger bam file.

AnnaAntonatouPap commented 1 year ago

OK then, so all of the reads in these 2 files are used in the downstream analysis, regardless of the mapping quality?

hanruiw commented 1 year ago

Thank you for your follow-up question. To clarify, in our analysis we select the read with the highest score to annotate each UMI. In cases where multiple reads tied for the highest score, a single read was randomly selected for annotation. This means that while all reads contribute to the selection process, only the read with the highest score (or a randomly selected read in the event of a tie) is used for downstream analysis. This approach ensures that we are using the most reliable information available for each UMI in our analysis. I hope this answers your question, but please feel free to ask if you have further questions or need additional clarification.
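The selection rule described above (highest score per UMI, random choice among ties) can be sketched as follows. This is not the pipeline's own code: the function name, the tuple layout, and the choice to group by barcode plus UMI are assumptions made for illustration.

```python
import random
from collections import defaultdict


def pick_read_per_umi(rows, seed=0):
    """Pick one read per UMI: highest mapping quality, ties broken at random.

    rows: iterable of (read_name, cell_barcode, umi, mapq) tuples.
    Grouping by (cell_barcode, umi) is an assumption of this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the tie-break is reproducible
    by_umi = defaultdict(list)
    for read_name, cell_barcode, umi, mapq in rows:
        by_umi[(cell_barcode, umi)].append((read_name, mapq))

    chosen = {}
    for key, reads in by_umi.items():
        best = max(mapq for _, mapq in reads)
        tied = [name for name, mapq in reads if mapq == best]
        chosen[key] = rng.choice(tied)
    return chosen


# Made-up rows: UMI_A has a clear winner, UMI_B has a tie.
rows = [
    ("read_1", "BC1", "UMI_A", 42),
    ("read_2", "BC1", "UMI_A", 17),
    ("read_3", "BC1", "UMI_B", 30),
    ("read_4", "BC1", "UMI_B", 30),
]
selected = pick_read_per_umi(rows)
print(selected)
```

Because ties are broken randomly, the exact read used in the published analysis cannot be recovered this way for tied UMIs; any of the tied reads is an equally valid candidate to inspect.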

AnnaAntonatouPap commented 1 year ago

Many, many thanks for all the info you gave me. Just to summarize, then: the mapping quality in both readname and matrix.readnamepath is from PathSeq (meaning how well the read maps to a given microbial genome). So if I want to go through the reads that were used to annotate each UMI, I will extract, for each UMI tag (3rd column of matrix.readnamepath), the read with the highest mapping quality? I am asking because I want to go deeper into these reads with BLAST and IGV, so I need to be sure which reads to use.

Many thanks again for all the help.

AnnaAntonatouPap commented 1 year ago

Super. I have one last question. Would you trust the results from PathSeq more than the results of BLAST alone? From what I have understood, PathSeq is stricter about what it reports as a microbial read, which is good for avoiding false positives. Am I right?

Also, some of the taxa in the database curated for PathSeq-GATK (https://software.broadinstitute.org/pathseq/Downloads.html) are not included in the BLAST nt/nr database.

hanruiw commented 1 year ago

For our 10X Visium samples, we used the reads with the highest mapping quality to annotate bacterial UMIs, but only kept those annotations that are unambiguous at the genus level.

hanruiw commented 1 year ago

I'm afraid we are not able to give you a solid answer on BLAST vs PathSeq, since we didn't use BLAST in this project. Good luck with your analysis!

Best regards, Hanrui