FredHutch / Galeano-Nino-Bullman-Intratumoral-Microbiota_2022

Analysis code used in Galeano Nino et al., Impact of Intratumoral Microbiota on Spatial and Cellular Heterogeneity in human cancer. 2022
MIT License
33 stars 10 forks source link

cannot find my_sample.visium.raw_matrix.genus.csv #2

Closed NicolaasVanRenne closed 1 year ago

NicolaasVanRenne commented 1 year ago

Hello, I'm trying to run your Visium pipeline. I ran Visium_pipeline.sh on my visium sample (that employs UMI_annotator.py - which is also provided).

I get the python output and the pathseq output, however, I do not see my_sample.visium.raw_matrix.genus.csv anywhere. I need this file in the next pipeline: Visium.R

The tutorial also talks about Pipeline_Visium.sh which is not provided -or it is the same as Visium_pipeline.sh. Could you confirm this or is there a Pipeline_Visium.sh? Confusing...

But again; which part of the code should yield my_sample.visium.raw_matrix.genus.csv? I do not see that anywhere in the code provided, or am I missing something.

Thanks for your help, I really feel lost here...

Nicolaas

hanruiw commented 1 year ago

Dear Nicolaas, sorry for confusion, I updated the name of pipeline in readme.md. In addition, could you please show me the list of output file from the python script?

NicolaasVanRenne commented 1 year ago

thanks for answering so quickly, much appreciated!

This is my python folder. Maybe it is simply malfunctioning on my sample, and I could try and reproduce your dataset first...

image

Could you share what I should get if it's running correctly?

Thanks! Nicolaas

hanruiw commented 1 year ago

Dear Nicolaas, could you please check if the PathSeq output is empty? I will try to replicate this issue on my side and get back to you.

NicolaasVanRenne commented 1 year ago

It is not empty: here are the files in the pathseq directory

image

hanruiw commented 1 year ago

Dear Nicolaas, I reran UMI_annotator.py :

图片

and below is the file list in the output folder:

图片

.genus.csv file contains the UMI matrix that can be used to annotate the Seurat object

NicolaasVanRenne commented 1 year ago

So while the Visium.R code talks about my_sample.visium.raw_matrix.genus.csv as shown below...

metadata_file = paste0(processed_foler,'/',each_sample,'.visium.raw_matrix.genus.csv')

This should actually be my_sample.visium.filtered_matrix.genus.csv

Is that correct? Confusing!

If so, then my .genus.csv file is empty... Maybe there are no bacterial reads in my sample?

I have mapped the CRC_16 sample with spaceranger and I am now trying to replicate the work. Let's see if that one works and work up from there.

Thank you Hanrui for your kind support

NicolaasVanRenne commented 1 year ago

There is progress :)

Visium.R works if I change it to filtered_matrix.genus.csv instead of raw_matrix.genus.csv

However, my sample_CRC_16.visium.filtered_matrix.genus.csv has considerably lower cells (ie. spots) than the raw version

(IMPORTANT: you seemed to have mislabeled your genus.csv files; CRC16 is OSCC_2 and vice versa!!! on https://github.com/FredHutch/Galeano-Nino-Bullman-Intratumoral-Microbiota_2022/tree/main/10X_Visium_samples)/output )

It seems that the raw version has more cells than the filtered. But if I load your genus.csv file and project it on my spots, I get exactly the same number of reads for Fusobacterium - that gives me great confidence I have done it right. (spaceranger + pathseq).

So, that leaves still the issue: why don't have I any bacterial reads? I could imagine my sample just has very low bacteria reads but zero seems very unlikely. Could it have something to do with the following?

--min-score-identity and --min-clipped-read-length based on your samples

How do I set min-score-identity and min-clipped-read-length according to my samples?

Nicolaas

hanruiw commented 1 year ago

Dear Nicolaas, sorry for confusion, I updated the pipeline code to output ''raw_matrix' instead of 'filtered_matrix' and the file names in output folder. The difference between raw matrix and filtered matrix is whether the calculation is based on 'capture spots' instead of 'all spots on the slide'. However, this does not affect our analysis because we used filtered_feature_bc_matrix/barcodes.tsv.gz in Seurat, so any spots other than captured spots are not counted. In our pipeline, unmapped reads with UB/CB tags from Spaceranger bam file and YP tag from PathSeq (connected based on read names) will be extracted to .readnamepath file, so if your .readnamepath file is empty, maybe it's because there's no reads have UB/CB tags and YP tag at the same time. Based on my opinion, if you lower the Pathseq thresholds (--min-score-identity and --min-clipped-read-length) you might get larger number of bacteria reads (PathSeq help: https://gatk.broadinstitute.org/hc/en-us/articles/9570234210075-PathSeqPipelineSpark and https://gatk.broadinstitute.org/hc/en-us/articles/360035889911--How-to-Run-the-Pathseq-pipeline)

NicolaasVanRenne commented 1 year ago

Hello again,

I'm not sure about my outcome. I get bacterial reads if I set my , but most of hem lack a YP-tag (not sure what YP tag means).

I set: --min-clipped-read-length 31 --min-score-identity .7

I think 31 is the lowest setting possible but not sure here... min score .7 but also unsure if it is low enough...

For my sample_XXX I get the following outcome:

sample_XXX [W::hts_idx_load2] The index file is older than the data file: / sample_XXX/outs/possorted_genome_bam.bam.bai len(dict_for_genus) = 3608 Total reads in pathseq bam = 4347879 Total reads in pathseq bam with YP tag = 10635 total cellranger bam reads = 213903991 total cellranger bam reads with UB CB tags (in-cell) = 185496985 total UNMAPPED cellranger bam reads with UB CB tags (in-cell) = 10047351 total cellranger reads with UB_CB_unmap Aligned to Pathseq reads with YP tags = (in-cell) 8489 Total unambigious UMI = 7897 total pathogen-associated gems = 2336 total filtered pathogen-associated cells = 2336 total number of cells = 2499

In the pathseq/sample_XXX.pathseq.complete.csv file I have all the reads listed: image

The virus reads are probably a false positive, but the bacteria might be real (or also false positive). In any case, in python/sample_XXX.visium.filtered_matrix.genus.csv, these bacterial reads are all lost. Where did they all go? Are they associatd with spots that are not retained in the spatial coordinates (and thus 100% sure false positive reads), or don't they appear on my barcodes for some settings I can adjust? Below are the ONLY columns I get, and apart from this vesiculvirus column, its all zero except for 1 cell (spot) with 1 read...

image

Probably there are just not enough "Total cellranger reads with UB_CB_unmap Aligned to Pathseq reads with YP tags" but what does it mean, and how can I increase them?

Below is the replication of the CRC_16 sample. Here you see a lot of UB_CB_unmap Aligned to Pathseq reads with YP tags.

sample_CRC_16 len(dict_for_genus) = 18470 Total reads in pathseq bam = 6279876 Total reads in pathseq bam with YP tag = 2645265 total cellranger bam reads = 174436315 total cellranger bam reads with UB CB tags (in-cell) = 143746173 total UNMAPPED cellranger bam reads with UB CB tags (in-cell) = 9439887 total cellranger reads with UB_CB_unmap Aligned to Pathseq reads with YP tags = (in-cell) 1309914 Total unambigious UMI = 84171 total pathogen-associated gems = 1418 total filtered pathogen-associated cells = 1418 total number of cells = 3104

Any idea's or suggestions?

Thank you

Nicolaas

hanruiw commented 1 year ago

Dear Nicolaas,

Reads from SpaceRanger bam file are assigned 'UB' and 'CB' tags (corrected UMI/barcode), then reads from PathSeq bam file are assigned 'YP' tag (The YP read tag lists the NCBI taxonomy IDs of any aligned species meeting the alignment identity criteria), and only reads contain all of 'UB', 'CB' and 'YP' tags will be extracted to .readnamepath file. I would suggest you to extract unmapped reads with 'UB' and 'CB' tags from Spaceranger bam file and rerun PathSeq on those reads to validate the result (PySam might be helpful in processing bam files). The detection of bacteria is associated with many factors, not all tissue contains high bacteria load. Unfortunately I'm not a wet lab person, I would suggest you to contact Prof. Susan Bullman or Dr. Jorge L Galeano for more details about tissue selection and wet lab process.

Best regards, Hanrui

hanruiw commented 1 year ago

Dear Nicolaas,

I'm going to close this issue, if you have any questions please don't hesitate reopen it and let me know.

Best wishes, Hanrui