epi2me-labs / wf-single-cell

Visium v1 Spatial analysis #138

Closed MustafaElshani closed 1 month ago

MustafaElshani commented 3 months ago

The pipeline ran successfully using --kit 'visium:v1' with v2.2.0. The outputs generated are as follows:

config_stats.json
gene.expression.mito-per-cell.tsv
gene.expression.umap.0.tsv
gene.expression.umap.1.tsv
gene.expression.umap.2.tsv
gene_processed_feature_bc_matrix
gene_raw_feature_bc_matrix
kneeplot.png
read_tags.tsv
tagged.bam
tagged.bam.bai
transcriptome.fa.gz
transcriptome.gff.gz
transcript.expression.umap.0.tsv
transcript.expression.umap.1.tsv
transcript.expression.umap.2.tsv
transcript_processed_feature_bc_matrix
transcript_raw_feature_bc_matrix
whitelist.tsv

These outputs differ from those described in the README. Is this expected?

The main question is which output should be used to analyse the data spatially. What tools are available for spatial analysis? Is there a way to prepare FASTQs for follow-up analysis with 10x's Space Ranger?

Mustafa

nrhorner commented 3 months ago

Hi @MustafaElshani, thanks for your post.

Yes, some of the outputs have changed and the README needs updating to reflect this. This will be fixed shortly.

The initial support for Visium data in the workflow provides spot barcodes in read_tags.tsv (the corrected_barcode column) and the feature x barcode (spot) matrices in MEX format (gene_processed_feature_bc_matrix/, gene_raw_feature_bc_matrix/, transcript_processed_feature_bc_matrix/, transcript_raw_feature_bc_matrix/).
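
As a quick sanity check on the spot barcodes, reads per spot can be tallied from read_tags.tsv. The following is a minimal sketch, not part of the workflow. It assumes the file is tab-separated with a header row containing corrected_barcode, and that unassigned reads carry an empty value or a "-" placeholder. The read_id and corrected_umi columns in the example data are hypothetical.

```python
import csv
import io
from collections import Counter


def reads_per_spot(handle, column="corrected_barcode", missing=("", "-")):
    """Count reads per spot barcode in a tab-separated table with a header row.

    `missing` lists values treated as "no barcode assigned"; both the empty
    string and "-" are assumptions about how unassigned reads are written.
    """
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        barcode = row.get(column)
        if barcode is None or barcode in missing:
            continue
        counts[barcode] += 1
    return counts


# Made-up example rows; only the corrected_barcode column name comes from the thread.
example = io.StringIO(
    "read_id\tcorrected_barcode\tcorrected_umi\n"
    "r1\tAAACCCGGGTTTAAAC\tTTTTTTTTTTTT\n"
    "r2\tAAACCCGGGTTTAAAC\tGGGGGGGGGGGG\n"
    "r3\t-\tCCCCCCCCCCCC\n"
    "r4\tGGGTTTAAACCCGGGT\tAAAAAAAAAAAA\n"
)
counts = reads_per_spot(example)
print(counts.most_common())
```

For a real run, pass `open("read_tags.tsv", newline="")` in place of the in-memory example.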

The next release will have some support for visualisation of the results.

I've not investigated downstream tools for spatial omics.

What is special about FASTQs that are used as input to Space Ranger?

Neil

MustafaElshani commented 3 months ago

Hi Neil @nrhorner,

Well, it was a fun weekend!

The goal was to determine the positions of the barcode spots relative to the scanned image. 10x Genomics' Spaceranger achieves this by finding the orientation of the fiducial marks and tissue, resulting in the creation of the tissue_positions.csv file, which contains the following:

| Barcode              | In Tissue | Array Row | Array Col | Pxl Row in Fullres | Pxl Col in Fullres  |
|----------------------|-----------|-----------|-----------|--------------------|---------------------|
| ACGCCTGACACGCGCT-1   | 0         | 0         | 0         | 33419.12311436416   | 2656.359174250035    |
| TACCGATCCAACACTT-1   | 0         | 1         | 1         | 33190.0081321969    | 3050.5933558348743   |
| ATTAAAGCGGACGAGC-1   | 0         | 0         | 2         | 32964.71728552074   | 2654.161395232163    |

Here, In Tissue is a binary value, where 0 indicates no tissue, and 1 indicates the presence of tissue.
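
To connect these positions to the wf-single-cell matrices, the in-tissue spots can be collected into a barcode-to-pixel lookup. This is a rough sketch under two assumptions. First, the CSV has a header row named barcode, in_tissue, array_row, array_col, pxl_row_in_fullres, pxl_col_in_fullres. Second, the "-1" suffix has to be stripped before the barcodes match the workflow's corrected_barcode values. The barcodes and coordinates in the example are made up.

```python
import csv
import io


def load_in_tissue_spots(handle, strip_suffix=True):
    """Return {barcode: (pxl_row, pxl_col)} for spots flagged as in tissue.

    Column names are assumed; adjust them if the file has no header row
    or uses different names.
    """
    spots = {}
    for row in csv.DictReader(handle):
        if row["in_tissue"] != "1":
            continue  # spot is not covered by tissue
        barcode = row["barcode"]
        if strip_suffix:
            # Assumption: drop the "-1" suffix so barcodes match the
            # unsuffixed barcodes written by the workflow.
            barcode = barcode.split("-")[0]
        spots[barcode] = (
            float(row["pxl_row_in_fullres"]),
            float(row["pxl_col_in_fullres"]),
        )
    return spots


# Illustrative rows only, not real Space Ranger output.
example = io.StringIO(
    "barcode,in_tissue,array_row,array_col,pxl_row_in_fullres,pxl_col_in_fullres\n"
    "AAAACCCCGGGGTTTT-1,0,0,0,100.0,200.0\n"
    "TTTTGGGGCCCCAAAA-1,1,1,1,200.5,300.25\n"
)
spots = load_in_tissue_spots(example)
print(spots)
```

The resulting dictionary can then be used to subset matrix columns to in-tissue spots, or to place per-spot values on the full-resolution image.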

Spaceranger is optimized for Illumina reads, and even then it requires the bam2fastq converter. After proper formatting, the resulting FASTQs can be used as inputs to Spaceranger.

[!IMPORTANT] The FASTQs must follow the format sample_S1_L001_R1_001.fastq.gz and sample_S1_L001_R2_001.fastq.gz with the following structure:

sample_S1_L001_R1_001.fastq.gz:

    read_id 1:N:0:GATAATACCG+TTTACGTGGT    # Keep the i5 and i7 as is for all reads
    uncorrected_barcode(CR)+uncorrected_umi(UR)
    +
    quality_barcode(CY)+quality_umi(UY)

sample_S1_L001_R2_001.fastq.gz:

    read_id 2:N:0:GATAATACCG+TTTACGTGGT    # Keep the i5 and i7 as is for all reads
    main sequence
    +
    Quality of sequence

To generate the above files, I used the tagged.bam file from the wf-single-cell workflow along with the following bam2fastq_ONT.sh script, which generates the FASTQs for input into Spaceranger.
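
For readers without access to that script, the record layout above can be sketched in Python. This is not the bam2fastq_ONT.sh script itself, only an illustration of the same idea. It assumes each input line is a SAM text record (for example from `samtools view tagged.bam`) carrying CR, UR, CY and UY tags, and it reads the "+" in the layout as plain concatenation. Restoring reverse-strand reads to their original orientation and skipping secondary or supplementary alignments are choices made for this sketch.

```python
INDEX_PAIR = "GATAATACCG+TTTACGTGGT"  # index pair copied from the post, reused for every read

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_line_to_fastq_pair(line, index_pair=INDEX_PAIR):
    """Turn one SAM text line into (R1, R2) FASTQ records, or None if skipped."""
    fields = line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x900:
        return None  # secondary (0x100) or supplementary (0x800) alignment
    # Optional fields look like TAG:TYPE:VALUE; the value may itself contain ":".
    tags = {}
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    if not all(t in tags for t in ("CR", "UR", "CY", "UY")):
        return None  # no barcode/UMI information to build R1 from
    if flag & 0x10:
        # Reverse-strand alignments store the reverse complement; undo that.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    r1 = (f"@{name} 1:N:0:{index_pair}\n{tags['CR']}{tags['UR']}\n"
          f"+\n{tags['CY']}{tags['UY']}\n")
    r2 = f"@{name} 2:N:0:{index_pair}\n{seq}\n+\n{qual}\n"
    return r1, r2


# Made-up reverse-strand record for illustration.
example_line = "\t".join([
    "read1", "16", "chr1", "100", "60", "4M", "*", "0", "0", "AACG", "IIH#",
    "CR:Z:ACGT", "UR:Z:TT", "CY:Z:FFFF", "UY:Z:F:",
])
r1, r2 = sam_line_to_fastq_pair(example_line)
print(r1, r2, sep="")
```

In practice the two records would be written to gzip-compressed sample_S1_L001_R1_001.fastq.gz and sample_S1_L001_R2_001.fastq.gz, keeping the reads in the same order in both files.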

After generating the FASTQs, I ran the following command:

    spaceranger count --id SR_610_DMSO \
                      --transcriptome ../../../References_10xGenomics/refdata-gex-GRCh38-2024-A \
                      --create-bam true \
                      --slide V13A03-301 \
                      --area C1 \
                      --image ../../../images/Visium-jpg-btiff/SR_610_DMSO_V13A03-301_C1.tif \
                      --output-dir ./spaceranger_out \
                      --fastqs ./fastqs/ \
                      --r2-length 500 \
                      --loupe-alignment ../../../images/Visium-jpg-btiff/SR_610_DMSO_V13A03-301_C1.json 

[!TIP] It was important to set --r2-length 500, as the Spaceranger pipeline is not designed for long reads.

This is a workaround, but it allowed me to generate the tissue_positions.csv file for the image I have. I'm now planning to use the wf-single-cell matrices for analysis with Seurat.

Until 10x Genomics supports long reads, this is the only plausible approach I've found that doesn't require using short reads.

Mustafa

nrhorner commented 2 months ago

Hi Mustafa

I'm glad you got that working. Thanks for sharing the guide and your code. It will be useful.

Thanks,

Neil