GWW / scsnv

scSNV Mapping tool for 10X Single Cell Data
MIT License
22 stars 4 forks source link

Question about the output of scsnvmisc annotate #13

Closed chuiqin closed 1 year ago

chuiqin commented 1 year ago

I'm sorry to bother you again. After I executed the scsnvmisc annotate command, 4 files (pileup_annotated.h5, pileup_barcodes.txt.gz, pileup_filtering.txt, pileup_passed_snvs.txt.gz) were generated in the sample folder. However, the sample folder in the tutorial has so many files:

sample/pileup_annotated.h5
sample/pileup_barcodes.txt.gz
sample/pileup.txt.gz
sample/pileup_barcode_matrices.h5
sample/cells.h5
sample/summary.h5
sample/snv_map.txt.gz
sample/snv_base_counts.txt.gz
sample/snv_reads.txt.gz
sample/barcode_bases.txt.gz

The code and result of executing the scsnvmisc annotate command are as follows:

(scsnv) scsnvmisc annotate -r /home/data/cq/scsnv/repeatrmsk.txt.gz \
> -d /home/data/cq/scsnv/1000GENOMES/1000GENOMES.txt.gz \
> -e /home/data/cq/scsnv/REDIportal_hg38.txt.gz \
> sample/pileup
Processing sample/pileup
Kept 397778 out of 415823
[ True True True ... True True True ] (415823,) (415823, 6991) (415823, 6991)
Parsing 1000 Genomes file
Annotating the SNVs
Found in 1000 Genomes: 0
Annotating repeats
Building NCLS for each reference
Querying each SNV
SNVs with a repeat masker overlap: 301853
Annotating edits
REDIportal 143148
(scsnv)
(scsnv) tree sample/
sample/
├── pileup_annotated.h5
├── pileup_barcode_matrices.h5
├── pileup_barcodes.txt.gz
├── pileup_filtering.txt
├── pileup_passed_snvs.txt.gz
└── pileup.txt.gz

0 directories, 6 files
(scsnv) ls -lh sample/
total 1.5G
-rw-rw-r-- 1 cq cq 110M Oct 20 16:19 pileup_annotated.h5
-rw-rw-r-- 1 cq cq 1.4G Oct 20 12:19 pileup_barcode_matrices.h5
-rw-rw-r-- 1 cq cq 109K Oct 20 12:19 pileup_barcodes.txt.gz
-rw-rw-r-- 1 cq cq 252 Oct 20 15:44 pileup_filtering.txt
-rw-rw-r-- 1 cq cq 7.7M Oct 20 15:48 pileup_passed_snvs.txt.gz
-rw-rw-r-- 1 cq cq 7.1M Oct 20 12:19 pileup.txt.gz

The complete execution code is as follows:

# Build the index
/home/data/cq/scsnv/scsnv/build/src/scsnv index \
-g /home/data/cq/opt/refdata-gex-GRCh38-2020-A/genes/genes.gtf \
-r /home/data/cq/opt/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
index_prefix/index_prefix
# scsnv pileup
/home/data/cq/scsnv/scsnv/build/src/scsnv pileup -c -l V3 -i index_prefix/index_prefix \
-r /home/data/cq/opt/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
-o sample/pileup -p /home/data/cq/ALL/04.barcode/JYC.txt.gz \
-t 4 -x 4 /home/data/cq/opt/JYC/outs/possorted_genome_bam.bam
# annotate
scsnvmisc annotate -r /home/data/cq/scsnv/repeatrmsk.txt.gz \
-d /home/data/cq/scsnv/1000GENOMES/1000GENOMES.txt.gz \
-e /home/data/cq/scsnv/REDIportal_hg38.txt.gz \
sample/pileup

To eliminate effects from the samples, I also tested other single-cell transcriptome sequencing samples. These samples are both 10X V3 and 10X V2 samples. But they both output very few results.

(scsnv) ls gxm/
pileup_annotated.h5  pileup_barcode_matrices.h5  pileup_barcodes.txt.gz  pileup_filtering.txt  pileup_passed_snvs.txt.gz  pileup.txt.gz
(scsnv) ls JYC/
pileup_annotated.h5  pileup_barcode_matrices.h5  pileup_barcodes.txt.gz  pileup_filtering.txt  pileup_passed_snvs.txt.gz  pileup.txt.gz
(scsnv) ls lyh/
pileup_annotated.h5  pileup_barcode_matrices.h5  pileup_barcodes.txt.gz  pileup_filtering.txt  pileup_passed_snvs.txt.gz  pileup.txt.gz
(scsnv) ls SRR9264349/
pileup_annotated.h5  pileup_barcode_matrices.h5  pileup_barcodes.txt.gz  pileup_filtering.txt  pileup_passed_snvs.txt.gz  pileup.txt.gz
(scsnv) ls SRR9264350/
pileup_annotated.h5  pileup_barcode_matrices.h5  pileup_barcodes.txt.gz  pileup_filtering.txt  pileup_passed_snvs.txt.gz  pileup.txt.gz
(scsnv) ls SRR9264353/
pileup_annotated.h5  pileup_barcode_matrices.h5  pileup_barcodes.txt.gz  pileup_filtering.txt  pileup_passed_snvs.txt.gz  pileup.txt.gz
(scsnv) ls SRR9264354/
pileup_annotated.h5  pileup_barcode_matrices.h5  pileup_barcodes.txt.gz  pileup_filtering.txt  pileup_passed_snvs.txt.gz  pileup.txt.gz

Am I missing any steps? Also, after checking 4 files (pileup_annotated.h5, pileup_barcodes.txt.gz, pileup_filtering.txt, pileup_passed_snvs.txt.gz), I can't make good use of the output of scsnv package. If I want to get a matrix (rows are cells and columns are the SNVs of this cell) which output file should I utilize? Can your help me again?

Supplement: This single-cell transcriptome-sequencing sample was an acute B-cell leukemia sample that was found to be 80% naive on flow cytometry.

GWW commented 1 year ago

Hi,

Some of the additional files may be from the SNV co-expression analysis, which may not be relevant to your analyses.

I have added a simple command to the scSNV repository (you'll need to reclone it and reinstall the python scsnvmisc package).

The command works as follows:

scsnv2mtx pileup_annotated.h5 out_dir

This will write the following files:

alts.mtx A sparse market matrix for the alt counts refs.mtx A sparse market matrix for the ref counts barcodes.txt The barcodes for each column of the mtx file snvs.csv the SNV data for each row of the mtx files

The mtx files are in the format N (snvs) x M (barcodes).

This data is unfiltered and I recommend removing edits etc (that data is in the snvs.csv file) and I would remove SNVs detected in very few cells.

Gavin

chuiqin commented 1 year ago

Thank you for your selfless help. I am very touched by the enthusiasm of your reply. I would love to introduce this useful tool to more people.

However, I am still confused about the output of scsnv2mtx.

If I want to count the number of SNV mutations in cells, do I only use the three files alts.mtx, barcodes.txt and snvs.csv. Through these three files, I know whether the SNV is expressed in the cell and the count of the SNV in this cell. I only have a vague understanding of the difference between alts.mtx and ref.mtx files. In my opinion, ref is the allele in the reference genome and alts are the other alleles found at this locus that are distinct from the reference genome. Therefore, the ref.mtx file records the count of loci in each cell that are identical to the reference genome. The alts.mtx file records the count of loci in each cell that are different from the reference genome. Is this understanding correct?

In addition, I have a new idea for the use of scsnv, I hope to get your help.

In this article (https://www.nature.com/articles/s41467-022-28803-w), the authors used SNV to identify tumor cells as follows:

ScType utilizes raw scRNA-seq data to identify single-nucleotide variants (SNVs) present in each cell. ScType processes the raw input scRNA-seq BAM file using samtools and call the SNVs using Strelka2. Next, ScType connects the SNV to each cluster using vartrix. As an extended feature, ScType also calculates the sum of total number of SNV present in the COSMIC cancer census genes as the combined SNV score (ScType SNV score) for each cluster in a cancerous tissue profile. More specifically, ScType SNV score summarizes the number of point mutations in the cancer genes for each cell type, calculated as the percentage of SNVs above the median SNV value within the particular cell cluster.

Briefly, the authors of this article found that the number of SNVs is a potential indicator for distinguishing normal cells from malignant cells. And I would like to refer to this article for a parody.

Although I saw a description of the SNV location in the file snvs.csv, my poor bioinformatics foundation prevented me from combining the SNV in the output of the scsnv package with the gene that should belong to it. I hope to get your help.

I hope to generate the following 4 files from the pileup_annotated.h5 file:

  1. Matrix (columns are cells, behavioral genes, and the value is the number of SNVs. This matrix can clearly know the number of SNVs that occur in a gene in a cell.)
  2. List (genes and SNVs belonging to genes. This list can easily query which SNVs are contained in this gene in this sample. This means that the relationship between SNVs annotated to genes is clear.)
  3. Matrix (The first column is SNV, and the second column is the gene to which the SNV is annotated. This is an overview matrix, you can see which SNVs are annotated to genes, and which SNVs cannot be annotated to genes.)
  4. Matrix (the column is the cell, the row is the SNV, and the value is the count number of SNV in the cell. Can this matrix be obtained from the three files alts.mtx, barcodes.txt and snvs.csv. )

I believe that combining SNVs with genes will greatly expand the application of the scsnv package, especially in the identification of malignant cells of the blood system. Many hematological tumors have few obvious CNV changes, so the copyKat and inferCNV packages are often limited in identifying malignant cells. And almost all hematological tumors have SNV changes, which suggests that SNV is a possible indicator to distinguish normal cells from malignant cells. The scsnv can simplify the process of binding SNVs to genes. When we have the number and sum of SNVs, we can distinguish normal cells from malignant cells in various ways (such as classification trees in machine learning, gene set enrichment methods, gene set scoring methods, etc.). The linkage of scsnv with other methods makes this step infinitely possible. What a wonderful thing this must be!

Looking forward to your reply! Thanks again!

GWW commented 1 year ago

Hi,

Exactly, the two matrices inform you on the number of molecules supporting the reference and alternative base for each mutation in each barcode. The market matrix format is a sparse format because many of the mutations have 0 counts in both the reference and alternative base. You can read more about the format here. There are parsers for this format in both python and R ( I am not sure what is used with R). It should be relatively straight forward to calculate the number of SNVs with at least 1 alternative count for each cell from these matrices. I would filter out rows that are marked as RNA edits in the snvs.csv file.

chuiqin commented 1 year ago

Thank you very much for your reply. I would like to ask if is there any way to annotate SNVs to genes. I want to extract the number of SNVs in each gene (mainly cancer-common genes) in each cell to differentiate between normal cells and tumor cells in the blood system. Can you give me some suggestions? Thanks!

GWW commented 1 year ago

I am sure there are open source tools you can use for that such as bedtools.