diogomribeiro / sc_cop

Single cell local gene co-expression project
MIT License

Questions about the format of `gene expression matrix`, `gencode_v19.bed` and `peak_matrix.tsv` in `ShareSeqCoex.py` #6

Open jiangpuxuan opened 1 year ago

jiangpuxuan commented 1 year ago

The sc_cop package has been working very well for me. When I tried to use ShareSeqCoex.py to analyze my scATAC and scRNA data, I ran into some questions about the format of the gene expression matrix, gencode_v19.bed and peak_matrix.tsv.

Gene expression matrix

        Read gene expression matrix in sparse format.

        Example format (TSV, header):
            gene    cell    value
            ENSG00000238009 R1.02,R2.10,R3.20,P1.51 1
            ENSG00000238009 R1.17,R2.69,R3.50,P1.51 1
            ENSG00000238009 R1.20,R2.74,R3.17,P1.50 1

        Note: calculates how many cells in input (i.e. only genes and cells that have non-zero value may be counted)

Does R1.02,R2.10,R3.20,P1.51 represent the barcode of a single cell? Does the value represent the expression of that gene in that cell?

gencode_v19.bed

For gencode_v19.bed (read_gene_models):

def read_gene_models(self):
        """
        Read file with cells on rows and information about them on columns.
        **"cell_name" and "donor" columns are required.**

        Example format (TSV, no header):
            1       .  gene    11869   14362   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; \
                gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; \
                transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

        """

The example format does not seem to contain the "cell_name" and "donor" columns that the docstring says are required.

peak_matrix.tsv

My fragments.tsv goes like this:

#chr start end cell value
1 15458 15755 "TAGTCCCCAAACCTAC-1" 1
1 15499 15918 "TCGAGCGGTTCCAATG-1" 1
1 15501 15662 "AGGCCCAGTTCTTAGG-1" 1
1 15541 15751 "CTGGGACTCTGAAAGA-1" 1
1 15547 15721 "TGCTTTAGTTATGCAC-1" 1
1 15555 15731 "GGAATCTTCCCAATAG-1" 1
1 15565 15731 "GAGGTCCGTTGTGAGG-1" 2
1 15565 15759 "GTCCATCTCCGCGATG-1" 2
1 15591 15721 "GAAGTCTTCCACGGCA-1" 2
1 15609 15827 "TAGGTGTTCGAGAAGC-1" 1

but the example looks like this:

Example format (TSV, no header):

            chr21   15352400        15352499        chr21   15352220        15352427        R1.51,R2.57,R3.95,P1.03
            chr21   15352400        15352499        chr21   15352129        15352459        R1.39,R2.22,R3.07,P1.02

        Last column must be cell ID, first 3 columns must be BED (coordinates of peak)

What do columns 4 to 6 mean? How should I reshape my data?

Thank you for your help!

diogomribeiro commented 1 year ago

Hi, first I'd like to warn that I wrote this script to process the SHARE-seq data, so it may need some adaptations for other datasets. For instance, I'm removing the last 3 digits from the barcodes to match the RNA-seq and ATAC-seq barcodes, which probably doesn't work with other datasets. Another thing is that we used binary data (expression/ATAC-seq coded as 0 or 1).

To answer your questions:
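Before the individual answers, here is a minimal sketch of the two SHARE-seq-specific adaptations mentioned above: trimming the barcode suffix and coding values as 0 or 1. It is not the code in `ShareSeqCoex.py`. The helper names (`trim_barcode`, `binarize`, `read_sparse_matrix`) are hypothetical, and dropping exactly the trailing three characters is only one reading of the comment above.

```python
import csv
import io


def trim_barcode(barcode: str, n_chars: int = 3) -> str:
    """Drop the trailing n_chars characters of a barcode (hypothetical helper).

    Assumption: the suffix to remove is the last three characters,
    e.g. '.51' in 'R1.02,R2.10,R3.20,P1.51'. Use n_chars=0 to keep
    barcodes unchanged for datasets that do not need trimming.
    """
    return barcode[:-n_chars] if n_chars > 0 else barcode


def binarize(value: str) -> int:
    """Code any non-zero count as 1 and zero as 0."""
    return 1 if float(value) > 0 else 0


def read_sparse_matrix(handle, n_chars: int = 3) -> dict:
    """Read a 'gene<TAB>cell<TAB>value' TSV (with header) into {(gene, cell): 0/1}."""
    reader = csv.DictReader(handle, delimiter="\t")
    matrix = {}
    for row in reader:
        key = (row["gene"], trim_barcode(row["cell"], n_chars))
        # If trimming collapses two barcodes into one, keep 1 if either was non-zero.
        matrix[key] = max(matrix.get(key, 0), binarize(row["value"]))
    return matrix


# Two rows in the sparse format quoted in the question (second value made up).
example = io.StringIO(
    "gene\tcell\tvalue\n"
    "ENSG00000238009\tR1.02,R2.10,R3.20,P1.51\t1\n"
    "ENSG00000238009\tR1.17,R2.69,R3.50,P1.51\t3\n"
)
matrix = read_sparse_matrix(example)
```

For barcodes such as `TAGTCCCCAAACCTAC-1`, the trimming step would most likely need to be disabled (`n_chars=0`) or replaced by whatever rule makes the RNA and ATAC barcodes of that dataset match.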