ctlab / LinSeed

Linseed: LINear Subspace identification for gene Expression Deconvolution
MIT License
28 stars 8 forks

Hello! Could you post a quick tutorial on how to format a linseed object? #7

Open methornton opened 5 years ago

methornton commented 5 years ago

Hello!

I am working through the tutorial and I have my own RNA-seq data that I would like to process with linseed. Does the LinseedObject function require the data to be formatted exactly like "GSE19830_series_matrix.txt"? I have an RNA-seq data set that has gene annotation, raw counts, and RPKM. I don't know how many cell types are present, but I expect at least 10-12.

Can you tell me which of these fields must be supplied?

Fields:

     ‘exp’ List of two elements raw and normalized gene expression
          dataset

     ‘name’ Character, optional, dataset name

     ‘cellTypeNumber’ Identified cell type number, required for
          projection, corner detection and deconvolution

     ‘projection’ Projection of genes into space lower-dimensionality
          (presumably simplex)

     ‘endpoints’ Simplex corners (in normalized, non-reduced space)

     ‘endpointsProjection’ Simplex corners (in reduced space)

     ‘distances’ Stores distances for every gene to each corner in
          reduced space

     ‘markers’ List that stores signatures genes for deconvolution, can
          be set manually or can be obtained by ‘selectGenes(k)’

     ‘signatures’ Deconvolution signature matrix

     ‘proportions’ Deconvolution proportion matrix

     ‘pairwise’ Calculated pairwise collinearity measure

The header of my RNA-seq data looks like this:

EnsemblID   EntrezID    RGD_ID  Geneme  GeneType    logFC   logCPM  LR  PValue  FDR SA33599_rev SA33601_rev SA33604_rev SA33598_rev SA33600_rev SA33602_rev SA33603_rev SA33605_rev SA33606_rev SA33598_rev_RPKM    SA33599_rev_RPKM    SA33600_rev_RPKM    SA33601_rev_RPKM    SA33602_rev_RPKM    SA33603_rev_RPKM    SA33604_rev_RPKM    SA33605_rev_RPKM    SA33606_rev_RPKM    Chr Strand  length  NoExons RNACentralID    miRBaseID   miRBaseACC  TM_Helix    HAMAP_ID    Description
ENSRNOG00000005609  29458   3165    Neurod1 protein_coding  -4.41557073893638   5.09105209110567    111.392747290707    4.85365557971023E-26    7.76293673418854E-22    174 218 11  16  41  27  42  388 5   0.720808668436819   13.0576466284548    1.93454971657025    10.9567107210054    1.75455632681648    1.57289902939802    0.773458305076102   14.4906203372679    0.33082393003395    3   -1  5248    3                       neuronal differentiation 1 [Source:RGD Symbol;Acc:3165]
ENSRNOG00000003680  25451   2650    Gabrb2  protein_coding  -4.82293017899498   4.31972433520164    107.686834920917    3.14786664687739E-25    2.51734895750785E-21    98  134 5   6   144 14  25  225 3   0.672937124992248   18.3090140777356    16.9153796849254    16.7668593078227    2.2649301153254 2.33085245162443    0.875260735086981   20.919966761681 0.494164322054507   10  1   2108    10              TMhelix     gamma-aminobutyric acid type A receptor beta 2 subunit [Source:RGD Symbol;Acc:2650]

I can get the 'normCounts' from the R package 'edgeR' if this is necessary. How should I format it? Any advice or assistance is greatly appreciated! Thank you!
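A minimal base R sketch of turning a table like the one above into a numeric genes-by-samples matrix. It assumes the normalized columns are the ones ending in `_RPKM`, as in the header shown, and it reads a small inline excerpt (two samples, two genes) instead of the full file; whether linseed needs anything beyond a plain matrix is not confirmed here.

```r
# Small inline excerpt standing in for the full tab-separated table.
# Only a few of the columns from the header above are reproduced.
txt <- paste(
  "EnsemblID\tSA33598_rev\tSA33599_rev\tSA33598_rev_RPKM\tSA33599_rev_RPKM",
  "ENSRNOG00000005609\t16\t174\t0.720808668436819\t13.0576466284548",
  "ENSRNOG00000003680\t6\t98\t0.672937124992248\t18.3090140777356",
  sep = "\n"
)

# For a real file, replace `text = txt` with the path to the table.
tab <- read.delim(text = txt, check.names = FALSE, stringsAsFactors = FALSE)

# Keep only the normalized (RPKM) columns, identified by their suffix.
rpkm_cols <- grep("_RPKM$", colnames(tab), value = TRUE)
expr <- as.matrix(tab[, rpkm_cols, drop = FALSE])

# Genes in rows (IDs as row names), samples in columns.
rownames(expr) <- tab$EnsemblID
colnames(expr) <- sub("_RPKM$", "", rpkm_cols)

print(expr)
```

The annotation and differential-expression columns (logFC, PValue, and so on) are dropped, so the result contains only expression values.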

pushtiks commented 4 years ago

Hi! I'm also trying/testing linseed and used CPMs (from edgeR), TPMs (from RSEM) and also FPKM (cufflinks) matrices.

Matrices looked like:

transcript_id sample1 sample2 sample3    <-------- header
ENST000000000 5.456 7.876 4.194    <-------- transcript/gene ID and its expression values per sample in CPMs/TPMs/FPKMs

I entered the expected cell type number by hand in the R script. I don't know whether linseed allows supplying more than one number at once, so I just tried a different expected number on each script run.

So far my results are not as beautiful as they could be.

A more detailed tutorial would be appreciated! :)
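On picking which cell type numbers to try: one generic heuristic, which is not necessarily what linseed does internally, is to look at how much variance each singular component of the expression matrix explains. The sketch below uses simulated data (a made-up mixture of 3 cell types plus small noise), so every name and number in it is illustrative only.

```r
set.seed(1)
n_genes <- 200
n_samples <- 12
k_true <- 3  # number of simulated cell types (illustrative)

# Simulated cell type signatures (genes x cell types).
signatures <- matrix(rexp(n_genes * k_true), nrow = n_genes)

# Simulated mixing proportions (cell types x samples), each column sums to 1.
proportions <- matrix(runif(k_true * n_samples), nrow = k_true)
proportions <- sweep(proportions, 2, colSums(proportions), "/")

# Linear mixture plus a small amount of noise.
expr <- signatures %*% proportions +
  matrix(rnorm(n_genes * n_samples, sd = 0.01), nrow = n_genes)

# Share of total variance captured by each singular component.
d <- svd(expr)$d
var_explained <- d^2 / sum(d^2)
cum_var <- cumsum(var_explained)

print(round(cum_var[1:6], 4))
```

In a clean linear mixture the cumulative variance levels off at the number of cell types. Real data is noisier, so this gives a range of candidates to try rather than a single answer.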

konsolerr commented 4 years ago

@methornton

You can just provide the expression matrix (basically a matrix object) to the constructor of the Linseed class. I would suggest using something like TPMs, or any normalization that already takes library size into account.

Cheers and sorry for the slow replies, Konstantin
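The suggestion above can be sketched in base R. Converting RPKM/FPKM values to TPM only requires rescaling each sample column so that it sums to one million over all genes in the matrix. The matrix below is made up for illustration, and the commented constructor call at the end is an assumption based on the reply above (R6-style `$new()`); check the package help for the exact arguments.

```r
# Made-up RPKM matrix: genes in rows, samples in columns.
rpkm <- matrix(
  c(10, 30, 60,
     5,  5, 40),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

# TPM rescales each sample (column) so that its values sum to 1e6.
rpkm_to_tpm <- function(x) sweep(x, 2, colSums(x), "/") * 1e6

tpm <- rpkm_to_tpm(rpkm)
print(colSums(tpm))

# Assumed usage, not verified here:
# library(linseed)
# lo <- LinseedObject$new(tpm)
```

Because every column ends up with the same total, differences in library size no longer dominate the comparison between samples.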