VCCRI / Ularcirc

An R-shiny app that provides backsplice and canonical splicing analysis for both circular RNA (circRNA) and parental transcripts
GNU General Public License v3.0

Gene Count Table Requirements Missing #8

Closed DarioS closed 4 years ago

DarioS commented 4 years ago

Could there be a specification provided of what format a valid gene count table must have? I have gene counts from RSEM and I am wondering how to convert them into a format that Ularcirc requires.

$ head OC1.genes.results # Using GENCODE Genes
gene_id transcript_id(s)        length  effective_length        expected_count  TPM     FPKM
ENSG00000000003.14      ENST00000373020.8,ENST00000494424.1,ENST00000496771.5,ENST00000612152.4,ENST00000614008.4       2229.13 2077.88 2361.00 6.97    24.07
ENSG00000000005.6       ENST00000373031.5,ENST00000485971.1     1205.00 1053.75 4.00    0.02    0.08
ENSG00000000419.12      ENST00000371582.8,ENST00000371584.8,ENST00000371588.9,ENST00000413082.1,ENST00000466152.5,ENST00000494752.1     1078.13 926.88  583.00  3.86    13.33
ENSG00000000457.14      ENST00000367770.5,ENST00000367771.11,ENST00000367772.8,ENST00000423670.1,ENST00000470238.1      3750.92 3599.66 565.00  0.96    3.33
ENSG00000000460.17      ENST00000286031.10,ENST00000359326.9,ENST00000413811.3,ENST00000459772.5,ENST00000466580.6,ENST00000472795.5,ENST00000481744.5,ENST00000496973.5,ENST00000498289.5      2727.37 2576.12    94.00   0.22    0.77
ENSG00000000938.13      ENST00000374003.7,ENST00000374004.5,ENST00000374005.8,ENST00000399173.5,ENST00000457296.5,ENST00000468038.1,ENST00000475472.5   1925.87 1774.62 60.00   0.21    0.72
ENSG00000000971.15      ENST00000359637.2,ENST00000367429.8,ENST00000466229.5,ENST00000470918.1,ENST00000496761.1,ENST00000630130.2     3523.08 3371.83 1724.99 3.14    10.84
ENSG00000001036.13      ENST00000002165.10,ENST00000367585.1,ENST00000451668.1  2255.20 2103.94 693.00  2.02    6.98
ENSG00000001084.13      ENST00000504353.1,ENST00000504525.1,ENST00000505197.1,ENST00000505294.5,ENST00000509541.5,ENST00000510837.5,ENST00000513939.6,ENST00000514004.5,ENST00000514373.3,ENST00000514933.2,ENST00000515580.1,ENST00000616923.5,ENST00000643939.1,ENST00000650454.1        2066.31 1915.08 703.00  2.25    7.78

Also,

For full functionality at least one FSJ, one BSJ, and one gene count data set be loaded per sample.

What are other possible combinations of uploaded files, and which kinds of analysis can be done if those combinations are uploaded? What is the reduced functionality that this sentence hints at?

davhum commented 4 years ago

Sorry for the delayed reply.

Short answer: the gene count file is not used for most of Ularcirc's core functionality. It is only used for some of the QC plots (e.g. PCA), where BSJ counts are normalised against a CPM. The most important files are the FSJ and BSJ files, so in this instance you might consider not uploading the gene count file.

=============================================================

Gene count table format:

Ularcirc only accepts gene counts in the format generated by the STAR aligner, i.e.:

gene name \t unstranded counts \t forward strand counts \t reverse strand counts

Unstranded counts should equal forward strand counts + reverse strand counts (but this is not checked).
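As a rough illustration of the four-column layout described above, here is a minimal Python sketch that parses such a table and flags rows where the unstranded count differs from forward + reverse. It is not part of Ularcirc: the names `GeneCount`, `parse_gene_counts` and `inconsistent_rows` are hypothetical, and the gene names and counts in the usage note are made up.

```python
import csv
import io
from typing import List, NamedTuple


class GeneCount(NamedTuple):
    """One row of the table (hypothetical helper type, not a Ularcirc class)."""
    gene: str
    unstranded: int
    forward: int
    reverse: int


def parse_gene_counts(text: str) -> List[GeneCount]:
    """Parse tab-separated rows: gene name, unstranded, forward, reverse."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for line_no, fields in enumerate(reader, start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(
                f"line {line_no}: expected 4 columns, found {len(fields)}"
            )
        gene, unstranded, forward, reverse = fields
        # int() raises ValueError on non-integer values such as "1724.99"
        rows.append(GeneCount(gene, int(unstranded), int(forward), int(reverse)))
    return rows


def inconsistent_rows(rows: List[GeneCount]) -> List[GeneCount]:
    """Return rows where unstranded != forward + reverse."""
    return [r for r in rows if r.unstranded != r.forward + r.reverse]
```

For example, with the made-up input `"geneA\t10\t2\t8\ngeneB\t5\t1\t3\n"`, `parse_gene_counts` returns two rows and `inconsistent_rows` reports only the `geneB` row. A fractional value in a count column is rejected with a `ValueError`.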

Most Illumina libraries will generate counts on the negative strand (i.e. the opposite strand to the gene model). I see that RSEM does not generate a raw count output, which is what Ularcirc expects. Ularcirc calculates a CPM from the raw counts in order to normalise BSJ counts. If you pass the RSEM TPM or FPKM values, Ularcirc will generate an incorrect CPM. However, this will only affect PCA plots that are generated with a CPM option.
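To make the CPM point concrete, here is a minimal Python sketch of the standard counts-per-million formula (count × 1,000,000 / total counts). It is an assumption about the general calculation, not Ularcirc's actual code; the function names `counts_per_million` and `bsj_cpm` are hypothetical, as is the idea that a BSJ count is scaled by the gene-count library size.

```python
from typing import Dict


def counts_per_million(counts: Dict[str, float]) -> Dict[str, float]:
    """Scale each gene's count so that the whole library sums to one million."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("total count must be positive")
    return {gene: value * 1_000_000 / total for gene, value in counts.items()}


def bsj_cpm(bsj_count: float, library_size: float) -> float:
    """Express a BSJ count per million gene-level counts (assumed normalisation)."""
    if library_size <= 0:
        raise ValueError("library size must be positive")
    return bsj_count * 1_000_000 / library_size
```

With made-up raw counts `{"geneA": 750, "geneB": 250}`, `counts_per_million` gives 750000.0 and 250000.0. TPM values already sum to one million across a sample, so passing them through the same function returns them essentially unchanged: the result is still length-normalised TPM and not a CPM, which is why the resulting plots would be misleading.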

davhum commented 4 years ago

I am closing this issue now, but am happy to discuss other functionality that could be used with the gene count file.