LRGASP / lrgasp-submissions

Definition and validators for LRGASP submissions
MIT License

Gene vs. transcript level expression #17

Open fairliereese opened 4 years ago

fairliereese commented 4 years ago

I know I keep asking a bunch of questions that are probably specific for the pipeline we use, sorry if these are mostly irrelevant for others...

In our case, we use filtered and unfiltered abundance files to quantify transcripts and genes, respectively. This is because our pipeline does not try to assign incomplete transcript reads to known transcript models and instead creates its own transcript models. These models often don't pass our filter, so we don't perform transcript-level quantification on them, but we still use them for gene-level quantification.

What would be the recommended course of action here? This matters especially because including the unfiltered transcripts in the expression matrix as it exists now would yield transcript entries with no corresponding gene entry, since we exclude those transcripts from the models.gtf file.

I also can't recall whether gene-level quantification is even one of the challenges we're going to be scoring; if it isn't, this question is irrelevant.

julienlag commented 4 years ago

According to https://github.com/diekhans/lrgasp-submissions/blob/master/docs/expression_matrix_format.md: "Gene expression will be calculated summing up the expression values of all the transcripts coming from the same locus." I think that means we don't expect participants to submit gene expression values. Instead, the evaluation pipeline will build gene models from the submitted transcript models and calculate gene expression (GE) values by summing the transcript expression (TE) values, if I understand correctly. You're raising a valid point that would be worth discussing, however.
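
To make the summing rule concrete, here is a minimal sketch of how gene-level values could be derived from transcript-level values. It is not the actual evaluation pipeline: the function name `gene_expression`, the dictionary inputs, and the IDs are all hypothetical, and the transcript-to-gene mapping is assumed to have been read from the submitted GTF. It also shows the problem raised above, where a transcript in the matrix has no gene entry.

```python
from collections import defaultdict


def gene_expression(transcript_expr, transcript_to_gene):
    """Sum transcript expression values per gene (hypothetical helper).

    Returns the per-gene totals and the list of transcript IDs that
    have no gene assignment and therefore cannot be summed anywhere.
    """
    totals = defaultdict(float)
    orphans = []
    for transcript_id, value in transcript_expr.items():
        gene_id = transcript_to_gene.get(transcript_id)
        if gene_id is None:
            # Transcript appears in the matrix but not in the models file.
            orphans.append(transcript_id)
            continue
        totals[gene_id] += value
    return dict(totals), orphans


# Made-up example values; "tx4" stands for an unfiltered transcript
# that was excluded from the models file.
transcript_expr = {"tx1": 5.0, "tx2": 2.5, "tx3": 1.0, "tx4": 4.0}
transcript_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_totals, orphans = gene_expression(transcript_expr, transcript_to_gene)
print(gene_totals)  # {'geneA': 7.5, 'geneB': 1.0}
print(orphans)      # ['tx4']
```

In this sketch the expression of `tx4` is silently lost at the gene level, which is the gap the question is about.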

fairliereese commented 4 years ago

Perhaps we can allow users to optionally include a list of the transcript models they consider to be their actual high-confidence models? That way, they can provide gene-to-transcript ID information in the GTF for models that definitely belong to a specific gene but are not good enough to be considered models on their own, and can still use them for gene-level quantification.
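
As a rough sketch of what this proposal could look like in practice, assuming a hypothetical optional set of high-confidence transcript IDs supplied alongside the GTF (none of these names exist in the current submission format):

```python
def split_quantification(transcript_expr, transcript_to_gene, high_confidence):
    """Hypothetical helper illustrating the proposal.

    Transcript-level values are kept only for high-confidence models,
    while gene-level values are summed over every transcript in the GTF.
    """
    transcript_level = {
        tid: value
        for tid, value in transcript_expr.items()
        if tid in high_confidence
    }
    gene_level = {}
    for tid, value in transcript_expr.items():
        gene_id = transcript_to_gene[tid]
        gene_level[gene_id] = gene_level.get(gene_id, 0.0) + value
    return transcript_level, gene_level


# Made-up example: "tx2" belongs to geneA but is not a high-confidence model.
expr = {"tx1": 5.0, "tx2": 2.5, "tx3": 1.0}
tx_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
confident = {"tx1", "tx3"}

transcript_level, gene_level = split_quantification(expr, tx_to_gene, confident)
print(transcript_level)  # {'tx1': 5.0, 'tx3': 1.0}
print(gene_level)        # {'geneA': 7.5, 'geneB': 1.0}
```

Here `tx2` still contributes to the total for `geneA` but would not be evaluated as a transcript model in its own right.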