LRGASP / lrgasp-submissions

Definition and validators for LRGASP submissions
MIT License

consistency of transcript_id's in model GTF vs expression matrix #16

Closed julienlag closed 3 years ago

julienlag commented 4 years ago

transcript_id's in the expression matrix are required to match the transcript_id's in the model GTF. Both files should therefore derive from exactly the same input: <sample>_models.gtf and <sample>_expressionMatrix.tsv for merged replicates, or <sample>_<replicate>.gtf and <sample>_<replicate>_expressionMatrix.tsv for individual replicates. In other words, if a participant submits a per-replicate model GTF, they should also submit per-replicate expression values. Similarly, a merged-replicates GTF should come with merged-replicates expression values.
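As a rough illustration of the matching requirement, here is a minimal sketch that compares the transcript_id's found in a model GTF with those in an expression matrix. It assumes the matrix is tab-separated with a header row and that transcript IDs sit in a column named `ID`; that column name and the function names are hypothetical and not taken from the LRGASP validators.

```python
import csv
import re

# Matches the transcript_id attribute in the 9th GTF column.
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(gtf_lines):
    """Collect transcript_id values from an iterable of GTF lines."""
    ids = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID_RE.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def matrix_transcript_ids(tsv_lines, id_column="ID"):
    """Collect IDs from a TSV expression matrix (id_column is an assumption)."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return {row[id_column] for row in reader}


def compare_ids(gtf_lines, tsv_lines):
    """Return (IDs only in the GTF, IDs only in the matrix), both sorted."""
    gtf_ids = gtf_transcript_ids(gtf_lines)
    matrix_ids = matrix_transcript_ids(tsv_lines)
    return sorted(gtf_ids - matrix_ids), sorted(matrix_ids - gtf_ids)
```

Both functions accept any iterable of lines, so an open file handle works as well as a list. A submission would be consistent when both returned lists are empty.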

There may also be conflicting transcript_id's across samples. For example, transcript_id PB.10.1 may have been used in two completely different samples and correspond to two totally different transcript structures. In that case we'd have trouble telling them apart in the expression matrix, which seems overly error-prone. Wouldn't it be simpler to require one expression file per GTF, or alternatively ask for expression values to be included directly in the GTF (e.g. as a 9th field attribute)?
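To make the cross-sample conflict concrete, here is a small sketch that flags transcript_id's whose exon structure differs between samples. It assumes each sample's GTF carries `exon` records with a transcript_id attribute; the function names and the sample labels used as dictionary keys are hypothetical.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def transcript_structures(gtf_lines):
    """Map each transcript_id to a sorted tuple of (chrom, strand, start, end) exons."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records define the structure here
        match = TRANSCRIPT_ID_RE.search(fields[8])
        if match:
            exon = (fields[0], fields[6], int(fields[3]), int(fields[4]))
            exons[match.group(1)].append(exon)
    return {tid: tuple(sorted(ex)) for tid, ex in exons.items()}


def conflicting_ids(gtf_by_sample):
    """Return transcript_ids that map to more than one distinct structure."""
    seen = defaultdict(set)
    for lines in gtf_by_sample.values():
        for tid, structure in transcript_structures(lines).items():
            seen[tid].add(structure)
    return sorted(tid for tid, structures in seen.items() if len(structures) > 1)
```

An ID such as PB.10.1 would be reported if it appears in two samples with different exon coordinates, which is exactly the case that a single shared expression matrix could not disambiguate.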

diekhans commented 3 years ago

IDs are now validated.