LRGASP / lrgasp-submissions

Definition and validators for LRGASP submissions
MIT License

read id to model format #11

Closed (fairliereese closed 3 years ago)

fairliereese commented 4 years ago

I am wondering if the expectation for this file is to include reads from all analyzed replicates in the same file, or if we're expecting users to provide separate files for each sample.

Additionally, how will we handle transcript IDs that are not present in the corresponding GTF? For instance, in my case TALON assigns a transcript model to each read, but we filter models so that only some of them make the cut for the GTF. The names of the filtered-out models currently remain in the read-to-model file, and I'm curious how we should approach them.

julienlag commented 4 years ago

My opinion is that:

  1. there should be one read-to-model file per model file. In other words if you submitted sample1_replicate1_models.gtf, you should submit a sample1_replicate1_readToModel.tsv file. If, on the other hand, you submitted your models based on merged replicates, then you should submit sample1_models.gtf and sample1_readToModel.tsv.

  2. _readToModel.tsv should contain every transcript_id found in _models.gtf. We probably also want every transcript_id in _readToModel.tsv to be present in _models.gtf. This way we could easily detect mis-paired files during validation (validation would fail at the first mismatch).

  3. on a related note: currently https://github.com/diekhans/lrgasp-submissions/blob/master/docs/reads_transcript_map_format.md states that "If a read is not used at all to generate any transcript model, it may be show in its second column a *". Does that mean we would require all read IDs present in the original FASTQ to be also present in _readToModel.tsv? "Rescuing" read IDs that didn't make it into any model may be cumbersome for some workflows and I don't really see the point in listing them in _readToModel.tsv anyway. We could instead assume that any read ID absent from _readToModel.tsv was pre-filtered/unused as evidence for model building.

Does that make sense?
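
A minimal sketch of points 1 and 2, assuming the `_models.gtf` / `_readToModel.tsv` naming used in the examples above. The function names `expected_map_name` and `id_mismatches` are hypothetical and not part of the lrgasp-submissions code; the ID sets are assumed to have been extracted from the two files already.

```python
def expected_map_name(models_name: str) -> str:
    """Derive the read-to-model file name paired with a models GTF name.

    Assumes the hypothetical convention <prefix>_models.gtf ->
    <prefix>_readToModel.tsv taken from the examples in this thread.
    """
    suffix = "_models.gtf"
    if not models_name.endswith(suffix):
        raise ValueError(f"not a models file name: {models_name!r}")
    return models_name[: -len(suffix)] + "_readToModel.tsv"


def id_mismatches(model_ids, map_ids):
    """Return (only_in_models, only_in_map) as sorted lists.

    Both lists being empty means the two files refer to exactly the
    same set of transcript IDs; anything else suggests mis-paired files.
    """
    model_ids, map_ids = set(model_ids), set(map_ids)
    return sorted(model_ids - map_ids), sorted(map_ids - model_ids)
```

For example, `id_mismatches({"t1", "t2"}, {"t2", "t3"})` reports `t1` as missing from the map and `t3` as missing from the models. A validator that should fail at the first mismatch could stop as soon as either list is non-empty.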

fairliereese commented 4 years ago

Got it, yes this makes sense and seems to be a logical approach! Thanks for the clarification.

julienlag commented 4 years ago

WRT file validation, we should check that every transcript_id present in the model GTF appears at least once in the map file.
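
A rough sketch of that check, not the actual validator. It assumes the map file has two tab-separated columns (read ID, then transcript ID), that `*` in the second column marks an unused read, and that a header row, if present, starts with `read_id`. All function names are hypothetical.

```python
import re

# Matches the transcript_id attribute in the 9th GTF column.
_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(lines):
    """Collect transcript IDs from an iterable of GTF lines."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = _TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def map_transcript_ids(lines):
    """Collect transcript IDs from read-to-model lines (assumed layout)."""
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0] == "read_id":
            continue  # skip malformed rows and an assumed header row
        if fields[1] != "*":  # "*" marks a read not used in any model
            ids.add(fields[1])
    return ids


def unsupported_models(gtf_lines, map_lines):
    """Return sorted transcript IDs in the GTF with no read in the map."""
    return sorted(gtf_transcript_ids(gtf_lines) - map_transcript_ids(map_lines))
```

Both functions accept any iterable of lines, so open file handles can be passed directly. A non-empty result from `unsupported_models` would mean validation fails.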

diekhans commented 3 years ago

Validation code now requires all models to be supported by at least one read.