LRGASP / lrgasp-submissions

Definition and validators for LRGASP submissions
MIT License

unused reads in read-to-model map file format #33

Closed julienlag closed 3 years ago

julienlag commented 3 years ago

In https://github.com/LRGASP/lrgasp-submissions/blob/a0d90276c682e007152b586a154be0fa4f0105dc/docs/read_model_map_format.md: "If a read is not used to generate any transcript model, it may be shown in its second column a *. However, we should check if all the read IDs present in the initial FASTQ file are included or not in this read-model file." I find this wording rather ambiguous. It is not clear whether unused read IDs are required in the TSV file. I personally don't think they should be, given that (1) their usefulness is not obvious to me, and (2) in some pipelines the original reads undergo many filtering steps before reaching the final merging algorithm, so it may be discouraging for participants to have to recover the IDs of reads that were filtered out far upstream.

diekhans commented 3 years ago

Thanks Julien, I agree that this is neither well documented nor useful. It is also not enforced by validate. Changed the text to:

"If a read is not used to generate any transcript model it may be omitted or have its transcript_id column specified as *."
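For anyone consuming these files, here is a minimal sketch of a reader that accepts both conventions allowed by the revised wording (unused reads omitted, or listed with `*`). It is not the code used by the validator. It assumes a two-column TSV with the read ID first and the transcript ID second; the function name `load_read_model_map` and the optional `read_id` header check are hypothetical and only there for illustration.

```python
import csv
import io

UNUSED_MARKER = "*"  # transcript_id value for a read not used in any model


def load_read_model_map(lines):
    """Parse a read-to-model map (assumed layout: read_id<TAB>transcript_id).

    Returns (used, unused), where `used` maps each read ID to the set of
    transcript IDs it supports and `unused` is the set of read IDs that
    were explicitly listed with '*'. Reads omitted from the file simply
    do not appear in either result.
    """
    used = {}
    unused = set()
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected 2 columns, got: {row!r}")
        read_id, transcript_id = row[0], row[1]
        if read_id == "read_id":
            continue  # assumption: tolerate an optional header row
        if transcript_id == UNUSED_MARKER:
            unused.add(read_id)
        else:
            used.setdefault(read_id, set()).add(transcript_id)
    return used, unused


# read3 is absent from the file, which is treated the same as "not used".
example = io.StringIO("read1\ttx1\nread2\t*\nread1\ttx2\n")
used, unused = load_read_model_map(example)
```

With this approach a downstream tool never needs the full list of FASTQ read IDs: a read that is missing and a read marked `*` both end up contributing to no transcript model.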
