NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

`agat_sp_extract_sequences.pl`: support for multicistronic transcripts. #451

Open jdcla opened 2 months ago

jdcla commented 2 months ago

Is your feature request related to a problem? Please describe.

Currently `agat_sp_extract_sequences.pl` (and possibly other scripts) does not support multicistronic transcripts. Although many GTF/GFF tools do not support this feature, studies increasingly indicate the existence of translated ORFs positioned upstream/downstream/... of canonical coding sequences.

Describe the solution you'd like

I would like `agat_sp_extract_sequences.pl` to handle multiple CDSs defined per transcript/mRNA feature. To start off, the tool would use CDS IDs rather than transcript IDs as FASTA headers (see this issue). Currently, I think the tool ignores or merges multicistronic CDSs that share a transcript ID.
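To make the request concrete, here is a rough Python sketch of the grouping I have in mind. It is not how AGAT works internally. The GFF3 fragment, the IDs (`tx1`, `tx1.uorf`, `tx1.main`), and the helper names `parse_attributes` and `group_cds_by_id` are all made up for illustration. It assumes every CDS line carries both an `ID` and a single `Parent` attribute, and it relies on the GFF3 convention that lines sharing an ID are segments of one discontinuous feature.

```python
from collections import defaultdict

# Hypothetical GFF3 fragment: one mRNA carrying two CDSs,
# an upstream ORF (one segment) and the main ORF (two segments).
GFF3 = """\
chr1\texample\tmRNA\t1\t1000\t.\t+\t.\tID=tx1
chr1\texample\tCDS\t50\t109\t.\t+\t0\tID=tx1.uorf;Parent=tx1
chr1\texample\tCDS\t200\t400\t.\t+\t0\tID=tx1.main;Parent=tx1
chr1\texample\tCDS\t500\t700\t.\t+\t0\tID=tx1.main;Parent=tx1
"""


def parse_attributes(field):
    """Turn a GFF3 attribute column ("key=value;key=value") into a dict."""
    return dict(pair.split("=", 1) for pair in field.strip().split(";") if pair)


def group_cds_by_id(gff3_text):
    """Group CDS segments by (Parent, ID) instead of by Parent alone.

    Assumes each CDS line has an ID and exactly one Parent.
    Returns {(transcript_id, cds_id): [(start, end), ...]} with sorted segments.
    """
    groups = defaultdict(list)
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        key = (attrs["Parent"], attrs["ID"])
        groups[key].append((int(cols[3]), int(cols[4])))
    return {key: sorted(segments) for key, segments in groups.items()}


if __name__ == "__main__":
    for (transcript_id, cds_id), segments in group_cds_by_id(GFF3).items():
        # Each (transcript, CDS ID) pair would become its own FASTA record.
        print(f">{cds_id} transcript={transcript_id} segments={segments}")
```

Grouping by `Parent` alone would pool all three CDS lines into a single sequence for `tx1`. Keying on the CDS ID as well keeps the two ORFs separate while still recording that they come from the same transcript.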

Describe alternatives you've considered

Today, it's possible to define a unique mRNA feature for each CDS, similar to the solution described here. It's a hacky workaround that fails to show that the CDSs belong to the same transcript.