agat_sp_extract_sequences.pl extract all CDS on gene level

fuesseler commented 3 months ago

Is your feature request related to a problem? Please describe.

I have been trying to use agat_sp_extract_sequences.pl to extract all CDS across the several transcripts of a single gene, but as far as I can see there is no direct way to achieve this with the current options of this command in AGAT.

Describe the solution you'd like

I would like to be able to specify that CDS are extracted at the gene level rather than only at the transcript level. Otherwise, if you have a suggestion for how to "hack" this problem (or if there are reasons why this would generally be a bad or problematic idea), I would be grateful!

Describe alternatives you've considered

I considered extracting the CDS separately (using --split) for each transcript of a gene and concatenating them, while somehow purging the CDS that are shared between transcripts ...

Additional context

I want to do this because in the next steps I want to determine orthologues (with OMA) and then generate MSAs. Currently, I am running into the problem that OMA very often groups together transcripts of a gene whose CDS diverge at the beginning or end, which then leads to alignment issues. So the thought was that if all CDS from all transcripts of a gene are in the input FASTA files, these misalignments should get resolved.

Grateful if you have any ideas about this!

Juke34 commented 2 months ago

There is no option to collapse isoforms into a chimeric transcript. The best way to achieve this is to use bedtools: https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
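For readers who want to see what collapsing isoforms amounts to, here is a minimal Python sketch of the interval-union step. It is not AGAT functionality and it does not use bedtools. It assumes a GFF3 file in which transcripts are typed mRNA or transcript and each CDS names its transcript(s) in Parent; the function names merge_intervals and collapse_cds_per_gene are hypothetical, chosen only for this illustration. Strand and sequence ID are ignored, so a gene is assumed to sit on one sequence.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def collapse_cds_per_gene(gff_lines):
    """Return {gene_id: [(start, end), ...]}, the union of CDS over all transcripts.

    Assumption: GFF3 with ID/Parent attributes; not an AGAT or bedtools API.
    """
    tx_to_gene = {}
    cds_by_tx = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if cols[2] in ("mRNA", "transcript") and "ID" in attrs and "Parent" in attrs:
            tx_to_gene[attrs["ID"]] = attrs["Parent"]
        elif cols[2] == "CDS" and "Parent" in attrs:
            # A CDS may list several parent transcripts, separated by commas.
            for tx in attrs["Parent"].split(","):
                cds_by_tx[tx].append((int(cols[3]), int(cols[4])))

    # Second step: group CDS by gene, so feature order in the file does not matter.
    cds_by_gene = defaultdict(list)
    for tx, intervals in cds_by_tx.items():
        if tx in tx_to_gene:
            cds_by_gene[tx_to_gene[tx]].extend(intervals)
    return {gene: merge_intervals(iv) for gene, iv in cds_by_gene.items()}
```

Calling collapse_cds_per_gene(open("annotation.gff")) (the file name is a placeholder) gives one merged interval list per gene, which could then be used to pull sequence from the genome. Note that a union of CDS from different isoforms is not guaranteed to preserve a single reading frame, so the resulting chimeric sequence may not translate cleanly.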

fuesseler commented 2 months ago

Thanks for confirming this isn't currently possible with AGAT! In the end, rather than constructing a chimeric transcript, I opted to handle the issue further downstream (by using HmmCleaner to clean my alignments of falsely aligned, non-homologous exonic regions).

NBISweden / AGAT, issue #474