NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

agat_sp_extract_sequences.pl does not incorporate CDS feature ID in headers #450

Open jdcla opened 2 months ago

jdcla commented 2 months ago

Describe the bug: According to the documentation, the headers created by the script are formatted as:

ID gene=gene_ID name=NAME seq_id=Chromosome_ID type=cds 5'extra=VALUE

However, when this script is used to extract the sequences of CDS features, the header IDs contain the ID of the mRNA feature rather than that of the selected CDS feature.

e.g.
>transcript:ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds
instead of
>CDS:ENSP00000431562 gene=gene:ENSG00000182378 seq_id=X type=cds

General (please complete the following information): AGAT v1.4, Singularity, Ubuntu Linux

To Reproduce: Run the script on any GFF3 file whose CDS features carry an ID attribute.

E.g., using https://ftp.ensembl.org/pub/release-111/gff3/homo_sapiens/:

agat_sp_extract_sequences.pl -g Homo_sapiens.GRCh38.110.gff3 -f Homo_sapiens.GRCh38.dna.primary_assembly.fa -o cdss.fa -t cds

Expected behavior: Use the CDS ID in the header rather than the transcript/mRNA ID.

Additional context: Somewhat off-topic, but I was trying to apply this tool to GFF3 files containing multiple CDS IDs per mRNA (multicistronic). It seems this is currently not supported.

Juke34 commented 2 months ago

Sounds fair. I would suggest keeping the transcript ID as well, because with isoforms it would otherwise be difficult to tell which transcript the CDS comes from:

>CDS:ENSP00000431562 transcript=ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds
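To make the proposed header layout concrete, here is a minimal Python sketch of how such a header could be assembled. It is not AGAT's code (AGAT is written in Perl); the function name `build_header` and its parameters are hypothetical and chosen only for illustration. It assumes the CDS ID should lead the header when one is present, with a fallback to the transcript ID otherwise.

```python
def build_header(transcript_id, gene_id, seq_id, cds_id=None, feature_type="cds"):
    """Assemble a FASTA header in the layout proposed in this thread.

    Hypothetical helper, not part of AGAT. If the CDS feature has its own
    ID, that ID leads the header and the transcript ID is kept as a
    key=value field; otherwise the transcript ID leads the header.
    """
    if cds_id:
        fields = [cds_id, f"transcript={transcript_id}"]
    else:
        fields = [transcript_id]
    fields += [f"gene={gene_id}", f"seq_id={seq_id}", f"type={feature_type}"]
    return ">" + " ".join(fields)


# Identifiers taken from the example in this issue.
header = build_header(
    transcript_id="ENST00000399012",
    gene_id="gene:ENSG00000182378",
    seq_id="X",
    cds_id="CDS:ENSP00000431562",
)
print(header)
```

Keeping the transcript as a `key=value` field means downstream parsers can still recover the isoform even though the leading identifier has changed.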

As for the multicistronic problem, this has never been taken into account... Please open another issue so that it can be discussed (at the very least, other users will then be aware of this AGAT limitation).

Juke34 commented 2 months ago

CDS chunks may share the same identifier. In that case, how do we differentiate the extracted CDS chunks from one another? I guess we should add the chunk number, or something like that, to the description. What do you think, @jdcla?

jdcla commented 1 month ago

I'm not entirely sure what exactly "CDS chunks" refers to. Are you referring to the parts of a CDS that lie on different exons?

Juke34 commented 1 month ago

Yes.
A CDS is a single feature that can span multiple genomic locations (in the case of multi-exon genes). So several CDS features (lines in the GFF) may be needed to make up one biological CDS feature.
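The relationship between GFF lines and one biological CDS can be sketched as follows. This is a simplified Python illustration, not AGAT's implementation: the helper names and the toy GFF3 lines (IDs, coordinates, source) are made up. It assumes CDS lines that share an `ID` attribute (or, failing that, a `Parent`) belong to the same CDS, and it orders chunks by start coordinate only, ignoring strand.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse the GFF3 attributes column (key=value pairs separated by ';')."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def group_cds_chunks(gff_lines):
    """Group CDS lines by ID (falling back to Parent) into sorted chunk lists."""
    groups = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only CDS feature lines are of interest here
        attrs = parse_attributes(cols[8])
        key = attrs.get("ID") or attrs.get("Parent")
        if key is None:
            continue
        groups[key].append((int(cols[3]), int(cols[4])))
    # Sort by start coordinate; a real tool would also account for strand.
    return {key: sorted(chunks) for key, chunks in groups.items()}


# Toy data: two CDS lines sharing one ID, plus an exon line that is ignored.
example = [
    "chr1\ttoy\tCDS\t300\t400\t.\t+\t0\tID=cds1;Parent=mrna1",
    "chr1\ttoy\tCDS\t100\t200\t.\t+\t0\tID=cds1;Parent=mrna1",
    "chr1\ttoy\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=mrna1",
]

for cds_id, chunks in group_cds_chunks(example).items():
    for number, (start, end) in enumerate(chunks, start=1):
        print(f"{cds_id} chunk={number} {start}-{end}")
```

Numbering the chunks this way is one possible answer to the question of how to tell apart extracted chunks that share an identifier.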

jdcla commented 1 month ago

Ok. I'm not familiar enough with annotation formats and conventions to know what the best approach is in case it's important to list what chunks a CDS is constructed from. I was simply thinking that it would make sense to list the identifier used for the CDS in the header of the fasta file if these are present in the gff file.