Open jdcla opened 7 months ago
Sounds fair. I would suggest keeping the transcript ID, because in the case of isoforms it would otherwise be difficult to tell which transcript the CDS comes from:
>CDS:ENSP00000431562 transcript=ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds
As for the multicistronic problem, this has never been taken into account... Please open another issue so that it can be discussed (at the very least, other users will then be aware of this AGAT limitation).
CDS chunks may share the same identifier. In that case, how do we differentiate the extracted CDS chunks? I guess we should add the chunk number, or something like that, to the description. What do you think, @jdcla?
I'm not entirely sure what exactly "CDS chunks" refers to. Do you mean the parts of a CDS that lie on different exons?
Yes.
A CDS is a single feature that can span multiple genomic locations (in the case of multi-exon genes). So several CDS features (lines in the GFF) may be needed to build the biological CDS feature.
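To make the "chunks" concrete, here is a minimal Python sketch (not AGAT's implementation, which is written in Perl) that groups CDS lines sharing the same ID and numbers them. The GFF3 excerpt, the IDs `CDS:P1` and `transcript:T1`, and the helper names `parse_attributes` and `group_cds_chunks` are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical GFF3 excerpt: IDs and coordinates are invented for illustration.
GFF3_LINES = """\
X\tdemo\tCDS\t100\t200\t.\t+\t0\tID=CDS:P1;Parent=transcript:T1
X\tdemo\tCDS\t300\t380\t.\t+\t1\tID=CDS:P1;Parent=transcript:T1
X\tdemo\tCDS\t500\t560\t.\t+\t1\tID=CDS:P1;Parent=transcript:T1
""".splitlines()


def parse_attributes(field):
    """Turn 'ID=a;Parent=b' into {'ID': 'a', 'Parent': 'b'}."""
    return dict(pair.split("=", 1) for pair in field.strip().split(";") if pair)


def group_cds_chunks(lines):
    """Map each CDS ID to its (start, end) chunks, sorted by genomic start."""
    chunks = defaultdict(list)
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only CDS feature lines are of interest here
        attrs = parse_attributes(cols[8])
        chunks[attrs["ID"]].append((int(cols[3]), int(cols[4])))
    return {cds_id: sorted(parts) for cds_id, parts in chunks.items()}


grouped = group_cds_chunks(GFF3_LINES)
for cds_id, parts in grouped.items():
    for number, (start, end) in enumerate(parts, start=1):
        print(f"{cds_id} chunk={number}/{len(parts)} {start}-{end}")
```

Here three GFF lines carry the same ID and together form one biological CDS. The chunks are numbered in genomic order; for a minus-strand feature the biological order would be the reverse.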
Ok. I'm not familiar enough with annotation formats and conventions to know what the best approach is in case it's important to list what chunks a CDS is constructed from. I was simply thinking that it would make sense to list the identifier used for the CDS in the header of the fasta file if these are present in the gff file.
Describe the bug According to the documentation, the headers created by the script are formatted:
However, when applying this script to extract the sequences of CDS features, the header IDs contain the ID of the mRNA feature rather than that of the selected CDS feature.
e.g.
>transcript:ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds
instead of
>CDS:ENSP00000431562 gene=gene:ENSG00000182378 seq_id=X type=cds
General (please complete the following information): v1.4, Singularity, Ubuntu Linux
To Reproduce Simply run the script on any gff3 file whose CDS features have an ID in their attributes column.
E.g., using https://ftp.ensembl.org/pub/release-111/gff3/homo_sapiens/.
agat_sp_extract_sequences.pl -g Homo_sapiens.GRCh38.110.gff3 -f Homo_sapiens.GRCh38.dna.primary_assembly.fa -o cdss.fa -t cds
Expected behavior Use the CDS ID in the header rather than the transcript/mRNA ID.
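As a rough illustration of the requested header (a sketch only, not AGAT's actual code, which is written in Perl), the logic could look like the Python below. The function name `build_header` and its parameters are hypothetical. It puts the CDS ID first when one is available, keeps the transcript ID as a `transcript=` field, and falls back to the transcript ID otherwise.

```python
def build_header(transcript_id, gene_id, seq_id, cds_id=None, feature_type="cds"):
    """Build a FASTA header line; hypothetical helper, not part of AGAT."""
    if cds_id:
        # Prefer the CDS ID, but keep the transcript so isoforms stay traceable.
        fields = [f">{cds_id}", f"transcript={transcript_id}"]
    else:
        # No ID on the CDS feature: fall back to the transcript ID.
        fields = [f">{transcript_id}"]
    fields += [f"gene={gene_id}", f"seq_id={seq_id}", f"type={feature_type}"]
    return " ".join(fields)


with_cds = build_header(
    "ENST00000399012", "gene:ENSG00000182378", "X", cds_id="CDS:ENSP00000431562"
)
without_cds = build_header("transcript:ENST00000399012", "gene:ENSG00000182378", "X")
print(with_cds)
print(without_cds)
```

The fallback branch reproduces the header the script currently emits, so files without CDS IDs would be unaffected.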
Additional context Somewhat off-topic, but I was trying to apply this tool to gff3 files containing multiple CDS IDs per mRNA (multicistronic). It seems this is currently not supported.