Open LliliansCalvo opened 6 months ago
It sounds the code involved in your issue has been updated after v1.0.0. Could you give a try with the latest version (v1.4.0)? You may go more straithgforward:
agat config --expose --output_format GTF
agat_sp_merge_annotations.pl --gff agat_braker.gff3.gtf --gff agat_replaced_chromosomes_evmodPasaWccl_w_orthologs.gtf --out agat_result.gtf
The sanitization is made by all script with _sp_
prefix, so no need to use agat_convert_sp_gxf2gxf.pl excepted if you want to keep track of the intermediate sanitized files (before to be merged).
@Juke34: I hit the same issue with AGAT 1.3.3. I'll try reproducing with the latest version and will report back.
We are still getting an error when trying to build a custom Cell Ranger index based on the AGAT-merged GTF, even when using v.1.4.0. I'll try to provide more details tomorrow.
Could you share a sample of the data used?
I suspect the problematic entries come from this GFF file which defines some pairs of genes with the exact same coordinates and structure but different IDs.
Following this post in parallel https://www.biostars.org/p/9601455/#9601546 I see two potential issues: AGAT creating mRNA while transcripts is expected And in our case hereI think it is related to multiple values attributes.
To avoid to have mRNA instead of transcript in the output you should avoid converting in "relax" mode that allow all type of feature type in column 3.
To avoid this problem adapt the AGAT config file using agat config --expose --output_format GTF --gtf_output_version 3
I need the merged annotation in both, GTF and GFF format (for different downstream tools). Currently I am running agat_sp_merge_annotations.pl
with a config that contains
output_format: GFF
gff_output_version: 3
followed by agat_convert_sp_gff2gtf.pl
with a config that contains
gtf_output_version: 3
How would you recommend to change that and what makes you think the mrna/transcript issue is the problem when clearly I get entries containing [...]; gene_id "name1" "name2"; [...]
and CellRanger complains about quotation within tag values?
There are probably two issues. One is the fact that you have multiple value attributes like gene_id "name1" "name2"; I should check that it is really problematic before adding a trick into AGAT. And the mRNA feature that supposed to be transcript.
Could you send me over the fasta file corresponding to the GFF file you provided?
I am not sure which FASTA file you refer to. I have two annotations, the GFF one I linked above and one GTF I am not at liberty to share that I merge into a GFF and convert to GTF in a second step, bth using AGAT as described above. AFAICT, the problematic entries result from the public annotation, so you should be able to reproduce the issue without our confidential one.
I should check that it is really problematic before adding a trick into AGAT.
The Cell Ranger error message is pretty clear on that since it specifically mentions unexpected quotation withing the value.
@Juke34: Is there anything I can do to troubleshoot this further or do I need to give up on using AGAT and find an alternative approach to generate my GTF?
@LliliansCalvo: Would you mind sharing your post-processing script that enabled you to feed AGAT's output into Cell Ranger?
@mschilli87 I can easily add patch to avoid gene_id attribute to have several values but I do need the GFF sample that provided the
JAPIVC010000802.1 . gene 525047 606909 . + . gene_id "evm.model.ctg61.39_evm" "evm.model.ctg61.40"; transcript_id "evm.model.ctg61.39_evm.model.ctg61" "evm.model.ctg61.40.1.5d03f9f9"; ID "evm.model.ctg61.39_evm"; gene_name "evm.model.ctg61.39_evm"; ortholog_ID "TITIN_DROME" "E2AEH5_CAMFO";
GTF output.
@Juke34: I am not sure what else I need to provide for you fix this. AFAICT, the GFF I already linked above is causing the quoted attributes cellranger complains about.
I wanted the fasta to make some test directly with cell ranger. I can share a patch with you that you can try otherwise
@mschilli87 Could you try the code from branch 457 ? And let me know if the error left?
Thank you. Will test and let you know. Else I'll track down the FASTA and share a public link.
I am still seing some multi-value tags (incl. tag
, gene_name
, & transcript_name
) with this version (I do get a lot of the 'keeping only the first one' messages though).
edit: Could this be related to me having agat_sp_merge_annotation.pl
write GFFe output and converting this GFF3 file to GTF for CellRanger using agat_convert_sp_gff2gtf.pl
(from the same branch!) in a second step?
Yes I remove double attributes only for gene_id
and transcript_id
as it is the only sensible attributes to get a consistent file.
P.S: Merging annotation should merge 2 isoforms in one only and only if they are 100% identical. But multi-values is not something problematic in general, except for ID/Parent. attributes in GFF and gene_id, transcript_id where it should be forbidden to keep file consistency.
Does it work with cellranger?
@Juke34:
Does it work with cellranger?
No. It seems to be throwing off there parser. A line with double tag
value triggers an error about unexpected quotes in gene_name
, even though only one value is given in that particular field.
@Juke34
Yes I remove double attributes only for gene_id and transcript_id as it is the only sensible attributes to get a consistent file.
Also, would it be possible to preserve that information somewhere? In our case, one gene_id might be an ENSEMBL Gene ID while the other is just a random number. While I don't expect AGAT to pick the more useful one, it would be great to be able to check if the random number ID masks a well know gene if it comes up as interesting in a downstream analysis. Maybe an altarnative IDs attribute or at least mentioning the IDs in a parsable way in the warning messages. This would also make it easier now to quantify the problem and trace some of the cases upstream to figure out why some genes have shadow copies in that annotation.
In the current code it is saved in agat_other_gene_id
and agat_other_transcript_id
attribute accordingly.
I still don't get how you end up with several values in gene_id and transcript_id, I should definitely fix that if it is AGAT that makes this mistake.
Now that I have a better understanding how to identify these cases I can trace them back more easily and potentially provide you with a more minimal breaking input.
OK. Here is an update: In the GFF I linked above there are records with multiple tags. You'll find lines ending in
...; tag "basic"; tag "Ensembl_canonical"
After merging them with our annotation into GFF3, those lines are reformatted by AGAT to contain
...;tag=basic,Ensembl_canonical;...
From there we use AGAT fto convert the GFF3 to GTF where we get
...; tag "basic" "Ensembl_canonical"; ...
Which seems to be throwing off CellRanger:
[error] mkref has failed: error building reference package
Error while parsing GTF file <path-to-GTF-file>
Error parsing GTF at line 8. Parsed attribute had a quote in the middle of a value. Please ensure quotes are only used to encapsulate attribute values.
Bad Attribute Value = transcript_name
Note that the erro points towards the transcript_name
attribute which is valid in line 8:
chrM AGAT gene 577 647 . + . gene_id "ENSG00000210049.1"; transcript_id "ENST00000387314.1"; ID "ENSG00000210049.1"; gene_name "MT-TF"; gene_type "misc_RNA"; hgnc_id "HGNC:7481"; level "3"; tag "basic" "Ensembl_canonical"; transcript_name "MT-TF-201"; transcript_support_level "NA"; transcript_type "misc_RNA";
However, when substituting ; tag "basic" "Ensembl_canonical";
with ;tag "basic"; tag "Ensembl_canonical";
, CellRanger accepts this line and complains about another one further down the line.
Regarding the issue with several IDs values, I am still investigating and hope this turns out to be an error on my end somehow.
Update 2: I definitely still get multiple gene IDs after merging.
Input GFF:
chrM ENSEMBL transcript 3230 3304 . + . gene_id "26266"; transcript_id "26266"; gene_type "misc_RNA"; gene_name "26266"; transcript_type "misc_RNA"; transcript_name "26266"; level 3;
chrM ENSEMBL exon 3230 3304 . + . gene_id "26266"; transcript_id "26266"; gene_type "misc_RNA"; gene_name "26266"; transcript_type "misc_RNA"; transcript_name "26266"; level 3;
chrM ENSEMBL transcript 3230 3304 . + . gene_id "ENSG00000209082.1"; transcript_id "ENST00000386347.1"; gene_type "misc_RNA"; gene_name "MT-TL1"; transcript_type "misc_RNA"; transcript_name "MT-TL1-201"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:7490"; tag "basic"; tag "Ensembl_canonical";
chrM ENSEMBL exon 3230 3304 . + . gene_id "ENSG00000209082.1"; transcript_id "ENST00000386347.1"; gene_type "misc_RNA"; gene_name "MT-TL1"; transcript_type "misc_RNA"; transcript_name "MT-TL1-201"; exon_number 1; exon_id "ENSE00002006242.1"; level 3; transcript_support_level "NA"; hgnc_id "HGNC:7490"; tag "basic"; tag "Ensembl_canonical";
Output GFF3:
chrM AGAT gene 3230 3304 . + . ID=agat-gene-295;gene_id=26266,ENSG00000209082.1;gene_name=26266,MT-TL1;gene_type=misc_RNA;hgnc_id=HGNC:7490;level=3;tag=basic,Ensembl_canonical;transcript_id=26266,ENST00000386347.1;transcript_name=26266,MT-TL1-201;transcript_support_level=NA;transcript_type=misc_RNA
chrM ENSEMBL transcript 3230 3304 . + . ID=26266;Parent=agat-gene-295;gene_id=26266;gene_name=26266;gene_type=misc_RNA;level=3;merged_ID=ENST00000386347.1;merged_Parent=ENSG00000209082.1;merged_gene_id=ENSG00000209082.1;merged_gene_name=MT-TL1;merged_gene_type=misc_RNA;merged_hgnc_id=HGNC:7490;merged_level=3;merged_tag=basic,Ensembl_canonical;merged_transcript_id=ENST00000386347.1;merged_transcript_name=MT-TL1-201;merged_transcript_support_level=NA;merged_transcript_type=misc_RNA;transcript_id=26266;transcript_name=26266;transcript_type=misc_RNA
Thank you, this helps a lot. I will see what I can don when I will be back from vacation.
agat_sp_merge_annotations.pl output incompatible with cellranger I have two gff files I want to merge. One is an old annotation, and the other is a new annotation I have just made using braker3.
In order to merge them i am using agat_1.0.0 in the singularity container. Here is the code with all the steps I have done:
As you can see agat generates 2 gene_id and 2 transcript_id for this transcript. I have fixed manually this one but this also happens for other genes. Hope you can help ! Thanks !!