NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

agat_sp_merge_annotations.pl output incompatible with cellranger #457

Closed LliliansCalvo closed 1 week ago

LliliansCalvo commented 1 month ago

I have two GFF files I want to merge. One is an old annotation, and the other is a new annotation I have just made using BRAKER3.

To merge them I am using AGAT 1.0.0 in a Singularity container. Here is the code with all the steps I have done:

# Sanitize both inputs with agat (gxf2gxf writes GFF3 by default, despite the .gtf output names)
singularity exec agat_1.0.0--pl5321hdfd78af_0.sif agat_convert_sp_gxf2gxf.pl --gff braker.gff3 -o agat_braker.gff3.gtf

singularity exec agat_1.0.0--pl5321hdfd78af_0.sif agat_convert_sp_gxf2gxf.pl --gff replaced_chromosomes_evmodPasaWccl_w_orthologs.gtf -o agat_replaced_chromosomes_evmodPasaWccl_w_orthologs.gtf

# Merge the two files

singularity exec agat_1.0.0--pl5321hdfd78af_0.sif agat_sp_merge_annotations.pl --gff agat_braker.gff3.gtf --gff agat_replaced_chromosomes_evmodPasaWccl_w_orthologs.gtf --out agat_sanitized_sp_merge_braker3_old_annotation

head agat_sanitized_sp_merge_braker3_old_annotation
##gff-version 3
JAPIVC010000919.1   .   gene    556 8241    .   +   .   ID=evm.model.ctg717.1;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1   .   mRNA    556 8241    .   +   .   ID=nbis-mrna-11951;Parent=evm.model.ctg717.1;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1   .   exon    556 1010    .   +   .   ID=exon-92011;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1   .   exon    1430    1864    .   +   .   ID=exon-92012;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1   .   exon    2267    2447    .   +   .   ID=exon-92013;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1   .   exon    3334    3475    .   +   .   ID=exon-92014;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1   .   exon    5986    6266    .   +   .   ID=exon-92015;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1   .   exon    6682    6893    .   +   .   ID=exon-92016;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
JAPIVC010000919.1   .   exon    7047    7328    .   +   .   ID=exon-92017;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1

# Doesn't work!
cellranger mkref --genome=C_fellah --fasta=mod_GCA_030586385.1_ASM3058638v1_genomic.fna --genes=agat_sanitized_sp_merge_braker3_old_annotation

[error] mkref has failed: error building reference package
Error while parsing GTF file agat_sanitized_sp_merge_braker3_old_annotation
Property 'transcript_id' not found in GTF line 4: JAPIVC010000919.1 .   exon    556 1010    .   +   .   ID=exon-92011;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1

awk 'NR==4' agat_sanitized_sp_merge_braker3_old_annotation
JAPIVC010000919.1   .   exon    556 1010    .   +   .   ID=exon-92011;Parent=nbis-mrna-11951;gene_id=evm.model.ctg717.1;ortholog_ID=E2AHL2_CAMFO;transcript_id=evm.model.ctg717.1
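
The error above is consistent with the merged file still being GFF3: its header is `##gff-version 3` and column 9 uses `key=value` pairs, whereas a GTF parser expects `key "value";` pairs. A minimal sketch of a check to run before cellranger, assuming a tab-separated file; the function name `attr_style` is made up for illustration and is not part of AGAT or cellranger:

```shell
# attr_style FILE: hypothetical helper that prints "gtf", "gff3" or "unknown"
# depending on how column 9 of the first feature line is written.
attr_style() {
  awk -F'\t' '
    /^#/ || NF < 9    { next }                               # skip headers and short lines
    $9 ~ /^[^ =;]+ "/ { print "gtf";  found = 1; exit }      # key "value";
    $9 ~ /^[^ =;]+=/  { print "gff3"; found = 1; exit }      # key=value;
                      { print "unknown"; found = 1; exit }   # neither pattern matched
    END               { if (!found) print "unknown" }        # no feature line at all
  ' "$1"
}
```

Given the `head` output shown earlier, `attr_style agat_sanitized_sp_merge_braker3_old_annotation` would be expected to report `gff3`, which would explain why `transcript_id` is not recognised even though it is present on the line.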

# To try and solve this I then did:

singularity exec agat_1.0.0--pl5321hdfd78af_0.sif agat_sp_manage_attributes.pl --gff agat_sanitized_sp_merge_braker3_old_annotation --att gene_id/gene_name --cp -o step1_agat_sanitized_sp_merge_braker3_old_annotation

singularity exec agat_1.0.0--pl5321hdfd78af_0.sif agat_convert_sp_gff2gtf.pl --gff step1_agat_sanitized_sp_merge_braker3_old_annotation -o step2_agat_sanitized_sp_merge_braker3_old_annotation

head step2_agat_sanitized_sp_merge_braker3_old_annotation
##gtf-version 3
JAPIVC010000296.1   .   gene    4567    9586    .   -   .   gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "evm.model.ctg150.1"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1   .   transcript  4567    9586    .   -   .   gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "nbis-mrna-12890"; Parent "evm.model.ctg150.1"; gene_name "evm.model.ctg150.1"; original_biotype "mrna"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1   .   exon    4567    4848    .   -   .   gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41546"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1   .   exon    4966    5096    .   -   .   gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41545"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1   .   exon    5166    5284    .   -   .   gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41544"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1   .   exon    5380    5504    .   -   .   gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41543"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1   .   exon    5858    6127    .   -   .   gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41542"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1   .   exon    7027    7154    .   -   .   gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41541"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";
JAPIVC010000296.1   .   exon    7246    7642    .   -   .   gene_id "evm.model.ctg150.1"; transcript_id "evm.model.ctg150.1"; ID "exon-41540"; Parent "nbis-mrna-12890"; gene_name "evm.model.ctg150.1"; ortholog_ID "E2AZM0_CAMFO";

# Run cellranger again
cellranger mkref --genome=C_fellah --fasta=mod_GCA_030586385.1_ASM3058638v1_genomic.fna --genes=step2_agat_sanitized_sp_merge_braker3_old_annotation

[error] mkref has failed: error building reference package
Error while parsing GTF file step2_agat_sanitized_sp_merge_braker3_old_annotation
Error parsing GTF at line 8024.  Parsed attribute had a quote in the middle of a value.  Please ensure quotes are only used to encapsulate attribute values.
 Bad Attribute Value = transcript_id 

awk 'NR==8024' step2_agat_sanitized_sp_merge_braker3_old_annotation
JAPIVC010000802.1   .   gene    525047  606909  .   +   .   gene_id "evm.model.ctg61.39_evm" "evm.model.ctg61.40"; transcript_id "evm.model.ctg61.39_evm.model.ctg61" "evm.model.ctg61.40.1.5d03f9f9"; ID "evm.model.ctg61.39_evm"; gene_name "evm.model.ctg61.39_evm"; ortholog_ID "TITIN_DROME" "E2AEH5_CAMFO";

As you can see, AGAT generates two gene_id values and two transcript_id values for this gene. I fixed this one manually, but the same thing happens for other genes. I hope you can help. Thanks!
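
Since the same problem affects other genes, it may help to list every affected line rather than fixing them one at a time. A rough sketch, assuming a tab-separated GTF whose attribute values contain no `;`; the function name `multi_valued` is hypothetical and only locates the lines, it does not repair them:

```shell
# multi_valued FILE: hypothetical helper printing "line_number<TAB>key" for every
# GTF attribute that carries more than one quoted value, e.g. gene_id "a" "b";
multi_valued() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }                  # skip headers and short lines
    {
      n = split($9, attrs, ";")              # assumes no ";" inside values
      for (i = 1; i <= n; i++) {
        a = attrs[i]
        sub(/^ +/, "", a)                    # trim leading blanks
        if (a ~ /^[^ ]+ +"[^"]*" +"/) {      # key "v1" "v2" ...
          split(a, kv, " ")
          print NR "\t" kv[1]
        }
      }
    }
  ' "$1"
}
```

For example, `multi_valued step2_agat_sanitized_sp_merge_braker3_old_annotation | cut -f1 | sort -un | wc -l` would count the distinct lines that need attention.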

Juke34 commented 4 weeks ago

It sounds like the code involved in your issue has been updated since v1.0.0. Could you try the latest version (v1.4.0)? You can also do this more directly:

agat config --expose --output_format GTF
agat_sp_merge_annotations.pl --gff agat_braker.gff3.gtf --gff agat_replaced_chromosomes_evmodPasaWccl_w_orthologs.gtf --out agat_result.gtf

Sanitization is performed by every script with the _sp_ prefix, so there is no need to run agat_convert_sp_gxf2gxf.pl first, unless you want to keep the intermediate sanitized files (before they are merged).