NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

AGAT combine with high duplicated BUSCOs #407

Closed by ld9866 11 months ago

ld9866 commented 1 year ago

Dear developer: We annotated the genome using BRAKER and LIFTOVER separately. We first used agat_sp_ensembl_output_style.pl to convert the annotations into the same format, and then used agat_sp_merge_annotations.pl. However, the BUSCO evaluation suggests that duplicated annotations may have been introduced. How can we fix this?

LIFTOVER: Results:

C:93.7%[S:39.3%,D:54.4%],F:2.5%,M:3.8%,n:3354
3145    Complete BUSCOs (C)
1319    Complete and single-copy BUSCOs (S)
1826    Complete and duplicated BUSCOs (D)
85  Fragmented BUSCOs (F)
124 Missing BUSCOs (M)
3354    Total BUSCO groups searched

LIFTOVER + BRAKER: Results:

C:98.6%[S:16.0%,D:82.6%],F:0.8%,M:0.6%,n:3354      
3305    Complete BUSCOs (C)            
535 Complete and single-copy BUSCOs (S)    
2770    Complete and duplicated BUSCOs (D)     
28  Fragmented BUSCOs (F)              
21  Missing BUSCOs (M)             
3354    Total BUSCO groups searched

Juke34 commented 1 year ago

When running BUSCO on an annotation, it does not know which gene each protein comes from. So if you have an annotation where 1 gene = 1 mRNA = 1 protein, the duplicate count represents duplicated genes. But nowadays most annotation tools infer isoforms: 1 gene = several mRNAs = several proteins. In that case it is not possible to tell whether BUSCO duplicates come from gene duplication or from isoforms. When running agat_sp_merge_annotations.pl, if two loci from two annotations have overlapping CDS (on the same strand) and an mRNA is unique compared to the other mRNAs of the same locus, AGAT will merge the loci and create mRNA isoforms accordingly. If you want a realistic view of BUSCO duplicates, you should first filter the annotation to keep only one representative per locus with agat_sp_keep_longest_isoform.pl, before running BUSCO.
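To make the isoform effect concrete, here is a minimal toy sketch in Python. It is not how BUSCO or AGAT are implemented: the protein records, gene IDs, and BUSCO IDs are invented for illustration, and "keep the longest protein per gene" only mimics the idea behind agat_sp_keep_longest_isoform.pl.

```python
from collections import defaultdict

# Hypothetical toy data: (protein_id, gene_id, protein_length, matched BUSCO or None)
PROTEINS = [
    ("g1.t1", "g1", 300, "BUSCO_A"),
    ("g1.t2", "g1", 280, "BUSCO_A"),  # second isoform of gene g1
    ("g2.t1", "g2", 150, "BUSCO_B"),
    ("g3.t1", "g3", 400, "BUSCO_C"),
    ("g4.t1", "g4", 410, "BUSCO_C"),  # a genuinely separate gene copy
]


def classify(proteins):
    """Count hits per BUSCO at the protein level, ignoring gene membership."""
    hits = defaultdict(int)
    for _protein_id, _gene_id, _length, busco in proteins:
        if busco is not None:
            hits[busco] += 1
    single = sorted(b for b, n in hits.items() if n == 1)
    duplicated = sorted(b for b, n in hits.items() if n > 1)
    return single, duplicated


def keep_longest_per_gene(proteins):
    """Keep one representative protein per gene: the longest one."""
    best = {}
    for record in proteins:
        gene_id, length = record[1], record[2]
        if gene_id not in best or length > best[gene_id][2]:
            best[gene_id] = record
    return list(best.values())


print(classify(PROTEINS))                         # (['BUSCO_B'], ['BUSCO_A', 'BUSCO_C'])
print(classify(keep_longest_per_gene(PROTEINS)))  # (['BUSCO_A', 'BUSCO_B'], ['BUSCO_C'])
```

In this toy case BUSCO_A looks duplicated only because gene g1 has two isoforms, while BUSCO_C stays duplicated after filtering because it is hit by two different genes.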

zuodabin commented 1 month ago

Can I use agat_sp_merge_annotations.pl to directly combine GFF files from BRAKER, miniprot, and TransDecoder to produce the final annotation for a genome assembly, without needing MAKER, EVM, or other integration software? I am currently experiencing the same problem as above: high duplication in BUSCO after merging three GFF files with agat_sp_merge_annotations.pl.

Juke34 commented 1 month ago

Yes you can, but this script does not replace a tool like EVM, which can produce a new gene model in a locus based on the gene models provided for that locus, or select one gene model over another in a locus based on some logic. AGAT keeps everything, and will merge overlapping gene models into one, resulting in many isoforms in a given locus. Even without merging annotations you can end up with high duplication in BUSCO. This is related to the presence of isoforms. In some annotation tools you can activate or deactivate the discovery of isoforms. But with agat_sp_merge_annotations.pl you end up with isoforms anyway. To avoid high duplication in BUSCO, a common approach is to select only one isoform per locus. Within AGAT you can use agat_sp_keep_longest_isoform.pl.
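The "keep everything and merge overlapping models" behaviour can be sketched roughly as follows. This is a simplified illustration, not AGAT's actual algorithm: it compares whole-model coordinates rather than CDS, skips the check that each mRNA is unique, and uses invented coordinates and transcript IDs.

```python
from dataclasses import dataclass, field


@dataclass
class Locus:
    strand: str
    start: int
    end: int
    transcripts: list = field(default_factory=list)


def merge_annotations(*annotations):
    """Group same-strand, overlapping gene models into loci, keeping every transcript."""
    # Each model is a tuple: (strand, start, end, transcript_id).
    models = sorted(
        (model for annotation in annotations for model in annotation),
        key=lambda m: (m[0], m[1]),
    )
    loci = []
    for strand, start, end, transcript_id in models:
        last = loci[-1] if loci else None
        if last is not None and last.strand == strand and start <= last.end:
            # Overlaps the current locus: extend it and add one more isoform.
            last.end = max(last.end, end)
            last.transcripts.append(transcript_id)
        else:
            loci.append(Locus(strand, start, end, [transcript_id]))
    return loci


# Hypothetical gene models from three annotation sources.
braker = [("+", 100, 500, "braker_t1"), ("+", 2000, 2600, "braker_t2")]
miniprot = [("+", 150, 520, "miniprot_t1")]
transdecoder = [("+", 120, 480, "td_t1"), ("-", 300, 700, "td_t2")]

for locus in merge_annotations(braker, miniprot, transdecoder):
    print(locus.strand, locus.start, locus.end, locus.transcripts)
```

With these inputs the first locus ends up carrying three transcripts, one from each source, which is the kind of isoform inflation that BUSCO then reports as duplication.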

zuodabin commented 1 month ago

Hello author, thank you very much for your reply. I used agat_sp_keep_longest_isoform.pl to remove duplicates, but BUSCO completeness is now only 82%. Before using this script, BUSCO was 96% complete despite the high duplication. What should I do? What if I just use the output of agat_sp_merge_annotations.pl, combining the three GFF files, for downstream analysis?

Juke34 commented 1 month ago

That reflects that the longest isoform is not necessarily the best choice in a locus. Maybe one of the annotations you are merging is worse than the others and contains a lot of over-prediction (very long models merging several loci into one). You could run BUSCO on the different annotations independently and merge only the two best ones. Then you can "complement" with the worst one. Complement does not add isoforms to a locus but will add new loci not annotated in the reference annotation.
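For comparing independent BUSCO runs, a small helper that parses the one-line BUSCO summary can be handy. The sketch below uses the two summary lines posted at the top of this thread; the labels `liftover` and `merged` and the choice to rank by completeness (C) are assumptions made for illustration, not a recommendation from the AGAT author.

```python
import re

# Matches the compact BUSCO summary, e.g. C:98.6%[S:16.0%,D:82.6%],F:0.8%,M:0.6%,n:3354
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the BUSCO percentages (floats) and group count n (int) from a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["n"] = int(scores["n"])
    return scores


# Summary lines quoted from this thread, under hypothetical labels.
summaries = {
    "liftover": "C:93.7%[S:39.3%,D:54.4%],F:2.5%,M:3.8%,n:3354",
    "merged": "C:98.6%[S:16.0%,D:82.6%],F:0.8%,M:0.6%,n:3354",
}

scores = {name: parse_busco_summary(line) for name, line in summaries.items()}
ranked = sorted(scores, key=lambda name: scores[name]["C"], reverse=True)
for name in ranked:
    print(name, scores[name])
```

The same parsing would apply to one summary per input annotation (ideally after reducing each to one isoform per locus), so that the two best candidates can be picked before merging.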

zuodabin commented 1 month ago

Thanks for your reply. Following your suggestion, could I directly use agat_sp_complement_annotations.pl to do the merge? agat_sp_complement_annotations.pl --ref transcripts.fasta.transdecoder.genome.gff3 --add miniprot.gff3 --add braker.gff3 -o final.gff3

Juke34 commented 1 month ago

Yes, but it does not behave the same way. Have a look here: https://x.com/JacquesDainat/status/1770094882448646402/photo/1
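As a rough illustration of how "complement" differs from "merge", based only on the description earlier in this thread (new loci are added, isoforms are not): the sketch below is a toy model, not AGAT's code. It assumes overlap is judged on whole-model coordinates on the same strand and that several added annotations are applied one after another; all coordinates and IDs are invented.

```python
def overlaps(a, b):
    """True if two models (strand, start, end, id) share a strand and overlap."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def complement(reference, *additions):
    """Add only models falling in loci that the annotation built so far does not cover."""
    result = list(reference)
    for annotation in additions:
        existing = list(result)  # snapshot: compare against what was there before this file
        for model in annotation:
            if not any(overlaps(model, kept) for kept in existing):
                result.append(model)
    return result


# Hypothetical gene models; the first argument plays the role of the reference.
transdecoder = [("+", 100, 500, "td_t1")]
miniprot = [("+", 150, 520, "miniprot_t1"), ("+", 2000, 2600, "miniprot_t2")]
braker = [("+", 2100, 2500, "braker_t1"), ("-", 300, 700, "braker_t2")]

final = complement(transdecoder, miniprot, braker)
print([model[3] for model in final])  # ['td_t1', 'miniprot_t2', 'braker_t2']
```

Here miniprot_t1 is dropped because it overlaps a reference model, whereas a merge would have kept it as an extra isoform of that locus. The order of the inputs therefore matters: the most trusted annotation goes first.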