large differences in the number of genes predicted from different varieties of the same species

Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

large differences in the number of genes predicted from different varieties of the same species #826

Open ChuanzhengWei opened 4 months ago

ChuanzhengWei commented 4 months ago

I used identical RNA-seq data and protein sequences as input for BRAKER to predict gene models for eight different varieties of the same species. However, the number of gene models varies significantly among these varieties. Is such variability normal? How can I assess the accuracy of these gene models? If this variability is not expected, what steps should I take to address it? The genome sequences were assembled from HiFi reads with hifiasm, and their sizes are approximately the same. Here are my gene numbers:

1.braker.gtf 34487 genes
2.braker.gtf 34415 genes
3.braker.gtf 27988 genes
4.braker.gtf 34992 genes
5.braker.gtf 28213 genes
6.braker.gtf 35597 genes
7.braker.gtf 35041 genes
8.braker.gtf 28085 genes

Thank you for your attention to this matter.
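For readers who want to reproduce counts like the ones above, here is a minimal sketch of how they could be obtained. It assumes that every gene appears in the GTF as one line whose third column is `gene`; the function name `count_genes` is made up for this illustration, and files without explicit `gene` lines would need a different approach.

```shell
#!/usr/bin/env bash
# Count gene models in a GTF file.
# Assumption: each gene is represented by exactly one line with "gene" in column 3.
count_genes() {
    awk -F'\t' '$3 == "gene" { n++ } END { print n + 0 }' "$1"
}

# Print "<file><TAB><count>" for every GTF file given on the command line.
for gtf in "$@"; do
    printf '%s\t%s\n' "$gtf" "$(count_genes "$gtf")"
done
```

Saved as a script, it could be called with all eight files at once, for example with `1.braker.gtf` through `8.braker.gtf` as arguments.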

yaoxkkkkk commented 4 months ago

That looks acceptable, since 5 varieties have ~35000 genes while 3 have ~28000 genes. You can check their phylogenetic relationships to see whether they cluster in a similar pattern. You can also assess the gene sets with BUSCO.

yaoxkkkkk commented 4 months ago

You are right: the consistency between gene number and BUSCO score suggests that there is an issue, since the missing percentage is noticeably larger for the varieties with fewer genes. Did you use the same parameters and external evidence for all varieties? Maybe you can try to provide more specific evidence for varieties 3, 5, and 8.

I am a newbie at genome assembly and gene annotation and just want to share my opinion :)

On Saturday, May 25, 2024, ChuanzhengWei wrote:

1: C:97.4%[S:95.7%,D:1.7%],F:0.2%,M:2.4%,n:1614
2: C:97.5%[S:95.7%,D:1.8%],F:0.5%,M:2.0%,n:1614
3: C:94.4%[S:92.5%,D:1.9%],F:0.3%,M:5.3%,n:1614
4: C:97.5%[S:95.5%,D:2.0%],F:0.2%,M:2.3%,n:1614
5: C:94.9%[S:92.0%,D:2.9%],F:0.2%,M:4.9%,n:1614
6: C:97.3%[S:95.4%,D:1.9%],F:0.5%,M:2.2%,n:1614
7: C:97.5%[S:95.4%,D:2.1%],F:0.2%,M:2.3%,n:1614
8: C:93.8%[S:91.8%,D:2.0%],F:0.4%,M:5.8%,n:1614

This is the result of running BUSCO with embryophyta_odb10. The gene sets with fewer predicted genes showed lower C values and higher M (missing) values. I reran BRAKER on the genomes with fewer gene models and obtained similar results. Do you have any better solutions for my situation? I need your help!
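To make the pattern in these summaries easier to see, the complete (C) and missing (M) percentages can be pulled out of each line. The sketch below assumes the one-line format used above, `<label>: C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the function name `busco_c_m` is hypothetical.

```shell
#!/usr/bin/env bash
# Read BUSCO one-line summaries on stdin and print "<label> <C%> <M%>" per line.
# Assumed input format: "3: C:94.4%[S:92.5%,D:1.9%],F:0.3%,M:5.3%,n:1614"
busco_c_m() {
    sed -E 's/^([^:]+): C:([0-9.]+)%.*M:([0-9.]+)%.*/\1 \2 \3/'
}

# Example with two lines from this thread: keep only entries with more than 4% missing.
printf '%s\n' \
    '1: C:97.4%[S:95.7%,D:1.7%],F:0.2%,M:2.4%,n:1614' \
    '8: C:93.8%[S:91.8%,D:2.0%],F:0.4%,M:5.8%,n:1614' |
    busco_c_m | awk '$3 > 4'
# prints: 8 93.8 5.8
```

The 4% cutoff is only an illustrative threshold for separating the two groups seen here, not a recommended quality criterion.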

ChuanzhengWei commented 4 months ago

These are my genome and protein BUSCO results (genome first, then protein):

1: genome C:97.7%[S:94.5%,D:3.2%],F:2.0%,M:0.3%,n:1614; protein C:97.4%[S:95.7%,D:1.7%],F:0.2%,M:2.4%,n:1614
2: genome C:97.8%[S:94.6%,D:3.2%],F:1.9%,M:0.3%,n:1614; protein C:97.5%[S:95.7%,D:1.8%],F:0.5%,M:2.0%,n:1614
3: genome C:97.5%[S:94.2%,D:3.3%],F:2.2%,M:0.3%,n:1614; protein C:94.4%[S:92.5%,D:1.9%],F:0.3%,M:5.3%,n:1614
4: genome C:97.4%[S:94.2%,D:3.2%],F:2.4%,M:0.2%,n:1614; protein C:97.5%[S:95.5%,D:2.0%],F:0.2%,M:2.3%,n:1614
5: genome C:96.9%[S:93.7%,D:3.2%],F:2.8%,M:0.3%,n:1614; protein C:94.9%[S:92.0%,D:2.9%],F:0.2%,M:4.9%,n:1614
6: genome C:97.7%[S:94.4%,D:3.3%],F:2.0%,M:0.3%,n:1614; protein C:97.3%[S:95.4%,D:1.9%],F:0.5%,M:2.2%,n:1614
7: genome C:97.7%[S:94.2%,D:3.5%],F:2.1%,M:0.2%,n:1614; protein C:97.5%[S:95.4%,D:2.1%],F:0.2%,M:2.3%,n:1614
8: genome C:97.4%[S:94.1%,D:3.3%],F:2.4%,M:0.2%,n:1614; protein C:93.8%[S:91.8%,D:2.0%],F:0.4%,M:5.8%,n:1614

My script looks like this:

singularity exec braker.pl --threads=20 --species=ss186 \
    --genome=s186.fasta.masked \
    --prot_seq=annotation_cdhit.fasta \
    --bam=......

It doesn't look like there's anything wrong with my genome assembly.

yaoxkkkkk commented 4 months ago

Yes, I agree. I tend to believe that the difference is caused by variety-specific factors. The BAM files are each variety's own RNA-seq data, right? In that case I can't provide more information.

If you want to improve the BUSCO score, maybe you can try the --busco_lineage parameter: https://github.com/Gaius-Augustus/BRAKER?tab=readme-ov-file#--busco_lineagelineage

Eden-yike commented 3 months ago

I have the same problem as @ChuanzhengWei: the two haplotypes of a species both have a genome size of ~656 Mb, but the numbers of genes are 30967 (haplotype 1) and 36847 (haplotype 2), respectively. The protein BUSCO scores are 96.8%[S:89.1%,D:7.7%] and 96.9%[S:89.2%,D:7.7%].

Eden-yike commented 3 months ago

Hi, has this problem been solved?

ChuanzhengWei commented 3 months ago

Unfortunately, I have not resolved this issue, and I haven't figured out where the problem lies. Do you have any good suggestions or ideas to solve the problem?

Eden-yike commented 3 months ago

No, I also don't know where the problem lies.

KatharinaHoff commented 3 months ago

Have you visualized the annotations together with the evidence in a genome browser? That would be the first thing to do. Maybe jobs failed on some sequences, or maybe there is misleading evidence; it is hard to say without looking at the data.
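Before opening a genome browser, one quick way to look for sequences on which prediction may have failed is to list the genome sequences that have no entry at all in the GTF. This is only a rough sketch, not part of BRAKER: the function name `seqs_without_genes` is hypothetical, and it assumes that the GTF sequence names match the FASTA header names up to the first whitespace.

```shell
#!/usr/bin/env bash
# List sequences of a genome FASTA that have no feature at all in a GTF file.
# Usage: seqs_without_genes genome.fasta annotation.gtf
# Assumption: column 1 of the GTF equals the FASTA header up to the first whitespace.
seqs_without_genes() {
    awk -F'\t' '
        # First file (the GTF): remember every sequence name seen in column 1.
        FILENAME == ARGV[1] { if ($0 !~ /^#/ && NF > 0) seen[$1] = 1; next }
        # Second file (the FASTA): report headers whose name never occurred in the GTF.
        /^>/ {
            name = substr($1, 2)
            sub(/[ \t].*/, "", name)
            if (!(name in seen)) print name
        }
    ' "$2" "$1"
}
```

Small contigs without any gene are not unusual, so the output is mainly informative when a long sequence shows up in it; those regions are good candidates to inspect first in the browser.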
