Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Annotating three genomes of different fungal strains (Edited) #423

Closed mhfk closed 2 years ago

mhfk commented 2 years ago

Hi,

I would like to ask your opinion on a workflow for annotating three genome assemblies of differing quality from one fungal species. I have genome A (assembled from long + short reads), B (assembled from short reads) and C (assembled from short reads), where the order of assembly quality (in terms of number of contigs) is A>B>C.

So far, I have annotated A and C using RNA-seq data and obtained more than 23k genes in A and 27k genes in C. I suspect the significant discrepancy was due to genome C being too contiguous. I am planning to supply the final AUGUSTUS hints from genome A to the annotation runs for both genome C and genome B. Is that possible in BRAKER2? Is this the correct way to approach this?

This is my command for both runs:

braker.pl --species=fungal_X --cores=16 --augustus_args="--species=X" --softmasking --fungus --gff3 --genome=genome_A --bam=RNA-on-genome_A.bam

EDIT: I just realized that there is a pipeline where I can use pre-trained parameters to predict genes. After trying this workflow, I got a similar number of genes. What is your opinion on this? Also, why are BAM files still needed for this workflow?

KatharinaHoff commented 2 years ago

You can do this. For example, you can combine the predicted proteins from A with the suitable OrthoDB fraction for fungi and run BRAKER2 with proteins. Then combine the output with the BRAKER1/RNA-seq output using TSEBRA.

However, that does not guarantee that you get similar numbers of genes and transcripts in A, B and C. That might be impossible.
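
The protein-plus-RNA-seq workflow described above could look roughly like the following sketch. It is not taken from this thread: every file name, directory name, and the species name `fungal_X_prot` is a hypothetical placeholder, and the TSEBRA configuration path depends on where TSEBRA is installed. The sketch assumes an RNA-seq (BRAKER1) run for the same assembly already exists in its own working directory. By default the commands are only printed; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch only: protein-based BRAKER2 run plus TSEBRA merge for one assembly.
# All paths and species names are hypothetical placeholders.
set -eu

# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Concatenate the proteins predicted in genome A with an OrthoDB fungal set.
combine_proteins() {
    cat "$1" "$2" > "$3"
}

annotate_with_proteins() {
    genome=$1      # softmasked target assembly, e.g. genome_B.fa
    proteins=$2    # combined protein FASTA from combine_proteins
    rnaseq_dir=$3  # working directory of the earlier RNA-seq run
    prot_dir=$4    # working directory for the protein run

    # Protein-based run; a species name not used by an earlier run is chosen.
    run braker.pl --genome="$genome" --prot_seq="$proteins" \
        --species=fungal_X_prot --softmasking --fungus --cores=16 \
        --workingdir="$prot_dir"

    # Merge the two gene sets, passing the hints of both runs as evidence.
    run tsebra.py \
        -g "$rnaseq_dir/augustus.hints.gtf,$prot_dir/augustus.hints.gtf" \
        -e "$rnaseq_dir/hintsfile.gff,$prot_dir/hintsfile.gff" \
        -c default.cfg -o tsebra_combined.gtf
}

# Example (hypothetical file names):
#   combine_proteins genome_A.aa orthodb_fungi.fa proteins.fa
#   annotate_with_proteins genome_B.fa proteins.fa braker1_B braker2_B
```

The merge step takes the GTF predictions and hints files from both runs, so the RNA-seq run and the protein run should target the same assembly.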

tomasbruna commented 2 years ago

Hi @mhfk,

I'm answering the questions in the edit (I'm not sure Katharina saw the edits).

I just realized that there is a pipeline where I can use pre-trained parameters to predict genes. After trying this workflow, I got a similar number of genes. What is your opinion on this?

If all the assemblies are from the same species, it makes sense to train on genome A and use the model for predictions in B and C.

Also, why are BAM files still needed for this workflow?

The hints from the BAM file are used not only during model training but also during prediction with pre-trained parameters. You can run the prediction with pre-trained parameters without a BAM file, but this will likely result in worse predictions (unless you use proteins instead, as described in Katharina's answer).
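
As an illustration of prediction with pre-trained parameters plus hints, a run on a second assembly might be sketched as below. The file names and the species name are hypothetical. The sketch assumes the AUGUSTUS parameters trained in the genome A run are available under the given species name, and that the RNA-seq reads were aligned to the assembly being annotated, since hints are tied to that assembly's coordinates. `--fungus` is left out here because it is documented as a GeneMark option and `--skipAllTraining` skips GeneMark. Commands are printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: predict genes with existing AUGUSTUS parameters and RNA-seq hints.
# File names and the species name are hypothetical placeholders.
set -eu

# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

predict_with_pretrained() {
    genome=$1   # softmasked target assembly, e.g. genome_B.fa
    bam=$2      # RNA-seq reads aligned to that same assembly
    species=$3  # AUGUSTUS species trained in the genome A run

    # --skipAllTraining skips GeneMark and AUGUSTUS training; hints are
    # still generated from the BAM file and used during prediction.
    run braker.pl --genome="$genome" --bam="$bam" \
        --species="$species" --useexisting --skipAllTraining \
        --softmasking --gff3 --cores=16
}

# Example (hypothetical file names):
#   predict_with_pretrained genome_B.fa RNA-on-genome_B.bam fungal_X
```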

mhfk commented 2 years ago

Thank you for both of your answers! @tomasbruna @KatharinaHoff

I am going forward with the TSEBRA workflow and so far the results seem promising. I'm just curious whether there is a way to extract amino acid sequences from the output of TSEBRA. (Using getAnnoFasta.pl only gives me the coding sequences.)

Thank you for the attention and answers but most importantly thank you for developing these tools!

KatharinaHoff commented 2 years ago

getAnnoFastaFromJoingenes.py for example does that.
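
A possible invocation, as a sketch with hypothetical file names: the script takes the genome FASTA, the GTF produced by TSEBRA, and an output stem, and is expected to write the coding and protein sequences to files named after that stem (for example `tsebra_combined.codingseq` and `tsebra_combined.aa`). The command is printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: extract coding and protein sequences from a TSEBRA GTF.
# File names are hypothetical placeholders.
set -eu

# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

extract_proteins() {
    genome=$1  # genome FASTA the annotation refers to
    gtf=$2     # gene set produced by TSEBRA
    stem=$3    # output file stem

    run getAnnoFastaFromJoingenes.py -g "$genome" -f "$gtf" -o "$stem"
}

# Example (hypothetical file names):
#   extract_proteins genome_A.fa tsebra_combined.gtf tsebra_combined
```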

mhfk commented 2 years ago

Perfect. Thank you for your help @KatharinaHoff !