EVidenceModeler / EVidenceModeler

source code for EVM
BSD 3-Clause "New" or "Revised" License

Erroneous GFF file. #67

Closed fangbohao closed 1 year ago

fangbohao commented 1 year ago

Hi, I am writing to ask why my GFF file is reported as erroneous by "gff3_gene_prediction_file_validator.pl".

This GFF file, "isoseq_all_four_tissues.fastq.transcript_models_IDfixed.gff", was converted from a GTF file produced by IsoQuant, a transcript assembler for long-read RNA-seq data. To convert the GTF to GFF, I used gffread, and AGAT to correct the IDs.

I have attached both GTF and GFF files for your check below.

Could you please help with the following?

1) Is there anything wrong with the GFF file I tried to input?

2) What would be a good tool for converting the GTF (from IsoQuant) to an EVM-compatible GFF?

3) If the file passes validation, should I assign it as 'ABINITIO_PREDICTION' in EVM, since it already contains gene structures?

Thank you very much! Bohao

GTF file output by IsoQuant: https://drive.google.com/file/d/1VkC0XwDBSoZA0YscYYhiGDBdkxJLzRDG/view?usp=sharing

GFF file converted from the above GTF: https://drive.google.com/file/d/181egnDJBxk9XUwVBu4yf7aDLD0q8JFwn/view?usp=sharing

brianjohnhaas commented 1 year ago

Hi Bohao,

I'll check this out and get back to you.

best,

~b


brianjohnhaas commented 1 year ago

Here's a converter you can use for your IsoQuant GTF that'll convert it to the GFF3 formatting that EVM prefers: https://github.com/EVidenceModeler/EVidenceModeler/blob/devel/EvmUtils/misc/align_GTF_to_align_GFF3.pl

Just drop it into your EVM software distribution under EvmUtils/misc/ and you can run like so:

EVidenceModeler/EvmUtils/misc/align_GTF_to_align_GFF3.pl 00_isoseq_all_four_tissues.fastq.transcript_models.gtf IsoQuant > IsoQuant.gff3

This would then be fed into EVM, listed under the 'TRANSCRIPT' evidence type in the EVM weights file.
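As a rough illustration of the weights-file entry mentioned above, here is a minimal Python sketch. It assumes the usual three-column weights layout (evidence class, source label, weight, tab-separated) and that the "IsoQuant" argument given to the converter ends up as the source label in column 2 of the GFF3. The function name `format_weights` and the weight value of 10 are hypothetical placeholders, not recommendations.

```python
# Evidence classes that may appear in column 1 of an EVM weights file.
EVIDENCE_CLASSES = {"ABINITIO_PREDICTION", "OTHER_PREDICTION", "PROTEIN", "TRANSCRIPT"}


def format_weights(entries):
    """Return weights-file text, one 'CLASS<TAB>source<TAB>weight' line per entry.

    `entries` is an iterable of (evidence_class, source, weight) tuples.
    The source should match column 2 of the corresponding GFF3 file.
    """
    lines = []
    for evidence_class, source, weight in entries:
        if evidence_class not in EVIDENCE_CLASSES:
            raise ValueError(f"unknown evidence class: {evidence_class}")
        lines.append(f"{evidence_class}\t{source}\t{weight}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder weight; tune weights for your own evidence sets.
    print(format_weights([("TRANSCRIPT", "IsoQuant", 10)]), end="")
```

The printed text can be saved as the weights file; any further evidence sources would each get their own line in the same format.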

If you want to include potential coding regions within these IsoQuant structures, you'll want to run TransDecoder on it and get the 'gene structure GFF3 file' format from it, which you can include in EVM as the 'OTHER_PREDICTION' type.

A good example would be our application of TransDecoder to StringTie (but use your IsoQuant here instead): https://github.com/TransDecoder/TransDecoder/tree/master/sample_data/stringtie_example

Just let me know if any of it gives you trouble.

best,

~brian


fangbohao commented 1 year ago

Hi Brian, thank you so much for clarifying the above method in detail! It works for IsoQuant format conversion now.

I was wondering if you have a way to convert a TOGA (https://github.com/hillerlab/TOGA#output-reading) output BED file to a GFF3 that could be fed to EVM? Should I then supply the TOGA GFF3 via '--transcript_alignments TOGA.gff3' and list it as 'TRANSCRIPT' in the weights file?

I attached a TOGA output BED file for your reference below.

TOGA bed file: https://drive.google.com/file/d/11pOKnry8beiQQQ0iCKrKq3_ym2guvhCP/view?usp=sharing

Thank you! Bohao

brianjohnhaas commented 1 year ago

Sure thing. Can you send me one of the genome contigs so that I can verify that the prediction format makes sense, i.e., that the predictions translate correctly into proteins?

Any contig will do.

many thanks,

~brian


fangbohao commented 1 year ago

Hi Brian, thank you so much for your prompt reply!

Please find a FASTA of Chr 25 from here: https://drive.google.com/file/d/1RPJ4kzRXUQaQBBPpjfORKa-e10k5ifxm/view?usp=sharing

Just in case you want the whole genome: https://s3.amazonaws.com/genomeark/species/Haemorhous_mexicanus/bHaeMex1/assembly_curated/bHaeMex1.pri.cur.20220203.fasta.gz

Thank you! Bohao

brianjohnhaas commented 1 year ago

Perfect. Thanks!

Just drop this into your EVM distro: https://github.com/EVidenceModeler/EVidenceModeler/blob/devel/EvmUtils/misc/gene_BED_to_gene_GFF3.pl

You can then use this with the TOGA bed file:

EVidenceModeler/EvmUtils/misc/gene_BED_to_gene_GFF3.pl query_annotation.bed TOGA > TOGA.gff3

and incorporate it as the 'OTHER_PREDICTION' type.
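For readers curious about what a BED-to-GFF3 conversion involves, the following Python sketch shows the coordinate arithmetic for a single BED12 record. It is an independent illustration and not the logic of gene_BED_to_gene_GFF3.pl. It assumes standard 12-column BED (0-based, half-open coordinates, with the coding span given by thickStart/thickEnd); whether that span includes the stop codon in TOGA output is not checked here. The function name `bed12_to_intervals` and the example record are hypothetical.

```python
def bed12_to_intervals(line):
    """Convert one BED12 line to 1-based, inclusive exon and CDS intervals."""
    f = line.rstrip("\n").split("\t")
    chrom, start, name, strand = f[0], int(f[1]), f[3], f[5]
    thick_start, thick_end = int(f[6]), int(f[7])
    # blockSizes and blockStarts may carry a trailing comma.
    sizes = [int(x) for x in f[10].split(",") if x]
    offsets = [int(x) for x in f[11].split(",") if x]

    exons, cds = [], []
    for size, offset in zip(sizes, offsets):
        ex_start = start + offset      # 0-based, inclusive
        ex_end = ex_start + size       # 0-based, exclusive
        exons.append((ex_start + 1, ex_end))  # shift to 1-based, inclusive
        # Clip the exon to the coding (thick) span to get its CDS part.
        c_start = max(ex_start, thick_start)
        c_end = min(ex_end, thick_end)
        if c_start < c_end:
            cds.append((c_start + 1, c_end))
    return {"chrom": chrom, "name": name, "strand": strand,
            "exons": exons, "cds": cds}


if __name__ == "__main__":
    # Made-up record: three exons, with the last exon entirely untranslated.
    record = "\t".join([
        "chr25", "1000", "2000", "geneA", "0", "+",
        "1100", "1900", "0", "3", "200,300,100,", "0,400,900,",
    ])
    print(bed12_to_intervals(record))
```

A full converter would additionally emit gene, mRNA, exon, and CDS lines with ID/Parent attributes and CDS phases, which this sketch leaves out.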

best,

~b


fangbohao commented 1 year ago

Hi Brian, thank you - the TOGA file has been converted to GFF3 very well!

Could you help with my final file conversions (hopefully the final ones)? There are two GTF files output by the BRAKER pipeline (https://github.com/Gaius-Augustus/BRAKER), which contain genes predicted by AUGUSTUS and GeneMark based on RNA (file "braker_1.gtf") and protein ("braker_1.gtf"). I would like to incorporate them into EVM, but the Perl scripts in EVM that I have tried so far have all failed.

Please find the two files from this link: https://drive.google.com/drive/folders/1-EObbpJB0_cwQzO_S-yUvA7rrmHJj4DX?usp=sharing

Thank you! Bohao

brianjohnhaas commented 1 year ago

Sure, I'll take a look. I think we should already have something that'll do this, but we'll see how it goes.

best,

~b


brianjohnhaas commented 1 year ago

Here you go:

https://github.com/EVidenceModeler/EVidenceModeler/blob/devel/EvmUtils/misc/braker_GTF_to_EVM_GFF3.pl

and you'll want to replace your copy of this module with the updated version:

https://github.com/EVidenceModeler/EVidenceModeler/blob/devel/PerlLib/GFF3_utils.pm

let me know how it goes, please.

best

~b

fangbohao commented 1 year ago

Hi Brian, thank you so much for your script! I ran into the following issues.

1- 'braker_GTF_to_EVM_GFF3.pl' only retained the AUGUSTUS genes (shown in the 2nd column of the GFF3) and missed the GeneMark.hmm3 genes.

2- The resulting GFF3 could not pass the check by 'gff3_gene_prediction_file_validator.pl'.

3- I am using Singularity to run EVM. Is there a way to add 'GFF3_utils.pm' into it? I had not added it before running the above commands.
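The first point can be checked independently by tallying the source (column 2) and feature type (column 3) of every record in the converted file. Below is a minimal Python sketch of such a tally; the function name `tally_sources` and the example records are hypothetical, and the sketch does not reproduce what gff3_gene_prediction_file_validator.pl checks.

```python
from collections import Counter


def tally_sources(gff3_lines):
    """Count GFF3 records per (source, feature type) pair.

    Comment lines, blank lines, and lines with fewer than nine
    tab-separated columns are skipped.
    """
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        counts[(fields[1], fields[2])] += 1
    return counts


if __name__ == "__main__":
    # Made-up records; in practice, pass an open file handle instead.
    example = [
        "##gff-version 3\n",
        "chr25\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1\n",
        "chr25\tAUGUSTUS\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n",
        "chr25\tGeneMark.hmm3\tgene\t2000\t2600\t.\t-\t.\tID=g2\n",
    ]
    for (source, feature), n in sorted(tally_sources(example).items()):
        print(source, feature, n, sep="\t")
```

If a source that is present in the input GTF has no gene records in the tally of the output, the converter dropped it.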

Thank you! Bohao

brianjohnhaas commented 1 year ago

ah - good check. Let me address this shortly.


brianjohnhaas commented 1 year ago

ok, should work as expected now.

Instead of using the singularity image, see if you can just pull the zip file from here: https://github.com/EVidenceModeler/EVidenceModeler/tree/devel

(under the <>Code button)

and you can run the converter from within that codebase. Nothing needs to be compiled there.

Let me know if this gives any issues.

best,

~b


fangbohao commented 1 year ago

Hi Brian, thank you for your help along the way - it works well on my EVM running now! Best, Bohao

brianjohnhaas commented 1 year ago

Awesome, thanks!
