baoxingsong / AnchorWave

sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism and whole-genome duplication variation
MIT License
142 stars 19 forks source link

Empty CDS file from gff2seq #36

Open nwespe opened 1 year ago

nwespe commented 1 year ago

Hello,

I am getting an empty CDS file from running gff2seq with no error message. My command is anchorwave gff2seq -i GCF_003119195.2_ASM311919v2_genomic.gff -r GCF_003119195.2_ASM311919v2_genomic.fna -o cds.fa The genomic data is directly from NCBI https://www.ncbi.nlm.nih.gov/genome/95891?genome_assembly_id=1473191

I am running it on Ubuntu 20.04.4 in a conda environment after only installing anchorwave.

baoxingsong commented 1 year ago

Hi, thanks for giving AnchorWave a try. The format of GCF_003119195.2_ASM311919v2_genomic.gff is different from what AnchorWave expects. Every gene has the same ID "gene-". I am not sure I could parse this GFF file even manually.

Do you have the GFF file for another genome, please?

matthewwiese commented 9 months ago

@baoxingsong Hi Baoxing, some of us in the Buckler Lab recently encountered this and it caused quite a bit of confusion since AnchorWave provides no error or message to explain why the CDS file is empty.

Would it be possible to add a message (e.g. "Please check your GFF for genes with a unique ID") when gff2seq writes an empty CDS file?

matthewwiese commented 8 months ago

I've received correspondence related to this open issue, and want to post further details on how I solved my situation in case it helps others.

After comparing with a working GFF, I discovered that the non-working GFF lacked a gene type (there were only CDS and mRNA), whereas the working GFF associated each mRNA to a gene with the same start/stop/strandedness etc.

For others, inspect your non-working GFF and compare how it associates CDS, mRNA, and gene together (paying particular attention to the attributes detailed in column 9):

##gff-version   3
chr7    CLEAN   gene    25420269        25421320        .       -       .       ID=Pgl_GLEAN_10006696;Name=Pgl_GLEAN_10006696;
chr7    CLEAN   mRNA    25420269        25421320        .       -       .       ID=Pgl_GLEAN_10006696.mrna;Parent=Pgl_GLEAN_10006696;Name=Pgl_GLEAN_10006696.mrna;
chr7    GLEAN   CDS     25421320        25421713        .       -       0       ID=cds1;Parent=Pgl_GLEAN_10006696.mrna;
chr7    GLEAN   CDS     25420562        25420635        .       -       2       ID=cds2;Parent=Pgl_GLEAN_10006696.mrna;
chr7    GLEAN   CDS     25420153        25420269        .       -       0       ID=cds3;Parent=Pgl_GLEAN_10006696.mrna;

Props to @agostof for help with this

jgroh commented 3 months ago

Hi,

I've encountered the above issue with several different assemblies from independent groups downloaded directly from NCBI, for example one here. There is nothing obviously wrong with the GFF files, and in this case it doesn't lack a gene category as suggested above.

Here's the command I'm using

anchorwave gff2seq -r GCA_029852735.1_ASM2985273v1_genomic.fna -i genomic.gff -o test

There is no error message, giving the impression the command worked, but the resulting file is empty. It's unclear to me what about the GFF format of this file Anchorwave doesn't like.

Any suggestions?

matthewwiese commented 3 months ago

@jgroh My guess is that AnchorWave's regex is stumbling on the pipes | in your GFF's ID attribute. I ran the command you provided on the GFF from NCBI you linked after modifying it:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM2985273v1
#!genome-build-accession NCBI_Assembly:GCA_029852735.1
##sequence-region CM056809.1 1 97805473
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3435
CM056809.1  Genbank region  1   97805473    .   +   .   ID=CM056809.1:1..97805473;Dbxref=taxon:3435;Name=1;chromosome=1;collected-by=Onkar Nath;country=Australia: Queensland;cultivar=Hass;dev-stage=Mature Tree;gbkey=Src;genome=chromosome;isolate=MH2022;mol_type=genomic DNA;tissue-type=Leaves
CM056809.1  Genbank gene    55876   68562   .   -   .   ID=gene-MRB53_000001;Name=MRB53_000001;gbkey=Gene;gene_biotype=protein_coding;locus_tag=MRB53_000001
CM056809.1  Genbank mRNA    55876   68562   .   -   .   ID=rna-gnl;Parent=gene-MRB53_000001;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1  Genbank exon    68481   68562   .   -   .   ID=exon-gnl|WGS:JAQQXO|g1.t1-1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1  Genbank exon    67843   67935   .   -   .   ID=exon-gnl|WGS:JAQQXO|g1.t1-2;Parent=rna-gnl|WGS:JAQQXO|g1.t1;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1  Genbank exon    67005   67292   .   -   .   ID=exon-gnl|WGS:JAQQXO|g1.t1-3;Parent=rna-gnl|WGS:JAQQXO|g1.t1;gbkey=mRNA;locus_tag=MRB53_000001;orig_protein_id=gnl|WGS:JAQQXO|cds-35697;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein
CM056809.1  Genbank CDS 68481   68562   .   -   0   ID=cds-KAJ8646978.1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;Dbxref=NCBI_GP:KAJ8646978.1;Name=KAJ8646978.1;gbkey=CDS;locus_tag=MRB53_000001;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein;protein_id=KAJ8646978.1
CM056809.1  Genbank CDS 67843   67935   .   -   2   ID=cds-KAJ8646978.1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;Dbxref=NCBI_GP:KAJ8646978.1;Name=KAJ8646978.1;gbkey=CDS;locus_tag=MRB53_000001;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein;protein_id=KAJ8646978.1
CM056809.1  Genbank CDS 67005   67292   .   -   2   ID=cds-KAJ8646978.1;Parent=rna-gnl|WGS:JAQQXO|g1.t1;Dbxref=NCBI_GP:KAJ8646978.1;Name=KAJ8646978.1;gbkey=CDS;locus_tag=MRB53_000001;orig_transcript_id=gnl|WGS:JAQQXO|g1.t1;product=hypothetical protein;protein_id=KAJ8646978.1

The mRNA value was changed from ID=rna-gnl|WGS:JAQQXO|g1.t1; to ID=rna-gnl;.

And this is the result I get with AnchorWave v1.2.3:

>rna-gnl gene-MRB53_000001
ATGGAAGGACCCCACCGCACCATCTATCATGAGTGGGCCACATTTGGTGAAGGGCCCACTACTAGTAACCAGGACGTTACCAAAGAAGAAGAATCTGACTTGGGGTGTGTCCGCTTCCGCCTTTTGAATTGGGCAAATCACAGCAGGCAGCAAGCAATGTTTGGTTCCACGAATTCTTTTGGGCAGTCTTCAAGTAGCCCATTTGGAACCCAATCAGTTTTTGGGCAGACAAACAATGCTACCACCAATCCTTTTGCTCCCAAACCCTTTGGTAGTACAAGCCCTTTTGGTTCCCAGACTGGCAGTTCTGTATTTGGTGGCACATCAACTGGTGTATTTGGAACACCACAATCTTCTACACCAGCGTTTGGTGCTTCGACGGCACCCGCTTTTGGAAGCTCGGTGCCTGGTTTTGGGGCGCCATCAACTCCGTCTTTTGGTAGTGCTTCGTCCTCTTTTGGTG

You might have to manually edit your GFF unless @baoxingsong updates the regular expression for you.

matthewwiese commented 3 months ago

I took a second look at this, and it's actually not just the pipes but the name itself. For example, none of the following ID values for the mRNA line work:

rna-gnlt1
rna-gnl.t1
rna-gnl:g1

Only rna-gnl results in a successful execution given my tiny subset of your GFF. Baoxing will have to chime in. I am not familiar enough with the GFF spec to understand what AnchorWave is doing here.

matthewwiese commented 3 months ago

OK yeah I do think it's a regex issue with the pipes | and other ignored characters.

My hunch is that the CDS parsing of Parent=rna-gnl|WGS:JAQQXO|g1.t1; breaks on the pipes etc, such that the parsed value is rna-gnl and the rest is lost. That fits with Baoxing's regex from here - online regex demo.

Edit: So to clarify, the previous three values from this comment actually do work. You just have to update the CDS entries to match. The regex finds them just fine, so long as there isn't a pipe.

Zhaoqiyue5 commented 3 weeks ago

Hello, I have a question for you. When I run the code anchorwave gff2seq -i SL5.0.gff3-r CDs.fa > Anchorwave.log 2>&1, there is no error but the generated cds file is empty. And my gff3 file and fasta file have the same chromosome names from Chr01 to Chr12 respectively