Confidence score / database suggestion?

gavinmdouglas commented 2 years ago

Hi there,

Thanks for making this tool - there's definitely a need for a consistent framework for calling pseudogenes across disparate prokaryotes, and this seems like a great method to do that.

While testing it on an E. coli genome (GCA_000529555.1) I noticed that a very large number of pseudogenes were identified: 2322 pseudogenes compared with 4504 intact genes.

The input files were the prokka outputted GenBank file (with the options that you suggest, such as --compliant).

This is the exact pseudofinder (v1.1.0) command I used:

pseudofinder.py annotate --genome /data1/gdouglas/projects/hgt_fragments/prokka_output_rerun/Escherichia_coli/GCA_000529555.1/GCA_000529555.1.gbk \
                         --outprefix GCA_000529555.1 \
                         --database /data1/gdouglas/db/UniRef/uniref90.dmnd \
                         --compliant \
                         --threads 16 \
                         --diamond

Do you think the UniRef database I used and/or the fact that I'm aligning with DIAMOND could lead to an elevated number of pseudogenes? More generally, are there any confidence score metrics or other values in the outfiles that I could use to help identify high confidence pseudogenes?

Thanks,

Gavin

Arkadiy-Garber commented 2 years ago

Hi Gavin,

Thanks for your interest in pseudofinder, and for this note!

It does, indeed, seem unusual for this E. coli genome to contain so many pseudogenes! After inspecting the NCBI assembly page for this entry, it looks to be excluded from RefSeq, with the stated reason "many frameshifted proteins".

So it looks like NCBI's PGAP software also detected a high amount of pseudogenes. Digging further into the publication associated with this genome assembly, I see that it was sequenced using the IonTorrent platform, which is notoriously bad at resolving homopolymers, resulting in artifactual indels, resulting in false frameshifts. Here is one relevant article about it, but if you google something like 'ion torrent frameshift indel', you can see more articles.

With regard to a confidence score metric for pseudogene prediction, this is something that we are currently working on, but this score would be most useful if you have a closely related genome that you can provide to pseudofinder, via the --reference flag.

Hope this helps clear things up! Don't hesitate to reach out if you have other questions or issues with the software.

Thanks, Arkadiy

gavinmdouglas commented 2 years ago

Thanks a lot Arkadiy, that's very helpful.

I guess I was just unlucky to choose that genome as a first test - I ran pseudofinder on three other E. coli genomes and there are < 50% as many pseudogenes.

These are the results for the additional three genomes:

Genome           Num_pseudogenes    Num_intact
GCF_000446685.2     950                  4567
GCF_002461645.1      1019                 4559
GCF_004804205.1      898                 4238

Still seems a bit high (e.g., the NCBI annotation report for GCF_004804205.1 includes 107 pseudogenes) though. Do you think the use of the UniRef90 database (vs some other reference database) could be driving this result?

Thanks,

Gavin

Edit: I noticed that the column names were wrong when I looked back at this - fixed now

Arkadiy-Garber commented 2 years ago

Thanks for the note, Gavin. I will check these genomes out. Could you please send me 1) the command that you used, and 2) either the uniref90 database or a link where I can find it.

gavinmdouglas commented 2 years ago

Thanks again for the fast response.

1) The commands were all run as done in my first message (but with one thread)

pseudofinder.py annotate --genome $ACCESSION.gbk \
                         --outprefix $ACCESSION \
                         --database /data1/gdouglas/db/UniRef/uniref90.dmnd \
                         --compliant \
                         --threads 1 \
                         --diamond

Where $ACCESSION is the genome accession.

2) You can find the database file (and corresponding FASTA) I used here: https://drive.google.com/drive/folders/1HL4pxNlOzbkCjvIUvnR6HuZYLs-3puXa?usp=sharing

Thanks,

Gavin

gavinmdouglas commented 2 years ago

Here are the GenBank input files I used as well: https://drive.google.com/file/d/1HrFYkfOBqLb5vZF6Y06TJuAVAVzzkshR/view?usp=sharing

gavinmdouglas commented 2 years ago

Just a quick note that the column names were wrong in the table I posted (likely obvious, but the column with smaller counts was supposed to be the pseudogenes... Fixed now). Hopefully that didn't cause any confusion.

Arkadiy-Garber commented 2 years ago

Hi Gavin,

Thanks for the note, and apologies for the delay in getting to this. I will look into this within the next few days.

Cheers, Arkadiy

gavinmdouglas commented 2 years ago

Aha, I noticed that most pseudogene hits in the output GFF have this listed in the description: "Reason(s):Intergenic region with N blast hits." (where N is an integer generally between 5 - 20).

After eliminating all hits with this in the description the counts are much more reasonable (although not for GCA_000529555.1 still, which makes sense based on what you found):

GCA_000529555.1 1142 GCF_000446685.2. 169 GCF_002461645..1. 229 GCF_004804205.1. 154

So I'm thinking that if I eliminate these pseudogene candidates that this will likely leave me with a higher quality set? The other "Reasons" tend to be "Predicted fragmentation of a single gene" and "ORF is X% of the average length of hits to this gene", which seem like they might be more reliable. What do you think?

I'm also thinking that just because I may be using a larger database than what you tested with that it may be easier for there to be an absolute number of BLAST hits that match to an intergenic region, which would explain why I'm identifying so many likely spurious pseudogenes based on that criteria, whereas it seems you did not in your validations.

Thanks,

Gavin

Arkadiy-Garber commented 2 years ago

Hi Gavin,

Yes, that makes sense. Sorry about not being able to get to this in a timely manner, and I appreciate you following up.

This is something that we have discussed before - intergenic regions that accumulate enough BLAST hits from a reference database will be reported as pseudogenes. If you are using a database as comprehensive as the non-redundant database from NCBI, you may see more intergenic regions reported as pseudogenes. Whether these are actually pseudogenes (e.g. very ancient remnants of what were once functional genes) or false positives (random BLAST hits) is something that we are still thinking about.

In any case, you can control how many of these intergenic regions are reported by changing the values for the following flags: --evalue and --intergenic_length.

Hope this helps clear things up. Let me know if you have any questions about this.

Thanks! Arkadiy

gavinmdouglas commented 2 years ago

Thanks Arkadiy. I think I'll just entirely ignore that class of pseudogenes since I'm a little skeptical of them (unless you strongly disagree).

Thanks for the help!

Gavin

filip-husnik / pseudofinder

Confidence score / database suggestion? #28