filip-husnik / pseudofinder

Detection of pseudogene candidates in bacterial and archaeal genomes.
GNU General Public License v3.0
42 stars 16 forks source link

False Postive with known complete gene #52

Closed LijMeh closed 1 year ago

LijMeh commented 1 year ago

Firstly, thanks for developing this awesome bit of software, it's exactly what I needed for the analysis I've been doing, and has been an absolute lifesaver.

With that being said, I have recently run into an issue that is somewhat vexing: I'm looking at pseudogenes across P. aeruginosa strains, and in almost every strain the transcriptional regulator LasR is being identified as a pseudogene, with the call Reason(s):ORF is 64.3%25 of the average length of hits to this gene. (as an aside not sure what the 25 after the % is about). However, in each strain where this is the case, the gene itself is full length (720bp), and I can go into the pseudogene fasta file and see that the full length is being identified (the annotated protein in the .gbff file is full length as well).

When I look at the proteome.faa.blastP_output I can see that the top hit is LasR, with the following scores:

PA14_45960 | 100.000 | 239 | 0 | 0 | 1 | 239 | 1 | 239 | 0.0 | 497 | ref\|YP_791822.1\|gi\|116049375\| transcriptional regulator LasR

I totally understand that there are going to be some false positives, but I couldn't understand the process by which this specific gene was being called as a pseudogene, and I wanted to make sure I wasn't making some mistake that would have ramifications for the rest of my results.

(I ran pseudofinder.py at default settings against a number of .gbff files for circular P. aeruginosa genomes which were downloaded directly from NCBI).

Let me know if I can provide any information that would be helpful in troubleshooting this, and thanks in advance for the help.

LijMeh commented 1 year ago

Sorry, just realizing this as I was looking through the documentation: but does this have to do with Step 1 from the wiki?

Truncated - The length of a CDS is compared against the average length of its blast hits.

I noticed that about about 6 other genes are being returned as partial matches to the LasR protein in the proteome.faa.bastP_output file, all with relatively poor alignments. However, if I divide the protein length by the average of the alignment length across all these proteins I get ~64%, which seems to match with the truncated call I quoted in the original question.

I would have assumed that, in the case where the top hit is a 100% match it should just call it as complete, rather than comparing it to the average of the other hits, which, I think, might be what is happening here?

mitchso commented 1 year ago

Hi there,

You are correct in troubleshooting this! This is a tricky problem to address from our end. What it boils down to is that we do not have a way to be certain that in a given database, reference proteins are also not pseudogenes. Because of this we have chosen to always look at averages across hits, and also allow the user to tune blast parameters like e-value cutoff.

In your specific case with a 100% match to a known protein I would agree that it should not be flagged as a pseudogene, but we have not implemented or tested a general solution. ie if we were to implement an override mechanism, what would be the required information and what would be limits? We haven't explored that.

For now, you could take a look at the e-values for the problematic hits and adjust the parameter to eliminate them?

Also if you have suggestions let us know and we will aim to include them in future developments.

Best, Mitch

LijMeh commented 1 year ago

Thanks for the suggestions, I'll check out the e-value for the offending match and try changing the cutoffs.

After looking more into the issue, it seems like I just got unlucky with a single 900bp gene that shares just enough sequence overlap with the other regulatory genes I'm looking at to skew the average of every gene that returns it as a blast result.

I'm not sure if this would be a feasible approach, or really worth the time since this specific issue seems to be an edge case, but it might be nice to look at the lengths of the sequences that are being returned as blast results and exclude outliers. For example, in this case, my gene of interest is matching to 6 other genes, 5 of which are the same length and only one of which is significantly longer. (But again, maybe just changing the evalue is a better approach to this.)

mitchso commented 1 year ago

Totally agree that filtering outliers could be a viable approach - we have thought about different ways to do it but chose to leave it out for now, since again one big problem is that it is very context dependent. Is there a definition of outlier that could be applied generally across blast results for different genes and databases?

With time I think we could arrive at some solution that would limit false positives without also throwing away useful data, but for now we decided to instead keep the decisions as simple and explicit as possible by providing the table of all blast info and then listing a specific reason for the call at each pseudogene.