Open josemduarte opened 7 years ago
As the name EPPIC stands for protein-protein interface classifier, I would say that it is not that bad if we do not provide a call for nucleotide assemblies (I mean NOPREDs everywhere).
We can probably identify double-stranded DNA helices easily and produce a BIO call for them, but making a call for other, less clear cases would require extending the algorithm and doing some benchmarking.
What happens for nucleotide-protein interfaces?
Yes, definitely, I'd go for NOPREDs as the best solution. I wouldn't even bother calling the DNA double strands BIO; the NOPRED offers a more honest assessment.
Note that for nucleotide-protein interfaces we are able to score them based on the protein side only.
In any case for assemblies we'd need to catch those that are made of exclusively nucleotide chains and assign the NOPREDs to them. For assemblies with mixed protein/nucleotide chains, in principle we can score them based at least on some of the interfaces.
I'd go so far as to say that we should not generate assemblies that include an all-NOPRED interface. Then it's like we ignore nucleotide-nucleotide interfaces completely.
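The two checks discussed above (catching nucleotide-only assemblies, and skipping assemblies that engage a nucleotide-nucleotide interface) could be sketched roughly as follows. This is only an illustration in Python, not EPPIC code: the `Interface` and `Assembly` classes and both helper functions are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

PROTEIN = "protein"
NUCLEOTIDE = "nucleotide"


@dataclass
class Interface:
    # Molecule types of the two chains in contact (hypothetical model).
    chain_types: Tuple[str, str]

    def is_scorable(self) -> bool:
        # One protein side is enough to score the interface;
        # nucleotide-nucleotide interfaces stay NOPRED.
        return PROTEIN in self.chain_types


@dataclass
class Assembly:
    chain_types: List[str]       # molecule type of every chain in the assembly
    interfaces: List[Interface]  # interfaces engaged by the assembly


def is_nucleotide_only(assembly: Assembly) -> bool:
    """True if the assembly contains no protein chain at all."""
    return all(t == NUCLEOTIDE for t in assembly.chain_types)


def has_unscorable_interface(assembly: Assembly) -> bool:
    """True if at least one engaged interface is nucleotide-nucleotide."""
    return any(not i.is_scorable() for i in assembly.interfaces)


def filter_assemblies(assemblies: List[Assembly]) -> List[Assembly]:
    """Keep only assemblies whose engaged interfaces can all be scored."""
    return [a for a in assemblies if not has_unscorable_interface(a)]
```

With the first helper, nucleotide-only assemblies can be assigned NOPRED; with the filter, nucleotide-nucleotide interfaces are effectively ignored during assembly enumeration.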
I agree with @sbliven. It would be a good solution to ignore them, since we cannot score them properly.
This issue has a related one: what do we do when more than one high-scoring assembly is present in the crystal? Do we call all of them BIO? I have implemented a solution where all the high-scoring assemblies get the same call (NOPRED for now, but it can be changed) and the lower-scoring ones are called XTAL.
@sbliven also proposed to choose the assembly with the lowest stoichiometry as BIO and others as XTAL. I will create a pull request with my solution, but it can be changed to what we agree.
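To make the two tie-handling options concrete, here is a small Python sketch. It is not the code in the pull request: the `Assembly` class, the function names and the score tolerance are hypothetical, "high scoring" is interpreted as "tied for the top score", and "lowest stoichiometry" is interpreted as "smallest assembly size".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Assembly:
    id: int
    size: int     # number of chains in the assembly
    score: float  # assembly probability


def _top_ids(assemblies: List[Assembly], tol: float) -> List[int]:
    # Ids of all assemblies whose score ties with the best score.
    top = max(a.score for a in assemblies)
    return [a.id for a in assemblies if abs(a.score - top) <= tol]


def call_ties_as_nopred(assemblies: List[Assembly], tol: float = 1e-6) -> Dict[int, str]:
    """Option 1: tied top-scoring assemblies all get NOPRED, the rest XTAL."""
    tied = _top_ids(assemblies, tol)
    top_call = "NOPRED" if len(tied) > 1 else "BIO"
    return {a.id: top_call if a.id in tied else "XTAL" for a in assemblies}


def call_smallest_as_bio(assemblies: List[Assembly], tol: float = 1e-6) -> Dict[int, str]:
    """Option 2: among tied top scorers the smallest is BIO, all others XTAL."""
    tied = set(_top_ids(assemblies, tol))
    winner = min((a for a in assemblies if a.id in tied), key=lambda a: a.size)
    return {a.id: "BIO" if a.id == winner.id else "XTAL" for a in assemblies}


# Two assemblies tied at 0.50, one monomeric and one dimeric.
tied_case = [Assembly(1, 1, 0.50), Assembly(2, 2, 0.50)]
print(call_ties_as_nopred(tied_case))
print(call_smallest_as_bio(tied_case))
```

The first option avoids making a call that the scores do not support; the second always yields exactly one BIO assembly, at the cost of favouring the smaller one.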
The example that made me think about it, although the tie is due to the DNA interface, is 2rt8:
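Topologically valid assemblies in 2rt8

| id | Interf cluster ids | Size | Stoichiometry | Symmetry | Score | Predicted by |
|----|--------------------|------|---------------|----------|-------|--------------|
| 1  | {}                 | 1,1  | A,A           | C1,C1    | 0.50  |              |
| 2  | {1}                | 2    | A B           | C1       | 0.50  | pdb1         |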
These cases should be very rare though.
I think that with the current solution (see PR #166) we can make the release, so I will assign the further discussion of this issue to the 3.1 milestone.
We said we could implement a very naive scoring for nucleotide-only interfaces. We could calculate the number of base pairs between the two chains (using the distances between H-bonding atoms) and then give probability one if there are more than, say, two bp, or probability zero if there are fewer.
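A rough Python sketch of that naive scoring is given below. Everything in it is an assumption made for illustration: residues are plain dicts with a one-letter `base` and an `atoms` map of atom name to coordinates, only Watson-Crick pairs are considered, a base pair requires at least two donor-acceptor distances within 3.5 Å, and exactly two base pairs is treated as probability zero.

```python
import math

# Watson-Crick H-bonding atom pairs, using standard PDB atom names.
WC_HBONDS = {
    ("A", "T"): [("N1", "N3"), ("N6", "O4")],
    ("A", "U"): [("N1", "N3"), ("N6", "O4")],
    ("G", "C"): [("N1", "N3"), ("N2", "O2"), ("O6", "N4")],
}


def hbond_atom_pairs(base1, base2):
    """Atom-name pairs to check for base1/base2, in (base1, base2) order."""
    if (base1, base2) in WC_HBONDS:
        return WC_HBONDS[(base1, base2)]
    if (base2, base1) in WC_HBONDS:
        return [(b, a) for a, b in WC_HBONDS[(base2, base1)]]
    return []  # not a Watson-Crick combination


def is_base_pair(res1, res2, cutoff=3.5, min_hbonds=2):
    """True if enough H-bonding atom pairs lie within the distance cutoff."""
    n_hbonds = 0
    for name1, name2 in hbond_atom_pairs(res1["base"], res2["base"]):
        xyz1 = res1["atoms"].get(name1)
        xyz2 = res2["atoms"].get(name2)
        if xyz1 is not None and xyz2 is not None:
            if math.dist(xyz1, xyz2) <= cutoff:
                n_hbonds += 1
    return n_hbonds >= min_hbonds


def count_base_pairs(chain1, chain2):
    """Number of residue pairs across the two chains that form a base pair."""
    return sum(1 for r1 in chain1 for r2 in chain2 if is_base_pair(r1, r2))


def naive_probability(chain1, chain2, min_bp=2):
    """Probability 1.0 if more than min_bp base pairs are found, else 0.0."""
    return 1.0 if count_base_pairs(chain1, chain2) > min_bp else 0.0
```

The all-against-all loop is quadratic in chain length, which should be acceptable for typical nucleotide chains; the cutoff and the minimum number of base pairs would need benchmarking before being trusted.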
In DNA structures we can't use any of our indicators for contacts between nucleotide chains, even geometry doesn't apply for nucleotide chains. For interface scoring we solved it by having nopreds for contacts between nucleotide chains. But for assembly scoring we have no solution at the moment, e.g. an assembly of 2 chains of double-stranded DNA is called XTAL. Can we do better than that?
A good example is 2rt8 (NMR), now in the dev server: it has only 1 interface, between the 2 strands of DNA, which we call NOPRED. But then the assembly that we call BIO is the single strand, and the double strand is called XTAL. Any ideas for a way to treat this? NOPREDs everywhere? A warning?