DNA assembly scoring problems

eppic-team / eppic

Evolutionary protein-protein interface classifier

http://eppic-web.org


DNA assembly scoring problems #161

Open josemduarte opened 7 years ago

josemduarte commented 7 years ago

In DNA structures we can't use any of our indicators for contacts between nucleotide chains; even geometry doesn't apply to them. For interface scoring we solved this by assigning nopreds to contacts between nucleotide chains. For assembly scoring, however, we have no solution at the moment: e.g. an assembly of 2 chains of double-stranded DNA is called XTAL. Can we do better than that?

A good example is 2rt8 (NMR), now on the dev server: it has only 1 interface, between the 2 strands of DNA, which we call nopred. But then the assembly that we call bio is the single strand, while the double strand is called xtal. Any ideas for a way to treat this? Nopreds everywhere? A warning?

lafita commented 7 years ago

As the name EPPIC stands for protein-protein interface classifier, I would say that it is not that bad if we do not provide a call for nucleotide assemblies (I mean NOPREDs everywhere).

We can probably identify double-stranded DNA helices easily and produce a BIO call for them, but making a call for other, less clear cases would require an extension of the algorithm together with some benchmarking.

lafita commented 7 years ago

What happens for nucleotide-protein interfaces?

josemduarte commented 7 years ago

Yes, I'd definitely go for NOPREDs as the best solution. I wouldn't even bother calling the DNA double strands bio; the nopred offers a more honest assessment.

Note that for nucleotide-protein interfaces we are able to score them based on the protein side only.

In any case, for assemblies we'd need to catch those that are made up exclusively of nucleotide chains and assign NOPREDs to them. For assemblies with mixed protein/nucleotide chains, in principle we can score them based on at least some of the interfaces.
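A minimal sketch of that check, in Python for brevity. The `Call` enum, the `"protein"`/`"nucleotide"` chain labels and both function names are hypothetical and are not taken from the EPPIC code base:

```python
from enum import Enum


class Call(Enum):
    BIO = "bio"
    XTAL = "xtal"
    NOPRED = "nopred"


def is_nucleotide_only(chain_types):
    """Return True if the assembly has chains and all of them are nucleotide chains."""
    return len(chain_types) > 0 and all(t == "nucleotide" for t in chain_types)


def assembly_call(chain_types, scored_call):
    """Override the scored call with NOPRED for all-nucleotide assemblies.

    chain_types: list of "protein" / "nucleotide" labels (assumed representation).
    scored_call: the Call the normal assembly scoring would have produced.
    """
    if is_nucleotide_only(chain_types):
        return Call.NOPRED
    # Mixed or protein-only assemblies keep whatever the scoring decided.
    return scored_call
```

With this, a double-stranded DNA assembly would come out as `Call.NOPRED` whatever its score, while a mixed protein/DNA assembly would keep its scored call.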

sbliven commented 7 years ago

I'd go so far as to say that we should not generate assemblies that include an all-NOPRED interface. Then it's like we ignore nucleotide-nucleotide interfaces completely.
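As a rough illustration of that filtering, assuming each assembly is described by the set of interface cluster ids it engages (the data layout and the function name are hypothetical, not EPPIC's API):

```python
def drop_unscorable_assemblies(assemblies, nopred_clusters):
    """Keep only assemblies that engage no all-NOPRED interface cluster.

    assemblies: dict mapping assembly id -> set of engaged interface cluster ids.
    nopred_clusters: set of cluster ids whose interfaces are all NOPRED
                     (e.g. nucleotide-nucleotide contacts).
    """
    return {
        aid: clusters
        for aid, clusters in assemblies.items()
        if not (clusters & nopred_clusters)
    }
```

If the only interface cluster is the NOPRED one between the two DNA strands, just the assembly that engages no interfaces survives.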

lafita commented 7 years ago

I agree with @sbliven. It would be a good solution to ignore them, since we cannot score them properly.

This issue has a related one: what do we do when more than one high-scoring assembly is present in the crystal? Do we call all of them BIO? I have implemented a solution where I give all the high-scoring assemblies the same call (NOPRED, but it can be changed), and call the other, lower-scoring ones XTAL.

@sbliven also proposed choosing the assembly with the lowest stoichiometry as BIO and the others as XTAL. I will create a pull request with my solution, but it can be changed to whatever we agree on.

The example that made me think about it, although the tie is due to the DNA interface, is 2rt8:

# Topologically valid assemblies in 2rt8
 id   Interf cluster ids       Size   Stoichiometry        Symmetry      Score    Predicted by
  1                   {}        1,1             A,A           C1,C1       0.50                
  2                  {1}          2             A B              C1       0.50            pdb1

These cases should be very rare though.
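For discussion, the two tie-handling options could be sketched like this in Python. The function names, the score tolerance, and reading "lowest stoichiometry" as "smallest assembly size" are all assumptions here; this is not the code from the pull request:

```python
def top_scoring(scores, tol=1e-9):
    """Return the ids of all assemblies whose score ties with the best one."""
    best = max(scores.values())
    return [aid for aid, s in scores.items() if abs(s - best) <= tol]


def call_ties_as_nopred(scores, tol=1e-9):
    """Option 1: a unique best assembly is bio; tied best ones are all nopred."""
    if not scores:
        return {}
    top = top_scoring(scores, tol)
    top_call = "bio" if len(top) == 1 else "nopred"
    return {aid: (top_call if aid in top else "xtal") for aid in scores}


def call_smallest_as_bio(scores, sizes, tol=1e-9):
    """Option 2: among tied best assemblies, the smallest one is bio, the rest xtal."""
    if not scores:
        return {}
    top = top_scoring(scores, tol)
    # Ties on size are broken by assembly id, purely to keep the result deterministic.
    winner = min(top, key=lambda aid: (sizes[aid], aid))
    return {aid: ("bio" if aid == winner else "xtal") for aid in scores}
```

Using the two 2rt8 assemblies above (both scoring 0.50, with sizes 1 and 2), option 1 calls both nopred, while option 2 calls assembly 1 bio and assembly 2 xtal.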

lafita commented 7 years ago

I think that with the current solution (see PR #166) we can make the release, so I will assign the further discussion of this issue to the 3.1 milestone.

lafita commented 7 years ago

We said we could implement a very naive scoring for nucleotide-only interfaces. We could calculate the number of base pairs between the two chains (using the distances between H-bonding atoms) and then give probability one if there are more than, say, two bp, or probability zero if there are fewer.
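A very rough Python sketch of what this could look like. Several things here are assumptions for illustration only: a base pair is detected from a single Watson-Crick hydrogen bond (purine N1 to pyrimidine N3), the 3.5 Å cutoff is a guess, residues are passed as `(name, {atom_name: (x, y, z)})` tuples, and exactly two base pairs is treated as "not enough" because the threshold above is "more than two":

```python
import math

# Purine -> allowed pyrimidine partners (canonical Watson-Crick pairs only).
COMPLEMENT = {"A": {"T", "U"}, "G": {"C"}}


def _base(name):
    """Map DNA residue names (DA, DT, DG, DC) to one-letter bases."""
    return name[1:] if len(name) == 2 and name.startswith("D") else name


def is_base_pair(res1, res2, cutoff=3.5):
    """True if a purine N1 and a complementary pyrimidine N3 are within cutoff (in Å)."""
    for pur, pyr in ((res1, res2), (res2, res1)):
        partners = COMPLEMENT.get(_base(pur[0]))
        if partners and _base(pyr[0]) in partners:
            n1 = pur[1].get("N1")
            n3 = pyr[1].get("N3")
            if n1 is not None and n3 is not None and math.dist(n1, n3) <= cutoff:
                return True
    return False


def count_base_pairs(chain_a, chain_b, cutoff=3.5):
    """Count base pairs between two chains; each residue pairs at most once."""
    used = set()
    count = 0
    for res_a in chain_a:
        for j, res_b in enumerate(chain_b):
            if j not in used and is_base_pair(res_a, res_b, cutoff):
                used.add(j)
                count += 1
                break
    return count


def nucleotide_interface_probability(chain_a, chain_b, min_bp=2, cutoff=3.5):
    """Naive score: 1.0 if there are more than min_bp base pairs, else 0.0."""
    return 1.0 if count_base_pairs(chain_a, chain_b, cutoff) > min_bp else 0.0
```

Non-canonical pairs (e.g. G-U wobble) and modified bases are ignored by this sketch, so it would undercount in RNA structures; whether that matters would need some benchmarking.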