Solve the protein box split glitch

Jesse-V commented 11 years ago

Occasionally, FAHViewer has a glitch where a piece of the atom appears completely offset from the rest. https://fah-web.stanford.edu/projects/FAHClient/ticket/975

There's a kind of bounding box that is set around the protein. Apparently, the core uses Fourier transforms as part of its force calculations, and in order fit with the equations, the bounding box is considered periodic. That is, it's mathematically repeated in all directions to infinity. FAHClient and FAHViewer attempt to identify a single instance of this protein in real space. This split is a glitch with this algorithm: one piece of the protein is sort of wrapped around the box. It's supposed to be entirely contained in the box, like |---abcde--| but instead it's like |bcde-----a| which is mathematically the same to the Fourier transforms.

An approach to fixing this would be: Out of all the atoms, pick an atom, call it A. Assign a subset index to A, and recurse from A through all of its neighbors that are within that bond distance. Assign all of them to that same subset index. When the recursion completes, all of the atoms that can be connected to A are in that subset. Take the atoms that are outside of this, and repeat this algorithm until all the subsets are identified. The main set is the largest of these subsets, and the others are glitched. Move each of these sets of glitched atoms by the distance of the protein bounding box until they are in line with the main set.

The solution involves a subproblem: finding the list of atoms that within the bond distance to a particular atom, for all atoms. Obviously this can be done naively in O(n^2) time, but apparently there's a better way.

< proteneer_osx> you could just do what we do, make a neighbor list < proteneer_osx> ok we don't quite build a standard nighbor list < proteneer_osx> but you don't need octrees or kd-trees < proteneer_osx> a neighbor list can be built in O(N f(X) ) time < proteneer_osx> by simply hashing the coordinates into a bucket in 3space < proteneer_osx> where the bucket dimensions are X by X by X < proteneer_osx> ya then for each point, you know which bucket it belongs to, and you traverse its 3^3 neighbors (including itself) < Jesse_V> what's your hash function? < proteneer_osx> you can just do < proteneer_osx> map<vector bucket, vector pointIndices> < proteneer_osx> let STL take care of it for you < proteneer_osx> its an RB tree technically speaking < proteneer_osx> but its fast enough for all your needs < Jesse_V> I've used multimaps and maps. Way handy. < proteneer_osx> where vector is of size 3 < proteneer_osx> denoting the x,y,z coords < proteneer_osx> you need to overload a comp operator probably < proteneer_osx> you could actually just make your own hash function to 1D pretty easily

Divide the atoms into a grid of the bond distance. Thus, every atom is mapped to some bucket, or rather, each bucket is multimapped to a set of atoms. Then each atom in that bucket can check the surrounding 27 (3^3) buckets, including itself, for neighbors that are within that distance. Obviously using sqrt should be avoided; dealing with distance^2 should be fine.

Jesse-V commented 11 years ago

screenshot from 2013-08-28 10 13 23 As suspected, this is a problem with the data returned by FAHClient. Here's an example of it occuring for a CPU WU my laptop is working on.

proteneer commented 11 years ago

There's a fairly simple solution provided you have access to the periodic box size. We first ask the question, what is upper bound on how far apart two points can be?

First, let me introduce two distance metrics:

standard_euclidean() and periodic_euclidean()

the standard_euclidean is computed as:

double standard_dist(double3 a, double3 b) {
    double dx, dy, dz;
    dx = a.x-b.x;
    dy = a.y-b.y;
    dz = a.z-b.z;
    return sqrt(dx*dx+dy*dy+dz*dz);
}

the periodic_euclidean is computed as:

// calculates the shortest distance between two points given a period
inline double periodicDifference(double x1, double x2, double period) {
    double diff = x1-x2;
    diff -= floor(diff/period+0.5)*period;
    return diff;
}

double periodic_euclidean(double3 a, double3 b, double3 boxSize) {
    double dx, dy, dz;
    dx = periodicDifference(a.x,b.x,boxSize.x);
    dy = periodicDifference(a.y,b.y,boxSize.y);
    dz = periodicDifference(a.z,b.z,boxSize.z);
    return sqrt(dx*dx+dy*dy+dz*dz);
}

One upper bound for how far two points can be is:

max_d = max bond length, which is no more than 2 angstroms, or 0.2 nm.

You can create a connectivity tree(*) that can be used to represent the protein structure since we have bond information. The node of the tree are atoms, the edges represent connectivity. For each edge in the tree we compute the standard euclidean distance, which will be much greater than max_d for certain edges. You basically end up with a tree, with a bunch of bad edges. These bad edges need to be fixed via traversal from the bottom of the tree (DFS followed by backtracking or something).

When you come across a node V with a bad edge leading to the parent V_p, you translate the atom represented by node V, and all the children V_c of V, and all of V_c's children, etc. How do you translate atom_V belonging to node V? Well, you can easily tell where to shift atom_V coordinates to by looking at the periodic difference in each dimension. That is, you want the coordinates of atom_V to be shifted so that the periodic_euclidean is equal to the standard_euclidean.

Note, when unwinding, you want to make sure each node has all of its edges cleaned up before moving up one level higher.

Once you've traversed to the top of the tree, you have a perfect looking molecule

(*) I know there are ring structures, but they are cycles of at most 6 edges, so for all practical purposes we have a tree-like structure.

Jesse-V commented 11 years ago

I switched over to a 3D vector of Buckets for the hashing, and experience an enormous speedup. Here's what it does so far:

... done creating SlotViewer. [concurrent] Analyzing protein for splits... [concurrent] Hashing atoms into buckets of size 2... ... done creating Viewer. [concurrent] ...done hashing into buckets. Took 2.791ms [concurrent] Identifying groups... [concurrent] Assigning 0... [concurrent] ...done assigning group. [concurrent] ...done identifying groups. Took 3.238ms [concurrent] Assembling groups... [concurrent] ...done assembling groups. [concurrent] ...done analyzing protein.

Now that I'm able to identify the pieces in a few milliseconds, all that remains now is to rearrange them and stitch the pieces together. That, unfortunately, is the tricky bit. I'm shoving this down the road a bit past v0.7.

bb30994 commented 11 years ago

Suppose we have a protein that's fully connected in the periodic box. Suppose that in the next snapshot we find two blobs which are the result of passing a plane through the protein and separating their external faces by the dimension of the box. The distances between atoms near each of the planes should give us enough information to guess where both planes are and to estimate the width of the box in that direction. Transforming the atoms of one blob by that distance should place those two faces near enough to each other for us to (re_)establish atomic bonds that do not traverse the entire box (less than 2 Å).

If we have a previous snapshot, I'd think that we could remember the bond list and it would tell us a lot about how to reconnect the blobs -- though we'd have to be working in a Viewing box, not the Periodic box.

The protein can penetrate more than one plane, so the transformations would have to be done recursively for (N-1) blobs.

Jesse-V commented 11 years ago

Hi Bruce. Sorry, I don't quite follow you. What do you mean by "external faces"?

I have access to all the available snapshots, which specify the positions of all the atoms. I also have the bond list. Right now I just analyse snapshot 0, but it's easy enough to check the other snapshots. What procedure do you suggest?

(discussing this here seems like a good choice, I agree)

bb30994 commented 11 years ago

First, I discovered that checkpointState.xml starts with the following: Even though you don't get that information from the API, it does look like the periodic box is a cube (at least for this project) and it would be easy to send the coordinates via the API. Whether that's true for other FahCores is more likely difficult to know.

<?xml version="1.0" ?>

Jesse-V commented 11 years ago

Good to know. I thought that there was a way to ask FAHClient to read a WU file for me, but I don't see it listed in --help when I looked again. Maybe I missed it. If there was such a way, it wouldn't be difficult to parse out that information assuming that it's in the file. If the periodic box is so important, it seems logical that it would be stored.

I was referring to your earlier statements. See my comment above.

bb30994 commented 11 years ago

A "proper" use of V7 is to stick to the API and do not parse any files. That presumes that the API provides the necessary data. For a better understanding of a particular condition and for code debugging, I see nothing wrong with parsing the files but (A) the final code should stick to the API, and (B) if the API doesn't provide the necessary data, a V7 enhancement request is appropriate. In the final analysis, FAHViewer needs the same data that you do so the API needs to supply it.

Suppose the protein is 2D and centered in the box. I'm assuming the bond information is available and it connects all adjacent pairs of atoms.

. . . . . .. a . . . . b... . c . . .d . . . . . . e f . . . . . . . . . . g h . . . . . . . . . i . .j .. . . . . . k . . . .l . . . m . . . . . n Now shift the protein relative to the periodic box (or the box relative to the protein) so that a plane passes between b-d and j-k and another plane passes between a-b and a-c. The atoms on the left will appear an a separate bucket and all will be transformed to be inside a single box. (Spacing may not look right depending on the font used, and this isn't equal spacing as it's rendered on my screen.). OOPS. I spent a long time putting spaces in the image.Now I'll go back and add periods to force a visual grid.

. . . |.. a......|.... . . . | .b . . . .c . . . . . . | . . d|. . . . . . . e . . . .. .| f. . .| . . . . . . . .g . . . . | h . .| . . . . . . . . i . . . . | . . j | . . . . . . k . . . . . .| . . . | l . . . m . . . . . . . | . . . | . . n . . . . . . . . .. | . . . |. . . . . . . . . . . . .. | . . . |..........|...._

Remember that the box is periodic and the protein appears in every box, including the one above and the one to the right so the protein in those boxes protrudes across those planes but you only see one value of the coordinates for each atom so the other boxes are invisible. With the current rendering methodology, you only see the uppercase atoms as three pieces. |.. a......|...a.. | B . . . .C . . . . . . | .b . . c |. . . . . . . E . . . .D|. . . . . . e | . . . . . . . .G . .F. .| . . . . . . .g | . . . . . . . . I . . H .| . . . . . . . i | . . . . . . K . . . . . J| . . . . . k | L . . M . . . . . . . . | l . . m | . . N . . . . . . . . .. |. . n |. . . . . . . . . . . . .. | |..A.......|._.a.._ | .b . . . .c

Now, without seeing the first two images, identify three buckets. Bucket DFHJ is separate from BCDGIKLMN and from bucket A. Check the bonds between d-f and f-h and h-g (which still make sense, visually) and the bonds b-d and j-l which make sense in the first image but do not in the last one -- unless you transform the d-f-h-j data one box-width in the -X direction.to get the second image. Similarly, transform bucket A in the +Y direction Recenter the protein in the Viewing box. Bonds a-b and a-c will also be restored to a rational length and the image will be one piece again.

If we don't know the box size, we'll have to infer where the planes are to know how far to transform those two buckets. We have almost enough data from the lengths of the bonds a-b and a-c and j-l and b-d to make an intelligent guess and unless we magnify that part of the final image, it's probably "good enough" for visual purposes. .

proteneer commented 11 years ago

bruce, I recommend using a pair of code quotes (triple `) to block off your diagrams so it's more visually accessible.

Jesse-V commented 11 years ago

Oh I think I see what you're getting at. You're saying that we if we only know the coordinate of one end of the box, we can deduce the size of the box (and thus the dimensions and location of the whole box) by transforming the atoms and watching how they split up, yes? The problem with that approach is that I can transform the atoms all I want, but I don't know how make them wrap without knowing the box. The information I'm getting from FAHClient is kind of like a black box.

bb30994 commented 11 years ago

D is in bucket # 2 and it's bonded to b in bucket # 1. What is the length of the bond?

The visual off-set is [periodicboxwidth+(dx,dy,dz)] instead of (dx,dy,dz) and we know |(dx,dy,dz)|<2Å

We also know periodicboxwidth >> (dx,dy,dz)

Jesse-V commented 11 years ago

Ah. Well I like your idea, it seems doable. But the trick is to figure out how to transform the atoms such that they wrap around the box. Without knowing the box, I don't think I can do it in my program. It's almost like I need to get the information from FAHClient, transform it, and then feed it back through FAHClient and see what the difference is.

bb30994 commented 11 years ago

Find a protein that's split. Parse the checkpoint file and bucketize the atoms. Pretend you don't know the periodic vector. The bond from atom B to atom D is too long in one direction by some number. So is the bond between J and L, by the same number. Estimate that number as boxSize.x. Similarly the visual bond lengths A-B and A-C is off by boxSize.y. You might have to move a bucket by boxSize.x and/or boxSize.y and/or boxSize.z (actually by K*boxSIze but most likely K belongs to the set {-1,0,1})

.......... | B . . . .C . . . . . . | |. . . . . . . E . . . .D| | . . . . . . . .G . .F. .| | . . . . . . . . I . . H .| | . . . . . . K . . . . .J| | L . . M . . . . . . . . | | . . N . . . . . . . . .. | |. . . . . . . . . . . . .. | |._.A.......|

bb30994 commented 11 years ago

I have now seen two examples of a single atom in Antarctica. Is that siy like the A in my example, or is there a coordinate error in the FAHCore and the API is sending (0,0,0)?

bb30994 commented 11 years ago

bruce, I recommend using a pair of code quotes (triple `) to block off your diagrams so it's more visually accessible. Pair of what??

(edited for clarity by Jesse V)

Jesse-V commented 11 years ago

@bb30994, I don't know where those single atoms are coming from. I'm tempted to just remove all single floating atoms: it's probably not worth it to reattach them, and if we want to take that route, it doesn't really matter where they come from.

One problem is that FAHClient filters out bonds that are too far apart. This means that when I load the information, the split proteins are all interconnected, but there's no bonds between the two groups. In other words, if protein ABCDEF is split into ABC and DEF, there will be bonds among ABC and there will be bonds among DEF, but there are no bonds from ABC to DEF because they are too far apart. If this weren't the case, it would be much easier to stitch things back together.

Jesse-V commented 11 years ago

@bb30994, you may recall the emails I sent earlier to @jcoffland regarding bonds that stretched too far all the way across the protein. I identified the problem as an off-by-one error in my parsing code, and I fixed it. Things now render properly, but that glitch was on my end, and unfortunately in this context, wasn't from FAHClient.

bb30994 commented 11 years ago

In my example, there are two bonds crossing the +X and the -X plane. One connects bucket CEGIKMNL with bucket DFHJ and bucket B. The distance dx1 +- periodicBoxLength.x is a lot bigger than a bond can be.so the problem is extracting the values dx1 from its sum with +-periodicBoxLength.x -- so either we have to guess or the API has to tell us the value of periodicBoxLength.x The value of dx2 +- periodicBoxLength.x gives us a little more information.

Similarly, the bonds from A to B and C cross the external +Y and -Y planes. and dy1+periodicBoxLength.y and dy2+periodicBoxLength.y has to be derived from some rational assumptions about what dy1 and dy2 should be.

Like Proteneer said above, all relocations are by a vector of periodicBoxLength.*. Knowing periodicBoxLength.x or having some estimate of it, you can transform bucket DFHJ periodicBoxLength.x to the left and similarly, you can transform bucket A down by periodicBoxLength.y, reconstructing bonds with realistic lengths. The protein will no longer be inside the periodicbox, but it will be connected.and the long bonds that have been suppressed will be rational lengths. To view it, you have to create your own viewing window but that's easy now that the protein is fully connected.

I've asked Joseph to pass data for the long bonds through the API in a way that still tells FAHViewer not to display them. We need to know all of the vector lengths that contain the added term of periodicBox(x,y,z). because that's how we know how to reconnect the buckets: by transforming them by one or more of the (see 11 posts above this one.)

If we do need to make a guess, do we know anything more about the bond length? Is there a minimum length? Is the magnitude the same as it was before the protein was split by the plane? Is the box always cubical or was the example I pulled just a special case where all three vectors happened to have length 5.593071324260671?

bb30994 commented 11 years ago

If you view the protein in the box & it's all one bucket, there's no problem but that can change in the next snapshot. I think you may need to repeat the analysis at every snapshot. If there are 2 pieces, it's because a plane of the box intersected the protein, creating 2 faces, splitting it in two with those two faces attached to external faces of the periodic box. Those two faces need to be transformed so they're visually joined again, The bonds that cross the plane of the box will be the long ugly ones Joe inhibited because they reach almost completely across the box. In my 2D example, the top plane cuts bond A-B and bond A-C so th ends of those two bonds are near the top and bottom planes of the box -- & visually they still need to me near each other.

Bruce

Jesse-V commented 11 years ago

That makes sense. It's easy to run the analysis for every snapshot, and we'd have to fix each snapshot too once we have enough information to do so. The amount of additional time required is clearly then linear to the number of snapshots, which is rarely more than a few dozen or so, so it's still pretty fast.

© Githubissues.

Githubissues is a development platform for aggregating issues.

Jesse-V / Folding-Atomata

Solve the protein box split glitch #26