GarmanGroup / RABDAM

Identification of specific radiation damage in MX structures using the BDamage and Bnet metrics

Binning #11

Closed. kls93 closed this issue 7 years ago.

kls93 commented 8 years ago

As you both know, I would like to change how the programme bins the atoms, to prevent wildly different bin sizes. Unless the Bfactor values in each bin are normally distributed, the Bdamage value of an atom will vary with the number of atoms in its bin, particularly for the atoms with the most extreme Bfactor values in that bin.

There are several ways I could do this, but my current preference is to drop the idea of binning altogether and instead calculate Bdamage from a moving average: I would calculate the packing density of every atom, order the atoms by packing density, then calculate the Bdamage value of each atom as the ratio of its Bfactor to the average Bfactor of its neighbours in the ordered list, taken over a window of e.g. 20 atoms (or perhaps a number based upon the total number of atoms in the structure being considered - I'd have to determine the best window size empirically). A potential problem with this is the treatment of the extreme values in the packing density list, since these values will not sit at the centre of the window; hopefully this shouldn't affect the Bdamage values calculated for these atoms to too great an extent. I currently prefer this idea because it will also prevent the programme from returning artefactually large/small differences in Bdamage value for atoms at the margins of adjacent bins.
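For illustration only, here is a minimal Python sketch of the sliding-window idea described above (this is not the RABDAM implementation; the `packing_density`/`bfactor` attribute names and the 21-atom window are assumptions, and the window is simply truncated at the ends of the packing-density ordering rather than dealing with the endpoint problem mentioned above):

```python
def bdamage_sliding_window(atoms, window=21):
    """Sketch: Bdamage of each atom as the ratio of its Bfactor to the mean
    Bfactor of the atoms closest to it in packing density (moving average)."""
    half = window // 2
    # Order the atoms by packing density
    ordered = sorted(atoms, key=lambda a: a.packing_density)
    bdamage = {}
    for i, atom in enumerate(ordered):
        # Neighbouring atoms in the packing-density ordering; the window is
        # simply truncated at either end of the list in this sketch
        lo = max(0, i - half)
        hi = min(len(ordered), i + half + 1)
        mean_b = sum(a.bfactor for a in ordered[lo:hi]) / (hi - lo)
        bdamage[atom] = atom.bfactor / mean_b
    return bdamage
```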

Do you think this is a good / terrible idea?

P.S. (Sorry to keep bothering you both, I just don't want to make fundamental changes to how the programme operates without you being happy with them)

JonnyCBB commented 8 years ago

I've been thinking, and I think it's a great idea to switch to moving-window averaging rather than binning. Have you spoken to Elspeth about this? She'll need to know. Since moving averages are pretty common (particularly in time series models, e.g. financial markets), I imagine the 'problem' with the "extreme" values will have been solved/patched by someone already, so I would take a look at how others have done it before you potentially reinvent the wheel, so to speak. Also, regarding determining the window size, you have to define what "best" is, as I'm sure you're already aware. Perhaps you could choose a window size such that the Bdamage of highly damaged atoms is significantly different from that of the other atoms (not necessarily statistically significant). Again, that needs defining (first suggestion: make the summed absolute/squared difference between the Bdamage of the damaged atoms and the maximum Bdamage of the undamaged atoms as large as possible - but the flaw is that the maximum Bdamage of the undamaged atoms may be abnormally high, so this may not be robust).

But I like your idea, I think it's great :+1:
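To make the first suggestion above concrete, a hypothetical scoring function for comparing candidate window sizes might look like the sketch below (all names are assumptions; it presumes a validation structure in which the damaged atoms are already known, and it inherits the robustness caveat noted above):

```python
def separation_score(bdamage_damaged, bdamage_undamaged):
    """Summed squared difference between the Bdamage values of known damaged
    atoms and the maximum Bdamage among the undamaged atoms (larger = better
    separated). Not robust if the undamaged maximum is an outlier."""
    threshold = max(bdamage_undamaged)
    return sum((b - threshold) ** 2 for b in bdamage_damaged)

# A window size could then be chosen by maximising this score over a set of
# candidate sizes, recomputing the Bdamage values for each candidate.
```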

kls93 commented 8 years ago

Thanks Jonny - I'm relieved you like it! I spoke to Elspeth about it on Friday and she liked the idea as well, so I'm good to go ahead and change it.

kls93 commented 8 years ago

Hi all,

I’ve changed the programme so that the binning is performed via a sliding window; I now need to determine the “best” window to use. With respect to this, I would value your opinions (if/when you have time!) on the following issues:

  1. Currently I solve the problem of extreme values (the endpoint problem) by effectively binning these atoms (see the sketch after this list). E.g. with a window size of 21, I calculate the Bdamage value of each of the 11 atoms with the lowest packing densities in the macromolecule as the ratio of its Bfactor to the average Bfactor of the 21 atoms with the lowest packing densities; I use a similar approach for the 11 atoms with the highest packing densities, whilst for all other atoms I use a sliding window with the atom in question at its centre. However, I could instead use predicted average Bfactors (based upon extrapolation of the relationship between packing density and Bfactor) to calculate the Bdamage values of the endpoint atoms - this is how endpoints are typically dealt with when using a moving average. I’m reluctant to do this, though: if I plot packing density against Bfactor, then although they show the expected general inverse relationship, there is no clear, consistent relationship between the two that I could extrapolate, so I worry that this approach would introduce rather than remove error. So my question is: do you think my current approach to the endpoint problem is acceptable, should I switch to extrapolation, or are both solutions unsatisfactory and I should return to the binning approach?
  2. I can either use a window of fixed size (such that the total number of windows increases in proportion to the number of atoms in the macromolecule), or use a fixed number of windows (such that the size of each window varies with the total number of atoms in the macromolecule). Currently I favour a fixed number of windows: if we assume that the range of packing densities is independent of protein size, then with a constant window size, as the size of the protein increases the atoms compared within a window become increasingly similar in packing density. (Note that the relationship between packing density and Bfactor is not linear, so large changes in window size would likely have a non-negligible effect upon the average Bfactor calculated from the atoms within the window.) However, an advantage of a fixed window size is that the extent to which an extreme value affects the average Bfactor calculated from the window remains constant between proteins. I would therefore like to use a fixed number of windows, but to set a lower limit on the window size, to prevent the average Bfactor values calculated from windows taken across very small proteins from being highly susceptible to extreme individual Bfactor values. Do you think this is a good / terrible idea?
  3. To determine the “best” window size, I have tested several candidate fixed-number window sizes (namely 1, 2, 3, 4 and 5% of the total number of atoms for which Bdamage is calculated within the macromolecule) on the Nanao et al. radiation damage series (12 structures in total: 6 different proteins, with a ‘low’ and a ‘high’ damage dataset collected from each for use in RIP phasing). To compare window sizes, for each window size I have compared the global Bdamage values (for convenience - I could use another measure of specific radiation damage) of equivalent low and high damage datasets (i.e. those collected from the same protein). As might be expected, given that Bdamage successfully identified sites of specific radiation damage between the low and high datasets of this series with the binning approach (Markus used the Nanao et al. series to validate the binning version of Bdamage), all of the tested window sizes work; furthermore, across the range tested there is no systematic relationship between window size and the size of the difference in global Bdamage detected between equivalent low and high datasets. I could try a wider range of window sizes, but given that all of those tested so far work, I am tempted not to hunt for the ‘optimum’ window size for this radiation damage series (particularly as this ‘optimum’ would almost certainly change if I were to change the radiation damage series used for validation), but rather to use one of the values already tested. Specifically, I am leaning towards 1 or 2% - I think this is a good compromise between minimising the range of packing densities compared within a window and ensuring that, if I were to enforce a minimum window size of e.g. 21, only macromolecules with fewer than 1000 atoms (there aren’t many of these) would be forced to use a larger % window size. Similarly, I am thinking of setting the lower limit of the window size somewhere around 11-21 atoms; such a size should balance ensuring that an extreme Bfactor value within the window does not skew the average Bfactor to too great an extent against ensuring that the window does not form too great a percentage of the total number of atoms for the very small proteins to which this limit is applied. I will test a range of different lower limits to check whether or not this is the case, but assuming there is no obvious best candidate then, as with the window size, I would be tempted simply to pick one of the values that works based upon reasoning such as the above, rather than attempting to determine an ‘optimal’ value which will likely vary slightly depending upon the radiation damage series used to determine it. So, in summary, it seems to me that selecting the exact "best" parameter values is going to be very difficult, and hence it might be better to select values which, whilst perhaps not optimal, do work. However, I know that choosing the window size and/or lower limit in such a manner wouldn’t be very scientific, so I can see why you might disagree with me. I would be interested in your thoughts on this.
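For concreteness, here is a rough sketch of the endpoint treatment described in point 1 above (illustrative only, not the RABDAM source; the 21-atom window and the sorted-list-of-Bfactors input are assumptions):

```python
def bdamage_with_endpoint_binning(bfactors, window=21):
    """bfactors: the atoms' Bfactors, sorted by packing density.
    The first and last (window // 2) + 1 atoms are effectively binned: each is
    compared against the mean Bfactor of the first (or last) `window` atoms,
    rather than against a window centred on itself."""
    n = len(bfactors)
    half = window // 2
    low_mean = sum(bfactors[:window]) / window
    high_mean = sum(bfactors[-window:]) / window
    bdamage = []
    for i, b in enumerate(bfactors):
        if i <= half:              # e.g. the 11 lowest packing density atoms
            mean_b = low_mean
        elif i >= n - 1 - half:    # e.g. the 11 highest packing density atoms
            mean_b = high_mean
        else:                      # centred sliding window
            mean_b = sum(bfactors[i - half:i + half + 1]) / window
        bdamage.append(b / mean_b)
    return bdamage
```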

Thanks very much for all of your help :) Please don’t reply unless you have time!

JonnyCBB commented 7 years ago

What is the status of this now?

kls93 commented 7 years ago

The binning is now performed via a sliding window. The size of the window is 2% of the total number of atoms, or 15 atoms, whichever is larger. These values have been selected somewhat arbitrarily; however, I have tried varying the window size (both as a % of the total number of atoms and as a fixed size), and reassuringly I don't see substantial changes in BDamage.
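Expressed as code, that choice amounts to something like the following (a sketch, not the actual RABDAM source):

```python
def window_size(n_atoms):
    """Sliding-window size: 2% of the number of atoms, or 15, whichever is larger."""
    return max(int(round(0.02 * n_atoms)), 15)

# e.g. a 5000-atom structure gets a window of 100 atoms, whereas a 500-atom
# structure falls back to the minimum window of 15 atoms.
```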

My view now a year later is that I am not inclined to spend significant further time optimising these parameter values, given that a) the current values work and b) the optimal values will differ between proteins.

JonnyCBB commented 7 years ago

That sounds cool to me. If the current values work for the use cases, then the ROI on time spent optimising the parameters will be tiny. If anyone else experiences problems then we can reopen the issue. I'll close this for now then.