kls93 closed this 7 years ago
I've been thinking, and I think it's a great idea to switch to moving-window averaging rather than binning. Have you spoken to Elspeth about this? She'll need to know. Since moving averages are pretty common (particularly in time series models, e.g. financial markets), I imagine that the 'problem' with the "extreme" values will have been solved/patched by someone already. So I would take a look at how others have done it before you potentially reinvent the wheel, so to speak. Also, regarding determining the window size, you have to define what "best" is, as I'm sure you're already aware. Perhaps you could choose a window size such that the Bdamage of highly damaged atoms is significantly different from that of the others (not necessarily statistically significant). Again, that needs defining (first suggestion: the summed absolute/squared difference between the Bdamage of the damaged atoms and the maximum Bdamage of the undamaged atoms is as large as possible - but the flaw is that the maximum Bdamage of undamaged atoms may be abnormally high, so it may not be robust).
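The "first suggestion" above could be scored along these lines. This is a hypothetical sketch only: the split into known-damaged vs undamaged atoms, the function name, and the choice of squared differences are assumptions taken from the comment, not an agreed definition of "best".

```python
def separation_score(damaged_bdamage, undamaged_bdamage):
    """Score a candidate window size by how far the Bdamage values of
    known-damaged atoms sit above the maximum Bdamage of the undamaged
    atoms (summed squared difference; larger = better separation).

    As noted above, this is fragile if the undamaged maximum happens to
    be an abnormally high outlier.
    """
    threshold = max(undamaged_bdamage)
    return sum((b - threshold) ** 2 for b in damaged_bdamage)
```

One would compute this score for each candidate window size and pick the size that maximises it; a more robust variant might replace `max()` with a high percentile of the undamaged distribution.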
But I like your idea, I think it's great :+1:
Thanks Jonny - I'm relieved you like it! I spoke to Elspeth about it on Friday and she liked the idea as well, so I'm good to go ahead and change it.
Hi all,
I’ve changed the programme so that it bins by sliding window; I now need to determine the “best” sliding window to use. With respect to this I would value your opinions (if / when you have time!) on the following issues:
Thanks very much for all of your help :) Please don’t reply unless you have time!
What is the status of this now?
The binning is performed via sliding window. The size of the sliding window is 2% of the total number of atoms, or 15, whichever is larger. These values have been selected somewhat arbitrarily. I have, however, tried varying the window size (both as a percentage of the total number of atoms and as a fixed size), and reassuringly I don't see substantial changes in BDamage.
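The window-size rule described above can be sketched as follows (the function name is illustrative; the 2% fraction and floor of 15 are the values reported in the comment):

```python
def window_size(n_atoms, fraction=0.02, minimum=15):
    """Sliding-window size: a fixed fraction (2%) of the total atom
    count, with a floor of 15 atoms so small structures still get a
    usable window."""
    return max(int(round(fraction * n_atoms)), minimum)
```

For a 10,000-atom structure this gives a window of 200 atoms, while structures of 750 atoms or fewer fall back to the 15-atom floor.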
My view now, a year later, is that I am not inclined to spend significant further time optimising these parameter values, given that a) the current values work and b) the optimal values will differ between proteins.
That sounds cool to me. If they work for the use cases, then the ROI on time spent optimising parameters would be tiny. If anyone else experiences problems then we can reopen the issue. I'll close this for now then.
As you both know, I would like to change how the programme bins the atoms, to prevent wildly different bin sizes. Unless the distribution of Bfactor values in each bin is normal, the Bdamage value of an atom will vary with the number of atoms in its bin, particularly for the atoms with the most extreme Bfactor values in the bin.
There are several ways I could do this, but my current preference is to drop the idea of binning altogether and instead calculate Bdamage from a moving average: I would calculate the packing density of every atom, order the atoms by packing density, then calculate each atom's Bdamage as the ratio of its Bfactor to the average Bfactor of the neighbouring atoms in the list within a window of e.g. 20 (or perhaps a window size based upon the total number of atoms in the structure being considered - I'd have to determine the best window size empirically). A potential problem with this is the treatment of the extreme values in the packing density list (these values will not be at the centre of the window), but hopefully this shouldn't affect the Bdamage values calculated for these atoms to too great an extent. I currently prefer this idea because it will additionally prevent the programme from returning artefactually large/small differences in Bdamage value for atoms at the margins of adjacent bins.
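A minimal sketch of the moving-average idea described above, assuming atoms are supplied as (packing density, Bfactor) pairs. The function name, the fixed default window of 20, and the decision to clip the window at the ends of the list (so extreme-density atoms use an off-centre window) are all illustrative choices, not the programme's actual behaviour:

```python
def bdamage_moving_average(atoms, window=20):
    """Compute a Bdamage-style value for each atom as the ratio of its
    Bfactor to the mean Bfactor of atoms with similar packing density.

    atoms: list of (packing_density, bfactor) tuples.
    The window is clipped at either end of the density-ordered list, so
    the extreme-density atoms are averaged over an off-centre window.
    """
    # Order atom indices by packing density, remembering original positions
    order = sorted(range(len(atoms)), key=lambda i: atoms[i][0])
    half = window // 2
    bdamage = [0.0] * len(atoms)
    for rank, idx in enumerate(order):
        # Neighbouring atoms (by packing density) within the window
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        neighbours = [atoms[j][1] for j in order[lo:hi]]
        bdamage[idx] = atoms[idx][1] / (sum(neighbours) / len(neighbours))
    return bdamage
```

With uniform Bfactors every atom's value is 1.0, and a single atom with an elevated Bfactor stands out above its window average, which is the behaviour the metric is after. Because the window slides continuously, there are no bin boundaries at which two similarly packed atoms can receive artefactually different values.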
Do you think this is a good / terrible idea?
P.S. (Sorry to keep bothering you both, I just don't want to make fundamental changes to how the programme operates without you being happy with them)