First I thought substitution rates were being calculated incorrectly in: https://github.com/PoonLab/HexSE/blob/70b6341b8ea9de7356b4ee91d8a93a4dc8b902fb/hexse/sequence_info.py#L351-L404
But printing nucleotide 2450, which didn't mutate, I see that substitution rates for each nucleotide are being calculated properly:
>>>>> T 2450 {'A': 0.0}
>>>>> T 2450 {'A': 0.0, 'C': 0.0032697286483108056}
>>>>> T 2450 {'A': 0.0, 'C': 0.0032697286483108056, 'G': 0.00011542378622683285}
Exploring both the probability_tree and the event_tree, I notice some inconsistencies:
>>>>>>>>>>>>>>>>>>>>> Probability tree {'to_nt': {'A': {'number_of_events': 1270, 'from_nt': {'A': None, 'T': {'prob': 0.18749999999999997, 'cat': {'mu1': {'prob': 0.15865525393132512, 'omega': {((0, 0, 0, 1, 0),): {'prob': 0.0189688736429855, 'number_of_events': 3}, ((0, 1, 0, 0, 0),): {'prob': 6.120879748328122e-07, 'number_of_events': 5}, ((1, 0, 0, 0, 0),): {'prob': 7.818002314774208e-15, 'number_of_events': 3}, ((0, 0, 1, 0, 0),): {'prob': 7.017537737073388e-23, 'number_of_events': 2}, ((0, 0, 0, 0, 1),): {'prob': 7.017537737073388e-23, 'number_of_events': 4}, ((0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 0)): {'prob': 4.5132776109673465e-57, 'number_of_events': 4}, ((0, 0, 1, 0, 0), (1, 0, 0, 0, 0, 0, 0)): {'prob': 1.0701792730523098e-103, 'number_of_events': 5}, ((0, 0, 1, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {'prob': 2.6236224917117375e-137, 'number_of_events': 3}, ((0, 1, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): {'prob': 2.722953835690513e-184, 'number_of_events': 10}, ((1, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {'prob': 4.706221643657749e-232, 'number_of_events': 5}, ((0, 0, 1, 0, 0), (0, 0, 1, 0, 0, 0, 0)): {'prob': 6.7715449301244425e-307, 'number_of_events': 1}, ((0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): {'prob': 0.0, 'number_of_events': 2}, ((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {'prob': 0.0, 'number_of_events': 6}, ((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)): {'prob': 0.0, 'number_of_events': 14},
For example, this entry has a vanishingly small probability despite having 5 events:
((0, 1, 0, 0, 0),): {'prob': 6.120879748328122e-07, 'number_of_events': 5}
and some branches carry a None value. I think that happens when I remove the key once no more nucleotides are assigned to that branch, but I haven't been able to prove this yet. Likewise, the following entry has probability zero, which should not happen for a branch with 6 events:
((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {'prob': 0.0, 'number_of_events': 6}
>>>>>>>>>>>>>>>>>>>>>EVENT TREE: {'to_nt': {'A': {'from_nt': {'A': None, 'T': {'category': {'mu1': {((0, 0, 0, 1, 0),): [t3, t75, t97, t108, t2, t127], ((0, 1, 0, 0, 0),): [t12, t51], ((0, 0, 1, 0, 0),): [t13, t55, t74, t85, t128], ((1, 0, 0, 0, 0),): [t32, t136], ((0, 0, 0, 0, 1),): [t62, t65, t95, t137], ((0, 0, 1, 0, 0), (0, 0, 0, 0, 1, 0, 0)): [t166, t336, t381, t529, t593], ((0, 0, 1, 0, 0), (1, 0, 0, 0, 0, 0, 0)): [t171, t280, t369, t394, t677], ((0, 1, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)): [t178, t379, t385, t832], ((1, 0, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): [t181, t423, t615, t637, t811], ((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)): [t191, t266, t269, t290, t302, t347, t365, t392, t443, t461, t473, t533, t539, t581, t605, t626, t662, t713, t722, t809, t821], ((0, 0, 0, 1, 0), (0, 0, 0, 1, 0, 0, 0)): [t210, t267, t750], ((1, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): [t211, t212, t277, t516, t730, t771], ((1, 0, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): [t214, t271, t360, t621], ((1, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)): [t219, t255, t261, t582], ((0, 1, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)): [t238, t418, t673, t714, t732, t736], ((0, 0, 1, 0, 0), (0, 1, 0, 0, 0, 0, 0)): [t248, t553, t640, t648, t654], ((0, 0, 1, 0, 0), (0, 0, 0, 1, 0, 0, 0)): [t258, t332, t421, t484], ((1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)): [t275, t685, t756], ((0, 0, 1, 0, 0), (0, 0, 0, 0, 0, 1, 0)): [t278, t789], ((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): [t318, t627], ((0, 0, 0, 1, 0), (0, 0, 0, 0, 0, 1, 0)): [t327, t345, t702, t778, t801], ((0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): [t348, t415, t741, t799, t826], ((0, 1, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)): [t375, t397, t703, t808], ((0, 0, 0, 1, 0), (0, 1, 0, 0, 0, 0, 0)): [t393, t767, t790], ((0, 1, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): [t400, t555], ((1, 0, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)): [t406], ((0, 0, 1, 0, 0), (0, 0, 1, 0, 0, 0, 0)): [t432, t664, t692, t745, t815], ((0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 0, 0)): [t435, t549, t663], ((0, 0, 0, 1, 0), (0, 0, 1, 0, 0, 0, 0)): [t561, t570, t691, t739], ((0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 0)): [t682], ((1, 0, 0, 0),): [t1822, t1833, t1837, t2864, t2889, t2988, t3003, t3036, t3066], ((0, 0, 0, 1),): [t1823, t1847, t2878, t3052, t3094, t3097, t3148], ((0, 0, 1, 0),): [t1825, t1831, t2862, t2885, t2886, t2888, t2915, t2916, t2935, t2942, t3014, t3116, t3140]
I notice that in the event tree, the nucleotides at the tips are grouped by regions of the genome: [t178, t379, t385, t832], [t3, t75, t97, t108, t2, t127], or [t1822, t1833, t1837, t2864, t2889, t2988, t3003, t3036, t3066]. I wonder if this is due to the different omegas for each reading frame:
orfs:
2849,3182:
omega_classes: 3
omega_shape: 1.5
omega_dist: gamma
0,837:
omega_classes: 4
omega_shape: 1.7
omega_dist: gamma
3173,3182:
omega_classes: 5
omega_shape: 1.9
omega_dist: gamma
156,837:
omega_classes: 6
omega_shape: 1.2
omega_dist: gamma
1375,1840:
omega_classes: 7
omega_shape: 1.30
omega_dist: gamma
1815,2454:
omega_classes: 3
omega_shape: 1.25
omega_dist: gamma
1853,1922:
omega_classes: 4
omega_shape: 1.87
omega_dist: gamma
1902,2454:
omega_classes: 5
omega_shape: 2.5
omega_dist: gamma
Should omegas be similar for different reading frames, or is it OK to have different parameters of the gamma distribution for each ORF?
Reformatting probability tree example from above to emphasize hierarchy:
{
'to_nt': {
'A': {
'number_of_events': 1270,
'from_nt': {
'A': None,
'T': {
'prob': 0.18749999999999997,
'cat': {
'mu1': {
'prob': 0.15865525393132512,
'omega': {
((0, 0, 0, 1, 0),): {
'prob': 0.0189688736429855,
'number_of_events': 3
},
((0, 1, 0, 0, 0),): {
'prob': 6.120879748328122e-07,
'number_of_events': 5
},
((1, 0, 0, 0, 0),): {
'prob': 7.818002314774208e-15,
'number_of_events': 3
},
((0, 0, 1, 0, 0),): {
'prob': 7.017537737073388e-23,
'number_of_events': 2
},
((0, 0, 0, 0, 1),): {
'prob': 7.017537737073388e-23,
'number_of_events': 4
},
((0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 0)): {
'prob': 4.5132776109673465e-57,
'number_of_events': 4
},
((0, 0, 1, 0, 0), (1, 0, 0, 0, 0, 0, 0)): {
'prob': 1.0701792730523098e-103,
'number_of_events': 5
},
((0, 0, 1, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {
'prob': 2.6236224917117375e-137,
'number_of_events': 3
},
((0, 1, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): {
'prob': 2.722953835690513e-184,
'number_of_events': 10
},
((1, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {
'prob': 4.706221643657749e-232,
'number_of_events': 5
},
((0, 0, 1, 0, 0), (0, 0, 1, 0, 0, 0, 0)): {
'prob': 6.7715449301244425e-307,
'number_of_events': 1
},
((0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): {
'prob': 0.0,
'number_of_events': 2
},
((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {
'prob': 0.0,
'number_of_events': 6
},
((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)): {
'prob': 0.0,
'number_of_events': 14
},
I think there's something wrong with how probabilities are being calculated for the discretized gamma distribution. Is this distribution actually being discretized? (We should be integrating the probability density from the left boundary of a rate category to the right boundary, not taking the pdf at a single point.)
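To make that distinction concrete, here is a minimal sketch (my own illustration, not HexSE code, with arbitrary parameters): a rate category's probability mass is a difference of CDF values, and the masses sum to one across categories, whereas pdf values at single points are densities and need not sum to anything meaningful.

import numpy as np
import scipy.stats as ss

alpha, ncat, scale = 1.7, 4, 0.4                 # arbitrary illustrative values
dist = ss.gamma(alpha, scale=scale)

# Boundaries that split the distribution into ncat equal-probability intervals
bounds = dist.ppf(np.linspace(0, 1, ncat + 1))   # bounds[0] = 0, bounds[-1] = inf

# Category mass = integral of the density over the interval (a CDF difference)
masses = dist.cdf(bounds[1:]) - dist.cdf(bounds[:-1])
print(masses, masses.sum())                      # [0.25 0.25 0.25 0.25] 1.0

# pdf values at single points are densities, not probabilities
print(dist.pdf(bounds[1:-1]))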
When calculating the omega values for each ORF, we are getting the following dict:
{
'+': [
{
'coords': [[2849, 3182]],
'omega_shape': 1.5,
'omega_classes': 3,
'omega_values': [0.1708353283825978, 0.4810288100937172, 1.1481358615121404]
},
{
'coords': [[0, 837]],
'omega_shape': 1.7,
'omega_classes': 4,
'omega_values': [0.17314944359111226, 0.42077177796287807, 0.7209957283986081, 1.4050830500389533]
},
{
'coords': [[3173, 3182]],
'omega_shape': 1.9,
'omega_classes': 5,
'omega_values': [0.1846759438880899, 0.4079605248732454, 0.6343987842679637, 0.9365407795137214, 1.636423967437797]
},
],
'-': []
}
The values do not sum to one because the omega_values are drawn from a discretized gamma distribution, as follows:
import numpy as np
import scipy.integrate
import scipy.stats as ss

def discretize(alpha, ncat, dist):
    """
    Divide a distribution into a number of intervals with equal probability
    and get the mean value of each interval.
    From https://gist.github.com/kgori/95f604131ce92ec15f4338635a86dfb9
    :param alpha: shape parameter
    :param ncat: number of categories
    :param dist: distribution of probabilities, can be a string
    :return: array with ncat values
    """
    # Handle if distribution was specified from input file (parsed as a string)
    if dist == 'ss.gamma' or dist == 'gamma':
        dist = ss.gamma
    elif dist == 'ss.lognorm' or dist == 'lognorm':
        dist = ss.lognorm

    if dist == ss.gamma:
        dist = dist(alpha, scale=0.4)
    elif dist == ss.lognorm:
        dist = dist(s=alpha, scale=0.5)  # scale=np.exp(0.05 * alpha**2)

    # Lower boundary of each of the ncat equal-probability intervals
    quantiles = dist.ppf(np.arange(0, ncat) / ncat)
    rates = np.zeros(ncat, dtype=np.double)
    # Conditional mean of each interval: ncat * integral of x * pdf(x) over it
    for i in range(ncat - 1):
        rates[i] = ncat * scipy.integrate.quad(lambda x: x * dist.pdf(x),
                                               quantiles[i], quantiles[i + 1])[0]
    rates[ncat - 1] = ncat * scipy.integrate.quad(lambda x: x * dist.pdf(x),
                                                  quantiles[ncat - 1], np.inf)[0]
    return rates
How should this function work instead?
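One possibility, offered only as a sketch rather than a definitive fix: take the median of each equal-probability interval (the quantile at the interval's midpoint) instead of numerically integrating for the interval mean. This is the "median" variant of Yang's (1994) discretization; it avoids quadrature entirely, so a misbehaving integral can never produce zero or runaway values. The hardcoded scale=0.4 is carried over from the function above.

import numpy as np
import scipy.stats as ss

def discretize_median(alpha, ncat, scale=0.4):
    """
    Sketch: median-based discretization of a gamma distribution. Returns the
    quantile at the midpoint of each of ncat equal-probability intervals.
    The values are representative rates, not probabilities, so they are not
    expected to sum to one.
    """
    dist = ss.gamma(alpha, scale=scale)
    midpoints = (np.arange(ncat) + 0.5) / ncat   # 0.5/ncat, 1.5/ncat, ...
    return dist.ppf(midpoints)

print(discretize_median(1.7, 4))   # four representative omega values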
Fixing this function, however, does not solve the problem of some events having very low (even zero) probabilities in very specific parts of the sequence.
When we are selecting the mutation, I think that we are somehow also modifying the event tree, due to the implementation of a deepcopy:
Why is there something being raised to the power of something else here? https://github.com/PoonLab/HexSE/blob/ec9adb193d19ab21b9e245316d7421c5405a8bc0/hexse/sequence_info.py#L443
This denominator is a potential cause of vanishingly small probabilities: https://github.com/PoonLab/HexSE/blob/ec9adb193d19ab21b9e245316d7421c5405a8bc0/hexse/sequence_info.py#L140
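Rough arithmetic (my numbers, reading off the total_omegas dump further down, which has roughly 95 keys with values mostly between 0.05 and 2):

# The denominator is then on the order of
#   1 + sum(self.total_omegas.values())  ~  70
# so each omega contributes a factor of roughly omega / 70 < 0.03, and one
# factor per reading frame (or worse, factors accumulated across combos
# without reinitialization) drives the product toward zero:
p = 1.0
for omega in (0.42, 0.17, 0.67):   # hypothetical omegas in three reading frames
    p *= omega / 70.0
print(p)                           # ~1.4e-07 after only three factors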
I think total_omegas contains actual omega values (dN/dS), not probabilities or rates. Wondering what the difference between omegas and self.total_omegas is:
Need to check whether synonymous mutations are being miscalculated, and how synonymous mutations are being stored:
https://github.com/PoonLab/HexSE/blob/36c257f7fee3b760c03f8ba9095584dfee018dcd/hexse/sequence_info.py#L174-L192
I'm wondering if line 175 should be inside the loop; otherwise we keep appending entries to nonsyn_values.
This code block needs better documentation!
@GopiGugan omegas is a list of the keys for the omega values available on the tree:
dict_keys(
[
((0, 0, 1, 0, 0),),
((1, 0, 0, 0, 0),),
((0, 0, 0, 0, 1),),
((0, 1, 0, 0, 0),),
((0, 0, 0, 1, 0),),
((0, 0, 1, 0, 0), (0, 0, 0, 0, 0, 1, 0)),
((0, 0, 0, 1, 0), (0, 0, 0, 0, 0, 1, 0)),
((0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 0, 0)),
((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)),
((1, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)),
((0, 1, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)),
((1, 0, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)),
((0, 0, 1, 0, 0), (0, 1, 0, 0, 0, 0, 0)),
((1, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)),
((1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)),
((0, 0, 0, 1, 0), (0, 0, 1, 0, 0, 0, 0)),
((0, 0, 0, 1, 0), (0, 1, 0, 0, 0, 0, 0)),
((0, 0, 1, 0, 0), (0, 0, 0, 0, 1, 0, 0)),
((0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 0)),
((0, 0, 1, 0, 0), (1, 0, 0, 0, 0, 0, 0)),
((0, 1, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)),
((1, 0, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)),
((0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)),
((0, 0, 0, 1, 0), (0, 0, 0, 1, 0, 0, 0)),
((1, 0, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)),
((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)),
((0, 1, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)),
((0, 0, 1, 0, 0), (0, 0, 0, 1, 0, 0, 0)),
((0, 0, 1, 0),),
((1, 0, 0, 0),),
((0, 1, 0, 0),),
((0, 0, 0, 1, 0), (1, 0, 0, 0)),
((1, 0, 0, 0, 0), (1, 0, 0, 0)),
((0, 0, 0, 0, 1), (0, 1, 0, 0)),
((0, 0, 0, 1, 0), (0, 0, 0, 1)),
((0, 0, 0, 0, 1), (1, 0, 0, 0)),
((0, 0, 1, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0, 0, 1)),
((0, 0, 1, 0), (0, 1, 0, 0, 0, 0)),
((1, 0, 0, 0), (0, 0, 0, 1, 0, 0)),
((0, 0, 0, 1), (0, 0, 0, 0, 0, 1)),
((0, 1, 0, 0), (0, 1, 0, 0, 0, 0)),
((0, 1, 0, 0), (1, 0, 0, 0, 0, 0)),
((1, 0, 0, 0), (0, 1, 0, 0, 0, 0)),
((0, 0, 1, 0), (0, 0, 0, 1, 0, 0)),
((0, 1, 0, 0), (0, 0, 1, 0, 0, 0)),
((0, 0, 1, 0), (0, 0, 0, 0, 1, 0)),
((0, 0, 1, 0), (1, 0, 0, 0, 0, 0)),
((0, 1, 0, 0), (0, 0, 0, 0, 1, 0)),
((1, 0, 0, 0), (0, 0, 1, 0, 0, 0)),
((1, 0, 0, 0), (0, 0, 0, 0, 1, 0)),
((1, 0, 0, 0), (1, 0, 0, 0, 0, 0)),
((0, 1, 0, 0), (0, 0, 0, 1, 0, 0)),
((0, 0, 0, 1),)
]
)
Each tuple has as many positions as the number of omega categories for a given distribution, plus one final position for synonymous substitutions. A one in a given position corresponds to the specific omega value that a nucleotide's rate was multiplied by in order to obtain its substitution rate. A nucleotide might have several omega tuples assigned if it is involved in more than one reading frame.
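For illustration, here is a sketch of decoding such a key back into omega values (the decode helper is hypothetical, and treating the combined multiplier as a per-frame product is my assumption; the four-class omega_values come from the ORF at 0,837 above):

def decode(key, frames):
    """
    Sketch: map a one-hot omega key onto concrete omega values.
    key: tuple of one-hot tuples, one per reading frame the site belongs to.
    frames: list of omega_values lists, one per reading frame.
    """
    multiplier = 1.0
    for one_hot, values in zip(key, frames):
        idx = one_hot.index(1)
        if idx == len(one_hot) - 1:
            continue                  # last slot = synonymous, factor of 1
        multiplier *= values[idx]     # non-synonymous: omega for that class
    return multiplier

# omega_values for the 4-class ORF at 0,837 (from the dict above)
four_class = [0.17314944359111226, 0.42077177796287807,
              0.7209957283986081, 1.4050830500389533]

print(decode(((0, 1, 0, 0, 0),), [four_class]))
# -> 0.42077177796287807, matching total_omegas[((0, 1, 0, 0, 0),)] below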
total_omegas is an actual dictionary with the value for each key:
{
((0, 0, 1, 0, 0),): 0.7209957283986081,
((0, 1, 0, 0, 0),): 0.42077177796287807,
((1, 0, 0, 0, 0),): 0.17314944359111226,
((0, 0, 0, 1, 0),): 1.4050830500389533,
((0, 0, 1, 0, 0), (0, 0, 1, 0, 0, 0, 0)): 0.2853664305928927,
((0, 0, 0, 1, 0), (0, 0, 0, 0, 0, 1, 0)): 1.2609093415549824,
((0, 1, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)): 0.2853664305928927,
((1, 0, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)): 0.2853664305928927,
((0, 1, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)): 0.05771979106727954,
((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): 1.2609093415549824,
((0, 0, 0, 1, 0), (0, 0, 1, 0, 0, 0, 0)): 0.2853664305928927,
((0, 0, 1, 0, 0), (0, 1, 0, 0, 0, 0, 0)): 0.1642586933267217,
((1, 0, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): 0.44049638575624284,
((1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)): 0.05771979106727954,
((0, 0, 1, 0, 0), (1, 0, 0, 0, 0, 0, 0)): 0.05771979106727954,
((0, 1, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)): 0.1642586933267217,
((0, 0, 0, 1, 0), (0, 0, 0, 1, 0, 0, 0)): 0.44049638575624284,
((0, 0, 1, 0, 0), (0, 0, 0, 1, 0, 0, 0)): 0.44049638575624284,
((0, 0, 1, 0, 0), (0, 0, 0, 0, 0, 1, 0)): 1.2609093415549824,
((1, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): 1.2609093415549824,
((0, 0, 0, 1, 0), (0, 1, 0, 0, 0, 0, 0)): 0.1642586933267217,
((0, 0, 1, 0, 0), (0, 0, 0, 0, 1, 0, 0)): 0.6712493575878171,
((0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 0, 0)): 0.6712493575878171,
((0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 0)): 0.05771979106727954,
((0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): 0.44049638575624284,
((1, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)): 0.1642586933267217,
((1, 0, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): 0.6712493575878171,
((0, 1, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): 0.6712493575878171,
((0, 1, 0, 0),): 0.4810288100937172,
((0, 0, 1, 0),): 1.1481358615121404,
((1, 0, 0, 0),): 0.1708353283825978,
((0, 1, 0, 0, 0), (0, 0, 0, 1)): 0.4790470356727463,
((0, 0, 0, 1, 0), (0, 1, 0, 0)): 0.38254280439302735,
((1, 0, 0, 0, 0), (0, 1, 0, 0)): 0.38254280439302735,
((0, 0, 0, 1, 0), (1, 0, 0, 0)): 0.1204513150144901,
((0, 0, 1, 0, 0), (0, 0, 1, 0)): 0.9970058805432849,
((0, 0, 0, 0, 1), (0, 1, 0, 0)): 0.38254280439302735,
((0, 0, 0, 0, 1), (1, 0, 0, 0)): 0.1204513150144901,
((0, 0, 1, 0, 0), (0, 0, 0, 1)): 0.7976433287340154,
((0, 0, 1, 0, 0), (1, 0, 0, 0)): 0.1204513150144901,
((0, 1, 0, 0, 0), (0, 1, 0, 0)): 0.38254280439302735,
((1, 0, 0, 0, 0), (1, 0, 0, 0)): 0.1204513150144901,
((0, 0, 0, 1, 0), (0, 0, 0, 1)): 1.5079291778341823,
((1, 0, 0, 0, 0), (0, 0, 0, 1)): 0.20738045775300373,
((0, 1, 0, 0, 0), (0, 0, 1, 0)): 0.9970058805432849,
((1, 0, 0, 0, 0), (0, 0, 1, 0)): 0.9970058805432849,
((0, 1, 0, 0, 0), (1, 0, 0, 0)): 0.1204513150144901,
((0, 0, 0, 0, 1), (0, 0, 1, 0)): 0.9970058805432849,
((0, 0, 1, 0, 0), (0, 1, 0, 0)): 0.38254280439302735,
((0, 0, 0, 1, 0), (0, 0, 1, 0)): 0.9970058805432849,
((1, 0, 0, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162,
((0, 0, 0, 1, 0), (1, 0, 0, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162,
((1, 0, 0, 0, 0), (0, 1, 0, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998,
((0, 0, 0, 0, 1), (0, 1, 0, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998,
((0, 0, 0, 0, 1), (0, 1, 0, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162,
((0, 0, 0, 0, 1), (0, 0, 1, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495,
((0, 1, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771,
((0, 0, 1, 0, 0), (0, 0, 1, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495,
((0, 0, 0, 1, 0), (0, 0, 0, 1), (0, 0, 0, 0, 0, 1)): 1.5079291778341823,
((1, 0, 0, 0, 0), (0, 0, 1, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998,
((0, 0, 0, 1, 0), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495,
((0, 0, 1, 0, 0), (1, 0, 0, 0), (0, 0, 1, 0, 0, 0)): 0.8730832667988639,
((0, 0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162,
((0, 0, 0, 1, 0), (0, 0, 1, 0), (0, 0, 1, 0, 0, 0)): 0.8730832667988639,
((1, 0, 0, 0, 0), (1, 0, 0, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998,
((0, 1, 0, 0, 0), (0, 1, 0, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495,
((0, 1, 0, 0, 0), (0, 0, 1, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998,
((0, 0, 1, 0, 0), (0, 1, 0, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771,
((0, 0, 0, 0, 1), (1, 0, 0, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998,
((0, 0, 0, 0, 1), (1, 0, 0, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771,
((1, 0, 0, 0, 0), (0, 1, 0, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495,
((0, 1, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0, 0, 1)): 0.4790470356727463,
((0, 0, 0, 1, 0), (0, 0, 1, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771,
((1, 0, 0, 0, 0), (0, 0, 1, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495,
((0, 0, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162,
((0, 0, 0, 0, 1), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495,
((0, 0, 1, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0, 0, 1)): 0.7976433287340154,
((0, 0, 0, 1, 0), (1, 0, 0, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771,
((0, 1, 0, 0, 0), (1, 0, 0, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162,
((1, 0, 0, 0, 0), (0, 0, 1, 0), (0, 0, 1, 0, 0, 0)): 0.8730832667988639,
((1, 0, 0, 0), (1, 0, 0, 0, 0, 0)): 0.1846759438880899,
((0, 1, 0, 0), (0, 0, 1, 0, 0, 0)): 0.8730832667988639,
((1, 0, 0, 0), (0, 0, 0, 0, 1, 0)): 1.636423967437797,
((1, 0, 0, 0), (0, 0, 0, 1, 0, 0)): 0.9365407795137214,
((1, 0, 0, 0), (0, 0, 1, 0, 0, 0)): 0.6343987842679637,
((0, 1, 0, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162,
((0, 0, 1, 0), (1, 0, 0, 0, 0, 0)): 0.1846759438880899,
((0, 0, 1, 0), (0, 1, 0, 0, 0, 0)): 0.4079605248732454,
((1, 0, 0, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771,
((0, 0, 1, 0), (0, 0, 1, 0, 0, 0)): 0.6343987842679637,
((0, 1, 0, 0), (0, 1, 0, 0, 0, 0)): 0.4079605248732454,
((0, 1, 0, 0), (0, 0, 0, 0, 1, 0)): 1.636423967437797,
((0, 0, 1, 0), (0, 0, 0, 1, 0, 0)): 0.9365407795137214,
((0, 1, 0, 0), (1, 0, 0, 0, 0, 0)): 0.1846759438880899,
((0, 0, 1, 0), (0, 0, 0, 0, 1, 0)): 1.636423967437797
}
Now I am not sure about my reasoning behind that decision; I think it was a dumb idea on my part. :/
Working on reproducing this issue, adding print statements to sequence_info.py:
# Create event tree containing all possible mutations
self.event_tree = self.create_event_tree() # Nested dict containing info about all possible mutation events
print(self.event_tree)
art@Wernstrom hexse % python3 -m hexse.run_simulation tests/fixtures/NC_003977.2_HBV.gb tests/fixtures/100_tree.newick tests/fixtures/test_HBV.yaml
Started at: 2022-04-21 09:20:38.126209
Omitted orf: [1375, 1840] in [], has 3 STOP codons
Creating root sequence
{'to_nt': {'A': {'from_nt': {'A': None, 'C': {'category': {'mu1': {}, 'mu2': {}}}, 'G': {'category': {'mu1': {}, 'mu2': {}}}, 'T': {'category': {'mu1': {}, 'mu2': {}}}}}, 'C': {'from_nt': {'A': {'category': {'mu1': {}, 'mu2': {}}}, 'C': None, 'G': {'category': {'mu1': {}, 'mu2': {}}}, 'T': {'category': {'mu1': {}, 'mu2': {}}}}}, 'G': {'from_nt': {'A': {'category': {'mu1': {}, 'mu2': {}}}, 'C': {'category': {'mu1': {}, 'mu2': {}}}, 'G': None, 'T': {'category': {'mu1': {}, 'mu2': {}}}}}, 'T': {'from_nt': {'A': {'category': {'mu1': {}, 'mu2': {}}}, 'C': {'category': {'mu1': {}, 'mu2': {}}}, 'G': {'category': {'mu1': {}, 'mu2': {}}}, 'T': None}}}}
Displaying probability tree prior to loading events:
# Create probability tree with the probabilities for each branch
self.probability_tree = self.create_probability_tree()
print(self.probability_tree)
Output (truncated):
{ 'to_nt': {
'A': {
'from_nt': {
'A': None,
'C': { 'cat': { 'mu1': { 'number_of_events': 0,
'omega': { ((0, 0, 1),): { 'number_of_events': 0,
'prob': 1},
((0, 0, 1), (0, 0, 1)): { 'number_of_events': 0,
'prob': 1.2024609042706574e-58},
((0, 1, 0),): { 'number_of_events': 0,
'prob': 0.004280221602341541},
((0, 1, 0), (0, 1, 0)): { 'number_of_events': 0,
'prob': 5.52077070687705e-21},
((0, 1, 0), (1, 0, 0)): { 'number_of_events': 0,
'prob': 1.2024609042706574e-58},
((1, 0, 0),): { 'number_of_events': 0,
'prob': 0.06938332027600537},
((1, 0, 0), (0, 1, 0)): { 'number_of_events': 0,
'prob': 5.521327258225851e-32},
((1, 0, 0), (1, 0, 0)): { 'number_of_events': 0,
'prob': 6.436270260760223e-13}},
'prob': 0.15865525393132512},
'mu2': { 'number_of_events': 0,
'omega': { ((0, 0, 1),): { 'number_of_events': 0,
'prob': 0.001195682148927411},
((0, 0, 1), (0, 0, 1)): { 'number_of_events': 0,
'prob': 0.001195682148927411},
((0, 1, 0),): { 'number_of_events': 0,
'prob': 0.24837368907275478},
((0, 1, 0), (0, 1, 0)): { 'number_of_events': 0,
'prob': 1.6067401427553765e-56},
((0, 1, 0), (1, 0, 0)): { 'number_of_events': 0,
'prob': 1.3779184773170804e-42},
((1, 0, 0),): { 'number_of_events': 0,
'prob': 0.001195682148927411},
((1, 0, 0), (0, 1, 0)): { 'number_of_events': 0,
'prob': 1.5422301917552726e-21},
((1, 0, 0), (1, 0, 0)): { 'number_of_events': 0,
'prob': 1.7979754721702588e-13}},
'prob': 0.8413447460686749}},
'number_of_events': 0,
'prob': 0.18749999999999997},
At the from_nt level, probabilities sum to one. At the cat level, probabilities also sum to one. However, something funny is going on at the omega level.
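A quick sketch for checking those sums programmatically (assuming the tree layout shown in the output above):

def report_sums(prob_tree):
    """Sketch: print the probability sums at each level of the tree."""
    for to_nt, to_branch in prob_tree['to_nt'].items():
        froms = {k: v for k, v in to_branch['from_nt'].items() if v is not None}
        print(to_nt, 'from_nt sum:', sum(b['prob'] for b in froms.values()))
        for from_nt, branch in froms.items():
            print(' ', from_nt, 'cat sum:',
                  sum(c['prob'] for c in branch['cat'].values()))
            for name, cat in branch['cat'].items():
                # this is the level where the sums go wrong
                print('   ', name, 'omega sum:',
                      sum(o['prob'] for o in cat['omega'].values()))

# e.g. report_sums(self.probability_tree) from inside Sequence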
Displaying the omega keys in sequence_info.py:create_probability_tree():
# extract keys (one-hot tuples) for omega on this branch of event_tree
omegas = self.event_tree['to_nt'][to_nt]['from_nt'][from_nt]['category'][mu_cat].keys()
print(omegas)
sys.exit()
dict_keys([((0, 1, 0),), ((0, 0, 1),), ((1, 0, 0),), ((0, 1, 0), (0, 1, 0)), ((0, 0, 1), (0, 0, 1)), ((0, 1, 0), (1, 0, 0)), ((1, 0, 0), (1, 0, 0)), ((1, 0, 0), (0, 1, 0))])
Pretty:
[
((0, 1, 0), ), # nucleotide is only in one codon
((0, 0, 1), ), # last position represents synonymous substitution
((1, 0, 0), ),
((0, 1, 0), (0, 1, 0)), # nucleotide is in two codons (two reading frames)
((0, 0, 1), (0, 0, 1)), # synonymous in both codons
((0, 1, 0), (1, 0, 0)),
((1, 0, 0), (1, 0, 0)),
((1, 0, 0), (0, 1, 0))
]
print(self.total_omegas)
{
((1, 0, 0), ): 0.2969606107777804,
((0, 1, 0), ): 1.0630393892187806,
((0, 1, 0), (1, 0, 0)): 0.1691149716766538,
((1, 0, 0), (0, 1, 0)): 0.7908850282996809,
((0, 1, 0), (0, 1, 0)): 0.7908850282996809,
((1, 0, 0), (1, 0, 0)): 0.1691149716766538
}
NOTE: the last position of each tuple is always zero, because a one in that position would indicate a synonymous substitution!
Need to investigate how these values are being calculated. Also, why don't we ever see ( , (1, 0, 0))?
@GopiGugan
diff --git a/hexse/sequence_info.py b/hexse/sequence_info.py
index 9457726..c39c89c 100644
--- a/hexse/sequence_info.py
+++ b/hexse/sequence_info.py
@@ -132,14 +132,19 @@ class Sequence:
# extract keys (one-hot tuples) for omega on this branch of event_tree
combos = self.event_tree['to_nt'][to_nt]['from_nt'][from_nt]['category'][mu_cat].keys()
- omega_p = 1
- nonsyn_values = []
+ # nonsyn_values = []
+ # omega_p = 1
+
for combo in combos:
# a combo is a tuple of length = maximum number of overlapping reading frames (?)
# each member of the tuple is a one-hot encoding of non-synonymous or synonymous categories
# (1, 0, 0) = non-synonymous, first omega of two categories
# (0, 1, 0) = non-synonymous, second omega of two categories
# (0, 0, 1) = synonymous (always last position)
+
+ nonsyn_values = []
+ omega_p = 1
+
for omega in combo:
denominator = 1 + sum(self.total_omegas.values())
nonsyn_values.append(omega[:-1])
Noticed that omega_p doesn't get reinitialized for each combo: omega_p gets multiplied repeatedly for each non-synonymous omega, resulting in a very low omega_p value. The nonsyn_values list also continues to build. Does it make sense to reinitialize these values?
omega_p values are larger now, but there are some omega combos with a probability of 1:
((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)) = {dict: 2} {'prob': 1, 'number_of_events': 0}
Does this make sense?
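One guess at the mechanism, sketched below (my reading of the loop above, not the actual HexSE code path): if omega_p is only ever multiplied by non-synonymous factors, a combo that is synonymous in every reading frame never touches omega_p, so it keeps its initial value of 1.

combo = ((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1))   # synonymous in both frames
denominator = 71.0                                  # hypothetical 1 + sum(total_omegas)

omega_p = 1
for one_hot in combo:
    if one_hot.index(1) == len(one_hot) - 1:
        continue                        # synonymous position: no factor applied
    omega_p *= 0.42 / denominator       # hypothetical non-synonymous factor

print(omega_p)   # -> 1: the loop body never fires for an all-synonymous combo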
@horaciobam
Looking at the omega level of the prob_tree, prob_tree['to_nt']['A']['from_nt']['C']['cat']['mu1']['omega']:
'omega': {
...
((1, 0, 0, 0, 0),): {'prob': 0.002421376104695584, 'number_of_events': 0},
((0, 0, 0, 0, 1),): {'prob': 1, 'number_of_events': 0},
((0, 1, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)): {'prob': 5.258933837951714e-10, 'number_of_events': 0},
((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)): {'prob': 1, 'number_of_events': 0},
((1, 0, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): {'prob': 8.271325371426217e-07, 'number_of_events': 0},
...
((0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 0, 0)): {'prob': 8.271325371426217e-07, 'number_of_events': 0},
((0, 0, 1, 0),): {'prob': 0.016055891848977766, 'number_of_events': 0},
((0, 1, 0, 0),): {'prob': 0.006726857691680525, 'number_of_events': 0},
((0, 0, 0, 1),): {'prob': 1, 'number_of_events': 0},
((0, 1, 0, 0, 0), (0, 0, 0, 1)): {'prob': 0.006699143936855141, 'number_of_events': 0},
....
((0, 0, 1, 0, 0), (0, 0, 1, 0)): {'prob': 2.710295322273438e-06, 'number_of_events': 0},
((1, 0, 0, 0, 0), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0)): {'prob': 6.334914861424047e-15, 'number_of_events': 0},
((1, 0, 0, 0), (0, 0, 0, 0, 1, 0)): {'prob': 1.1984252503273628e-05, 'number_of_events': 0},
....
((0, 1, 0, 0), (0, 0, 0, 0, 1, 0)): {'prob': 1.1984252503273628e-05, 'number_of_events': 0},
((0, 0, 0, 1), (0, 0, 0, 0, 0, 1)): {'prob': 1, 'number_of_events': 0},
((0, 1, 0, 0), (0, 1, 0, 0, 0, 0)): {'prob': 1.8568545300340008e-07, 'number_of_events': 0},
....
((0, 1, 0, 0), (0, 0, 0, 1, 0, 0)): {'prob': 2.246485052922519e-06, 'number_of_events': 0},
((0, 0, 1, 0), (1, 0, 0, 0, 0, 0)): {'prob': 7.959217839350828e-08, 'number_of_events': 0},
((0, 0, 1, 0), (0, 0, 0, 0, 1, 0)): {'prob': 1.1984252503273628e-05, 'number_of_events': 0},
((1, 0, 0, 0),): {'prob': 0.0023890147921022745, 'number_of_events': 0}
}
As I noted during our dev meeting this week, it does not make sense that the probability decreases drastically with the number of reading frames. There also seem to be missing entries for multiple reading frames (when we were assessing the entire dictionary, not the abbreviated one above). In other words:
((1, 0, 0, 0, 0), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0))
represents sites that are non-synonymous in three reading frames, with the first omega value in all three. Then we expect to see every combination of omega values, e.g.,
((0, 1, 0, 0, 0), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0))
((0, 0, 1, 0, 0), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0))
...
((1, 0, 0, 0, 0), (0, 1, 0, 0), (1, 0, 0, 0, 0, 0))
The probability entries should also sum to one given the number of reading frames involved. We have probability one for entries that involve a synonymous substitution, which should not be the case. The "omega" value for a synonymous substitution is one. For non-synonymous substitutions, we have omega values that vary around a mean of one. Then we use all of those values to calculate a probability:

P(i, j, k) = \frac{\omega_i \, \omega_j \, \omega_k}{\sum_{i \in I} \sum_{j \in J} \sum_{k \in K} \omega_i \, \omega_j \, \omega_k}

where I is the set of all omega values in the first reading frame, including 1 for synonymous substitutions, and so on for J and K.
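Written out as a sketch (the omega values are reused from the ORF dict above; which frames apply is illustrative, and two frames are shown instead of three for brevity):

from itertools import product

# Each frame's omega set includes 1.0 for synonymous substitutions
I = [0.17314944359111226, 0.42077177796287807,
     0.7209957283986081, 1.4050830500389533, 1.0]   # 4-class ORF + synonymous
J = [0.1708353283825978, 0.4810288100937172,
     1.1481358615121404, 1.0]                       # 3-class ORF + synonymous

total = sum(wi * wj for wi, wj in product(I, J))
probs = {(i, j): (wi * wj) / total
         for i, wi in enumerate(I) for j, wj in enumerate(J)}
print(sum(probs.values()))   # 1.0: probabilities over omega combos sum to one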
We need to have another level in the probability tree, above the "substitution" level (the lowest level), that determines which reading frames the substitution will occur in. The probability associated with each outcome should be determined by the length of the sequence contained within that region, normalized by the total sequence length.
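A sketch of what that extra level could look like, under my assumption that a "region" is a set of positions sharing the same combination of reading frames (coordinates from the four-ORF HBV config below):

from collections import Counter

genome_length = 3182
orfs = {
    'orf_2849_3182': (2849, 3182),
    'orf_0_837': (0, 837),
    'orf_156_837': (156, 837),
    'orf_1902_2454': (1902, 2454),
}

# Count how many positions fall under each combination of reading frames
combo_counts = Counter()
for pos in range(genome_length):
    combo = frozenset(name for name, (start, end) in orfs.items()
                      if start <= pos < end)
    combo_counts[combo] += 1

# Frame-level probabilities: region length normalized by genome length
region_probs = {combo: n / genome_length for combo, n in combo_counts.items()}
print(sum(region_probs.values()))   # 1.0, as a probability level should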
Since the commit that reinitialized values when iterating through combos, the resulting alignment now has mutations across nucleotides that are part of at least one reading frame:
# Simulate HBV genomes with only 4 ORFs
orfs:
2849,3182:
omega_classes: 3
omega_shape: 1.5
omega_dist: gamma
0,837:
omega_classes: 4
omega_shape: 1.7
omega_dist: gamma
156,837:
omega_classes: 6
omega_shape: 1.2
omega_dist: gamma
1902,2454:
omega_classes: 5
omega_shape: 2.5
omega_dist: gamma
The reason why nucleotides outside reading frames are not mutating is that they are never assigned to the event tree, since their omega_keys are empty tuples:
As @GopiGugan mentioned: in the nt_in_event_tree function, since nt.omega_keys are all empty tuples, it breaks out of the if statement and then nt.omega_in_event_tree is set to an empty dictionary. The result is nucleotides that are never considered as candidates for mutation:
>>>>> Nucleotide, position C 100 (in ORF)
* Omega keys {'A': ((0, 1, 0, 0, 0),), 'C': None, 'G': ((0, 1, 0, 0, 0),), 'T': ((0, 0, 1, 0, 0),)}
*Omega in event tree {'A': ((0, 1, 0, 0, 0),), 'G': ((0, 1, 0, 0, 0),), 'T': ((0, 0, 1, 0, 0),)}
>>>>> Nucleotide, position C 1550 (Not in ORF)
* Omega keys {'A': (), 'C': None, 'G': (), 'T': ()}
*Omega in event tree {}
>>>>> Nucleotide, position T 2200 (in ORF)
* Omega keys {'A': ((0, 1, 0, 0, 0, 0),), 'C': ((0, 0, 0, 1, 0, 0),), 'G': ((0, 0, 0, 0, 1, 0),), 'T': None}
*Omega in event tree {'A': ((0, 1, 0, 0, 0, 0),), 'C': ((0, 0, 0, 1, 0, 0),), 'G': ((0, 0, 0, 0, 1, 0),)}
>>>>> Nucleotide, position C 2600 (Not in ORF)
* Omega keys {'A': (), 'C': None, 'G': (), 'T': ()}
*Omega in event tree {}
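A minimal sketch of that failure mode (variable names mirror the printout above; the filtering step is my guess at what nt_in_event_tree effectively does, not its actual code):

# For a position outside every ORF, all omega keys are empty tuples
omega_keys = {'A': (), 'C': None, 'G': (), 'T': ()}   # e.g. position 1550 above

# Filtering on truthiness drops every empty tuple, leaving nothing behind,
# so the nucleotide never enters the event tree at all
omega_in_event_tree = {to_nt: key for to_nt, key in omega_keys.items() if key}
print(omega_in_event_tree)   # {} -- the site can never be selected to mutate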
Run for HBV genome: