PoonLab / HexSE

Simulation of molecular evolution with overlapping reading frames
GNU General Public License v3.0

Even under high mutation rates (0.5), sites are being conserved #109

Closed by horaciobam 2 years ago

horaciobam commented 2 years ago

Run for HBV genome:

python3 -m src.run_simulation temp/run_2022/NC_003977.2_HBV.gb temp/100_tree.newick --outfile temp/out_hbv_0.5.fa --global_rate 0.5

[screenshot of the simulation output]

horaciobam commented 2 years ago

First I thought substitution rates were being wrongly calculated in: https://github.com/PoonLab/HexSE/blob/70b6341b8ea9de7356b4ee91d8a93a4dc8b902fb/hexse/sequence_info.py#L351-L404

But printing nucleotide 2450, which didn't mutate, I see that substitution rates for each nucleotide are being properly calculated:

>>>>> T 2450 {'A': 0.0}
>>>>> T 2450 {'A': 0.0, 'C': 0.0032697286483108056}
>>>>> T 2450 {'A': 0.0, 'C': 0.0032697286483108056, 'G': 0.00011542378622683285}
horaciobam commented 2 years ago

Exploring both probability_tree and event_tree I noticed some inconsistencies:

>>>>>>>>>>>>>>>>>>>>> Probability tree {'to_nt': {'A': {'number_of_events': 1270, 'from_nt': {'A': None, 'T': {'prob': 0.18749999999999997, 'cat': {'mu1': {'prob': 0.15865525393132512, 'omega': {((0, 0, 0, 1, 0),): {'prob': 0.0189688736429855, 'number_of_events': 3}, ((0, 1, 0, 0, 0),): {'prob': 6.120879748328122e-07, 'number_of_events': 5}, ((1, 0, 0, 0, 0),): {'prob': 7.818002314774208e-15, 'number_of_events': 3}, ((0, 0, 1, 0, 0),): {'prob': 7.017537737073388e-23, 'number_of_events': 2}, ((0, 0, 0, 0, 1),): {'prob': 7.017537737073388e-23, 'number_of_events': 4}, ((0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 0)): {'prob': 4.5132776109673465e-57, 'number_of_events': 4}, ((0, 0, 1, 0, 0), (1, 0, 0, 0, 0, 0, 0)): {'prob': 1.0701792730523098e-103, 'number_of_events': 5}, ((0, 0, 1, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {'prob': 2.6236224917117375e-137, 'number_of_events': 3}, ((0, 1, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): {'prob': 2.722953835690513e-184, 'number_of_events': 10}, ((1, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {'prob': 4.706221643657749e-232, 'number_of_events': 5}, ((0, 0, 1, 0, 0), (0, 0, 1, 0, 0, 0, 0)): {'prob': 6.7715449301244425e-307, 'number_of_events': 1}, ((0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): {'prob': 0.0, 'number_of_events': 2}, ((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {'prob': 0.0, 'number_of_events': 6}, ((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)): {'prob': 0.0, 'number_of_events': 14},

  1. There are some omega keys, such as ((0, 1, 0, 0, 0),): {'prob': 6.120879748328122e-07, 'number_of_events': 5}, that end up with a None value. I think it happens when I remove the key once no more nucleotides are assigned to that branch, but I haven't been able to prove this yet.
  2. Some probabilities are zero, e.g. ((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {'prob': 0.0, 'number_of_events': 6}. This should not happen for a branch with 6 events.
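As an aside, a printed 'prob' of exactly 0.0 can simply be double-precision underflow from multiplying many small factors together. A minimal demonstration (generic Python, not HexSE code):

```python
import math

# Multiplying many small factors underflows a 64-bit float to exactly 0.0,
# while summing their logs keeps the quantity representable.
p, log_p = 1.0, 0.0
for _ in range(200):
    p *= 1e-2
    log_p += math.log(1e-2)

print(p)      # 0.0 (underflowed)
print(log_p)  # about -921.03, still finite
```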

>>>>>>>>>>>>>>>>>>>>>EVENT TREE: {'to_nt': {'A': {'from_nt': {'A': None, 'T': {'category': {'mu1': {((0, 0, 0, 1, 0),): [t3, t75, t97, t108, t2, t127], ((0, 1, 0, 0, 0),): [t12, t51], ((0, 0, 1, 0, 0),): [t13, t55, t74, t85, t128], ((1, 0, 0, 0, 0),): [t32, t136], ((0, 0, 0, 0, 1),): [t62, t65, t95, t137], ((0, 0, 1, 0, 0), (0, 0, 0, 0, 1, 0, 0)): [t166, t336, t381, t529, t593], ((0, 0, 1, 0, 0), (1, 0, 0, 0, 0, 0, 0)): [t171, t280, t369, t394, t677], ((0, 1, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)): [t178, t379, t385, t832], ((1, 0, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): [t181, t423, t615, t637, t811], ((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)): [t191, t266, t269, t290, t302, t347, t365, t392, t443, t461, t473, t533, t539, t581, t605, t626, t662, t713, t722, t809, t821], ((0, 0, 0, 1, 0), (0, 0, 0, 1, 0, 0, 0)): [t210, t267, t750], ((1, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): [t211, t212, t277, t516, t730, t771], ((1, 0, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): [t214, t271, t360, t621], ((1, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)): [t219, t255, t261, t582], ((0, 1, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)): [t238, t418, t673, t714, t732, t736], ((0, 0, 1, 0, 0), (0, 1, 0, 0, 0, 0, 0)): [t248, t553, t640, t648, t654], ((0, 0, 1, 0, 0), (0, 0, 0, 1, 0, 0, 0)): [t258, t332, t421, t484], ((1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)): [t275, t685, t756], ((0, 0, 1, 0, 0), (0, 0, 0, 0, 0, 1, 0)): [t278, t789], ((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): [t318, t627], ((0, 0, 0, 1, 0), (0, 0, 0, 0, 0, 1, 0)): [t327, t345, t702, t778, t801], ((0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): [t348, t415, t741, t799, t826], ((0, 1, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)): [t375, t397, t703, t808], ((0, 0, 0, 1, 0), (0, 1, 0, 0, 0, 0, 0)): [t393, t767, t790], ((0, 1, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): [t400, t555], ((1, 0, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)): [t406], ((0, 0, 1, 0, 0), (0, 0, 1, 0, 0, 0, 0)): [t432, t664, t692, t745, t815], ((0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 0, 0)): [t435, t549, t663], ((0, 0, 0, 1, 0), (0, 0, 1, 0, 0, 0, 
0)): [t561, t570, t691, t739], ((0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 0)): [t682], ((1, 0, 0, 0),): [t1822, t1833, t1837, t2864, t2889, t2988, t3003, t3036, t3066], ((0, 0, 0, 1),): [t1823, t1847, t2878, t3052, t3094, t3097, t3148], ((0, 0, 1, 0),): [t1825, t1831, t2862, t2885, t2886, t2888, t2915, t2916, t2935, t2942, t3014, t3116, t3140]

I notice that in the event tree, nucleotides at the tips are grouped by regions of the genome: [t178, t379, t385, t832], [t3, t75, t97, t108, t2, t127] or [t1822, t1833, t1837, t2864, t2889, t2988, t3003, t3036, t3066]. I wonder if this is due to the different omegas for each reading frame:

orfs: 
  2849,3182:
    omega_classes: 3
    omega_shape: 1.5
    omega_dist: gamma

  0,837:
    omega_classes: 4
    omega_shape: 1.7
    omega_dist: gamma

  3173,3182:
    omega_classes: 5
    omega_shape: 1.9
    omega_dist: gamma

  156,837:
    omega_classes: 6
    omega_shape: 1.2
    omega_dist: gamma

  1375,1840:
    omega_classes: 7
    omega_shape: 1.30
    omega_dist: gamma

  1815,2454:
    omega_classes: 3
    omega_shape: 1.25
    omega_dist: gamma

  1853,1922:
    omega_classes: 4
    omega_shape: 1.87
    omega_dist: gamma

  1902,2454:
    omega_classes: 5
    omega_shape: 2.5
    omega_dist: gamma

Should omegas be similar across different reading frames, or is it OK to have different parameters of the gamma distribution for each ORF?

ArtPoon commented 2 years ago

Reformatting probability tree example from above to emphasize hierarchy:

{
  'to_nt': {
    'A': {
      'number_of_events': 1270,
      'from_nt': {
        'A': None,
        'T': {
          'prob': 0.18749999999999997,
          'cat': {
            'mu1': {
              'prob': 0.15865525393132512,
              'omega': {
                ((0, 0, 0, 1, 0),): {
                  'prob': 0.0189688736429855,
                  'number_of_events': 3
                },
                ((0, 1, 0, 0, 0),): {
                  'prob': 6.120879748328122e-07, 
                  'number_of_events': 5
                }, 
                ((1, 0, 0, 0, 0),): {
                  'prob': 7.818002314774208e-15, 
                  'number_of_events': 3
                }, 
                ((0, 0, 1, 0, 0),): {
                  'prob': 7.017537737073388e-23, 
                  'number_of_events': 2
                }, 
                ((0, 0, 0, 0, 1),): {
                  'prob': 7.017537737073388e-23, 
                  'number_of_events': 4
                }, 
                ((0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 0)): {
                  'prob': 4.5132776109673465e-57, 
                  'number_of_events': 4
                }, 
                ((0, 0, 1, 0, 0), (1, 0, 0, 0, 0, 0, 0)): {
                  'prob': 1.0701792730523098e-103, 
                  'number_of_events': 5
                }, 
                ((0, 0, 1, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {
                  'prob': 2.6236224917117375e-137, 
                  'number_of_events': 3
                }, 
                ((0, 1, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): {
                  'prob': 2.722953835690513e-184, 
                  'number_of_events': 10
                }, 
                ((1, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {
                  'prob': 4.706221643657749e-232, 
                  'number_of_events': 5
                }, 
                ((0, 0, 1, 0, 0), (0, 0, 1, 0, 0, 0, 0)): {
                  'prob': 6.7715449301244425e-307, 
                  'number_of_events': 1
                }, 
                ((0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): {
                  'prob': 0.0, 
                  'number_of_events': 2
                }, 
                ((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): {
                  'prob': 0.0, 
                  'number_of_events': 6
                }, 
                ((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)): {
                  'prob': 0.0, 
                  'number_of_events': 14
                },
ArtPoon commented 2 years ago

I think there's something wrong with how probabilities are being calculated for the discretized gamma distribution. Is this distribution actually being discretized? (we should be integrating the probability density from one value at the left of a rate category to the right, not taking the pdf at a single point)
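For reference, the standard equal-probability discretization with K categories integrates the density over each interval rather than taking the pdf at a single point; a common choice is the conditional mean of each interval:

    rate_k = K * ∫[q_k, q_{k+1}] x·f(x) dx,   with q_k = F⁻¹(k/K) for k = 0, …, K−1 and q_K = ∞

where f is the gamma density and F⁻¹ its quantile function; each category then carries probability 1/K.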

horaciobam commented 2 years ago

When calculating the omega values for each orf, we are getting the following dict:

{
    '+': [
        {
        'coords': [[2849, 3182]], 
        'omega_shape': 1.5, 
        'omega_classes': 3, 
        'omega_values': [0.1708353283825978, 0.4810288100937172, 1.1481358615121404]
        }, 
        {
        'coords': [[0, 837]], 
        'omega_shape': 1.7, 
        'omega_classes': 4, 
        'omega_values': [0.17314944359111226, 0.42077177796287807, 0.7209957283986081, 1.4050830500389533]
        }, 
        {
        'coords': [[3173, 3182]], 
        'omega_shape': 1.9, 
        'omega_classes': 5, 
        'omega_values': [0.1846759438880899, 0.4079605248732454, 0.6343987842679637, 0.9365407795137214, 1.636423967437797]
        }, 
    ], 
    '-': []
}

The values do not sum to one because the omega_values are drawn from a discretized gamma distribution, as follows:

import numpy as np
import scipy.integrate
import scipy.stats as ss

def discretize(alpha, ncat, dist):
    """
    Divide a distribution into a number of intervals with equal probability and return the conditional mean of each interval
    From https://gist.github.com/kgori/95f604131ce92ec15f4338635a86dfb9
    :param alpha: shape parameter
    :param ncat: Number of categories
    :param dist: distribution of probabilities, can be a string
    :return: array with ncat number of values
    """

    # Handle if distribution was specified from input file (parsed as a string)
    if dist == 'ss.gamma' or dist == 'gamma':
        dist = ss.gamma
    elif dist == 'ss.lognorm' or dist == 'lognorm':
        dist = ss.lognorm

    if dist == ss.gamma:
        dist = dist(alpha, scale=0.4)
    elif dist == ss.lognorm:
        dist = dist(s=alpha, scale=0.5)  # scale=np.exp(0.05 * alpha**2)

    quantiles = dist.ppf(np.arange(0, ncat) / ncat)
    rates = np.zeros(ncat, dtype=np.double)

    for i in range(ncat-1):
        rates[i] = (ncat * scipy.integrate.quad(lambda x: x * dist.pdf(x), quantiles[i], quantiles[i+1])[0])

    rates[ncat-1] = ncat * scipy.integrate.quad(lambda x: x * dist.pdf(x), quantiles[ncat-1], np.inf)[0]

    return rates

How should this function work instead?
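As a sanity check on the version above (verification, not a fix): with equal-probability intervals, the category means weighted by 1/ncat must recover the distribution mean, alpha × scale. A self-contained sketch:

```python
import numpy as np
import scipy.integrate
import scipy.stats as ss

def discretize_gamma(alpha, ncat, scale=0.4):
    """Split gamma(alpha, scale) into ncat equal-probability intervals and
    return the conditional mean of each interval."""
    dist = ss.gamma(alpha, scale=scale)
    bounds = dist.ppf(np.arange(ncat + 1) / ncat)  # quantile boundaries
    bounds[-1] = np.inf                            # upper tail is unbounded
    rates = np.zeros(ncat)
    for i in range(ncat):
        # E[X | bounds[i] < X < bounds[i+1]] = ncat * integral of x*f(x)
        val, _ = scipy.integrate.quad(lambda x: x * dist.pdf(x),
                                      bounds[i], bounds[i + 1])
        rates[i] = ncat * val
    return rates

rates = discretize_gamma(1.5, 3)
print(rates)        # matches the omega_values reported for alpha=1.5, ncat=3
print(rates.mean()) # 0.6 = 1.5 * 0.4, the gamma mean
```

For alpha=1.5, ncat=3, scale=0.4 this reproduces the omega_values [0.1708..., 0.4810..., 1.1481...] shown earlier, whose equally weighted mean is the gamma mean 0.6, so the discretization itself appears internally consistent.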

horaciobam commented 2 years ago

This does not solve the problem of some events having very low (even zero) probabilities in very specific parts of the sequence.

When we are selecting the mutation, I think that we are somehow also modifying the event tree, due to the deepcopy implementation:

https://github.com/PoonLab/HexSE/blob/70b6341b8ea9de7356b4ee91d8a93a4dc8b902fb/hexse/simulation.py#L66-L94

ArtPoon commented 2 years ago

Why is there something being raised to the power of something else here? https://github.com/PoonLab/HexSE/blob/ec9adb193d19ab21b9e245316d7421c5405a8bc0/hexse/sequence_info.py#L443

ArtPoon commented 2 years ago

This denominator is a potential cause of vanishingly small probabilities: https://github.com/PoonLab/HexSE/blob/ec9adb193d19ab21b9e245316d7421c5405a8bc0/hexse/sequence_info.py#L140

ArtPoon commented 2 years ago

I think total_omegas are actual omega values (dN/dS), not probabilities or rates.

GopiGugan commented 2 years ago

Wondering what the difference between omegas and self.total_omegas is:

https://github.com/PoonLab/HexSE/blob/ec9adb193d19ab21b9e245316d7421c5405a8bc0/hexse/sequence_info.py#L138-L140

Need to check if synonymous mutations are being miscalculated and how synonymous mutations are being stored:

https://github.com/PoonLab/HexSE/blob/ec9adb193d19ab21b9e245316d7421c5405a8bc0/hexse/sequence_info.py#L143-L159

ArtPoon commented 2 years ago

https://github.com/PoonLab/HexSE/blob/36c257f7fee3b760c03f8ba9095584dfee018dcd/hexse/sequence_info.py#L174-L192 I'm wondering if line 175 should be inside the loop - otherwise we keep appending entries to nonsyn_values

ArtPoon commented 2 years ago

This code block needs better documentation!

horaciobam commented 2 years ago

@GopiGugan omegas is a list of the keys for omega values available on the tree:

dict_keys(
    [
        ((0, 0, 1, 0, 0),), 
        ((1, 0, 0, 0, 0),), 
        ((0, 0, 0, 0, 1),), 
        ((0, 1, 0, 0, 0),), 
        ((0, 0, 0, 1, 0),), 
        ((0, 0, 1, 0, 0), (0, 0, 0, 0, 0, 1, 0)), 
        ((0, 0, 0, 1, 0), (0, 0, 0, 0, 0, 1, 0)), 
        ((0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 0, 0)), 
        ((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)), 
        ((1, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)), 
        ((0, 1, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)), 
        ((1, 0, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)), 
        ((0, 0, 1, 0, 0), (0, 1, 0, 0, 0, 0, 0)), 
        ((1, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)), 
        ((1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)), 
        ((0, 0, 0, 1, 0), (0, 0, 1, 0, 0, 0, 0)), 
        ((0, 0, 0, 1, 0), (0, 1, 0, 0, 0, 0, 0)), 
        ((0, 0, 1, 0, 0), (0, 0, 0, 0, 1, 0, 0)), 
        ((0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 0)), 
        ((0, 0, 1, 0, 0), (1, 0, 0, 0, 0, 0, 0)), 
        ((0, 1, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)), 
        ((1, 0, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)), 
        ((0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)), 
        ((0, 0, 0, 1, 0), (0, 0, 0, 1, 0, 0, 0)), 
        ((1, 0, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)), 
        ((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)), 
        ((0, 1, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)), 
        ((0, 0, 1, 0, 0), (0, 0, 0, 1, 0, 0, 0)), 
        ((0, 0, 1, 0),), 
        ((1, 0, 0, 0),), 
        ((0, 1, 0, 0),), 
        ((0, 0, 0, 1, 0), (1, 0, 0, 0)), 
        ((1, 0, 0, 0, 0), (1, 0, 0, 0)), 
        ((0, 0, 0, 0, 1), (0, 1, 0, 0)), 
        ((0, 0, 0, 1, 0), (0, 0, 0, 1)), 
        ((0, 0, 0, 0, 1), (1, 0, 0, 0)), 
        ((0, 0, 1, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0, 0, 1)), 
        ((0, 0, 1, 0), (0, 1, 0, 0, 0, 0)), 
        ((1, 0, 0, 0), (0, 0, 0, 1, 0, 0)), 
        ((0, 0, 0, 1), (0, 0, 0, 0, 0, 1)), 
        ((0, 1, 0, 0), (0, 1, 0, 0, 0, 0)), 
        ((0, 1, 0, 0), (1, 0, 0, 0, 0, 0)), 
        ((1, 0, 0, 0), (0, 1, 0, 0, 0, 0)), 
        ((0, 0, 1, 0), (0, 0, 0, 1, 0, 0)), 
        ((0, 1, 0, 0), (0, 0, 1, 0, 0, 0)), 
        ((0, 0, 1, 0), (0, 0, 0, 0, 1, 0)), 
        ((0, 0, 1, 0), (1, 0, 0, 0, 0, 0)), 
        ((0, 1, 0, 0), (0, 0, 0, 0, 1, 0)), 
        ((1, 0, 0, 0), (0, 0, 1, 0, 0, 0)), 
        ((1, 0, 0, 0), (0, 0, 0, 0, 1, 0)), 
        ((1, 0, 0, 0), (1, 0, 0, 0, 0, 0)), 
        ((0, 1, 0, 0), (0, 0, 0, 1, 0, 0)), 
        ((0, 0, 0, 1),)

])

Each tuple has as many positions as the number of omega categories for a given distribution, plus one final position for synonymous substitutions. A one corresponds to the specific omega value that a given nucleotide was multiplied by in order to obtain its substitution rate. A nucleotide might have several omega tuples assigned if it is involved in more than one reading frame.
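To make the encoding concrete, here is a small sketch (with made-up omega values, not the ones HexSE draws) that decodes a one-hot key back into the combined omega applied to a substitution:

```python
# Illustrative omega values per ORF (hypothetical, for demonstration only).
orf_omegas = [
    [0.17, 0.48, 1.15],        # ORF with 3 omega classes -> tuples of length 4
    [0.17, 0.42, 0.72, 1.41],  # ORF with 4 omega classes -> tuples of length 5
]

def combined_omega(key, orf_omegas):
    """Multiply the omega selected by each one-hot tuple; a hot last
    position flags a synonymous change in that frame (omega = 1)."""
    total = 1.0
    for one_hot, values in zip(key, orf_omegas):
        idx = one_hot.index(1)
        total *= 1.0 if idx == len(one_hot) - 1 else values[idx]
    return total

# Non-synonymous (second omega class) in frame 1, synonymous in frame 2:
print(combined_omega(((0, 1, 0, 0), (0, 0, 0, 0, 1)), orf_omegas))  # 0.48
```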

total_omegas is an actual dictionary with the values for each key:

 {
     ((0, 0, 1, 0, 0),): 0.7209957283986081, 
     ((0, 1, 0, 0, 0),): 0.42077177796287807, 
     ((1, 0, 0, 0, 0),): 0.17314944359111226, 
     ((0, 0, 0, 1, 0),): 1.4050830500389533, 
     ((0, 0, 1, 0, 0), (0, 0, 1, 0, 0, 0, 0)): 0.2853664305928927, 
     ((0, 0, 0, 1, 0), (0, 0, 0, 0, 0, 1, 0)): 1.2609093415549824, 
     ((0, 1, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)): 0.2853664305928927, 
     ((1, 0, 0, 0, 0), (0, 0, 1, 0, 0, 0, 0)): 0.2853664305928927, 
     ((0, 1, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)): 0.05771979106727954, 
     ((0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): 1.2609093415549824, 
     ((0, 0, 0, 1, 0), (0, 0, 1, 0, 0, 0, 0)): 0.2853664305928927, 
     ((0, 0, 1, 0, 0), (0, 1, 0, 0, 0, 0, 0)): 0.1642586933267217, 
     ((1, 0, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): 0.44049638575624284, 
     ((1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)): 0.05771979106727954, 
     ((0, 0, 1, 0, 0), (1, 0, 0, 0, 0, 0, 0)): 0.05771979106727954, 
     ((0, 1, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)): 0.1642586933267217, 
     ((0, 0, 0, 1, 0), (0, 0, 0, 1, 0, 0, 0)): 0.44049638575624284, 
     ((0, 0, 1, 0, 0), (0, 0, 0, 1, 0, 0, 0)): 0.44049638575624284, 
     ((0, 0, 1, 0, 0), (0, 0, 0, 0, 0, 1, 0)): 1.2609093415549824, 
     ((1, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1, 0)): 1.2609093415549824, 
     ((0, 0, 0, 1, 0), (0, 1, 0, 0, 0, 0, 0)): 0.1642586933267217, 
     ((0, 0, 1, 0, 0), (0, 0, 0, 0, 1, 0, 0)): 0.6712493575878171, 
     ((0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 0, 0)): 0.6712493575878171, 
     ((0, 0, 0, 1, 0), (1, 0, 0, 0, 0, 0, 0)): 0.05771979106727954, 
     ((0, 1, 0, 0, 0), (0, 0, 0, 1, 0, 0, 0)): 0.44049638575624284, 
     ((1, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)): 0.1642586933267217, 
     ((1, 0, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): 0.6712493575878171, 
     ((0, 1, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): 0.6712493575878171, 
     ((0, 1, 0, 0),): 0.4810288100937172, 
     ((0, 0, 1, 0),): 1.1481358615121404, 
     ((1, 0, 0, 0),): 0.1708353283825978, 
     ((0, 1, 0, 0, 0), (0, 0, 0, 1)): 0.4790470356727463, 
     ((0, 0, 0, 1, 0), (0, 1, 0, 0)): 0.38254280439302735, 
     ((1, 0, 0, 0, 0), (0, 1, 0, 0)): 0.38254280439302735, 
     ((0, 0, 0, 1, 0), (1, 0, 0, 0)): 0.1204513150144901, 
     ((0, 0, 1, 0, 0), (0, 0, 1, 0)): 0.9970058805432849, 
     ((0, 0, 0, 0, 1), (0, 1, 0, 0)): 0.38254280439302735, 
     ((0, 0, 0, 0, 1), (1, 0, 0, 0)): 0.1204513150144901, 
     ((0, 0, 1, 0, 0), (0, 0, 0, 1)): 0.7976433287340154, 
     ((0, 0, 1, 0, 0), (1, 0, 0, 0)): 0.1204513150144901, 
     ((0, 1, 0, 0, 0), (0, 1, 0, 0)): 0.38254280439302735, 
     ((1, 0, 0, 0, 0), (1, 0, 0, 0)): 0.1204513150144901, 
     ((0, 0, 0, 1, 0), (0, 0, 0, 1)): 1.5079291778341823, 
     ((1, 0, 0, 0, 0), (0, 0, 0, 1)): 0.20738045775300373, 
     ((0, 1, 0, 0, 0), (0, 0, 1, 0)): 0.9970058805432849, 
     ((1, 0, 0, 0, 0), (0, 0, 1, 0)): 0.9970058805432849, 
     ((0, 1, 0, 0, 0), (1, 0, 0, 0)): 0.1204513150144901, 
     ((0, 0, 0, 0, 1), (0, 0, 1, 0)): 0.9970058805432849, 
     ((0, 0, 1, 0, 0), (0, 1, 0, 0)): 0.38254280439302735, 
     ((0, 0, 0, 1, 0), (0, 0, 1, 0)): 0.9970058805432849, 
     ((1, 0, 0, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162, 
     ((0, 0, 0, 1, 0), (1, 0, 0, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162, 
     ((1, 0, 0, 0, 0), (0, 1, 0, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998, 
     ((0, 0, 0, 0, 1), (0, 1, 0, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998, 
     ((0, 0, 0, 0, 1), (0, 1, 0, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162, 
     ((0, 0, 0, 0, 1), (0, 0, 1, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495, 
     ((0, 1, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771, 
     ((0, 0, 1, 0, 0), (0, 0, 1, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495, 
     ((0, 0, 0, 1, 0), (0, 0, 0, 1), (0, 0, 0, 0, 0, 1)): 1.5079291778341823, 
     ((1, 0, 0, 0, 0), (0, 0, 1, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998, 
     ((0, 0, 0, 1, 0), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495, 
     ((0, 0, 1, 0, 0), (1, 0, 0, 0), (0, 0, 1, 0, 0, 0)): 0.8730832667988639, 
     ((0, 0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162, 
     ((0, 0, 0, 1, 0), (0, 0, 1, 0), (0, 0, 1, 0, 0, 0)): 0.8730832667988639, 
     ((1, 0, 0, 0, 0), (1, 0, 0, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998, 
     ((0, 1, 0, 0, 0), (0, 1, 0, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495, 
     ((0, 1, 0, 0, 0), (0, 0, 1, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998, 
     ((0, 0, 1, 0, 0), (0, 1, 0, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771, 
     ((0, 0, 0, 0, 1), (1, 0, 0, 0), (0, 0, 0, 0, 1, 0)): 1.9969734953254998, 
     ((0, 0, 0, 0, 1), (1, 0, 0, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771, 
     ((1, 0, 0, 0, 0), (0, 1, 0, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495, 
     ((0, 1, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0, 0, 1)): 0.4790470356727463, 
     ((0, 0, 0, 1, 0), (0, 0, 1, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771, 
     ((1, 0, 0, 0, 0), (0, 0, 1, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495, 
     ((0, 0, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162, 
     ((0, 0, 0, 0, 1), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0)): 0.3075971532102495, 
     ((0, 0, 1, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0, 0, 1)): 0.7976433287340154, 
     ((0, 0, 0, 1, 0), (1, 0, 0, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771, 
     ((0, 1, 0, 0, 0), (1, 0, 0, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162, 
     ((1, 0, 0, 0, 0), (0, 0, 1, 0), (0, 0, 1, 0, 0, 0)): 0.8730832667988639, 
     ((1, 0, 0, 0), (1, 0, 0, 0, 0, 0)): 0.1846759438880899, 
     ((0, 1, 0, 0), (0, 0, 1, 0, 0, 0)): 0.8730832667988639, 
     ((1, 0, 0, 0), (0, 0, 0, 0, 1, 0)): 1.636423967437797, 
     ((1, 0, 0, 0), (0, 0, 0, 1, 0, 0)): 0.9365407795137214, 
     ((1, 0, 0, 0), (0, 0, 1, 0, 0, 0)): 0.6343987842679637, 
     ((0, 1, 0, 0), (0, 0, 0, 1, 0, 0)): 1.2223993729945162, 
     ((0, 0, 1, 0), (1, 0, 0, 0, 0, 0)): 0.1846759438880899, 
     ((0, 0, 1, 0), (0, 1, 0, 0, 0, 0)): 0.4079605248732454, 
     ((1, 0, 0, 0), (0, 1, 0, 0, 0, 0)): 0.5999467116400771, 
     ((0, 0, 1, 0), (0, 0, 1, 0, 0, 0)): 0.6343987842679637, 
     ((0, 1, 0, 0), (0, 1, 0, 0, 0, 0)): 0.4079605248732454, 
     ((0, 1, 0, 0), (0, 0, 0, 0, 1, 0)): 1.636423967437797, 
     ((0, 0, 1, 0), (0, 0, 0, 1, 0, 0)): 0.9365407795137214, 
     ((0, 1, 0, 0), (1, 0, 0, 0, 0, 0)): 0.1846759438880899, 
     ((0, 0, 1, 0), (0, 0, 0, 0, 1, 0)): 1.636423967437797

     }

Now I am not sure about my reasoning behind that decision; I think it was a bad idea on my part. :/

ArtPoon commented 2 years ago

Working on reproducing this issue, adding print statements to sequence_info.py

        # Create event tree containing all possible mutations
        self.event_tree = self.create_event_tree()  # Nested dict containing info about all possible mutation events
        print(self.event_tree)
art@Wernstrom hexse % python3 -m hexse.run_simulation tests/fixtures/NC_003977.2_HBV.gb tests/fixtures/100_tree.newick tests/fixtures/test_HBV.yaml                    

Started at:  2022-04-21 09:20:38.126209
Omitted orf: [1375, 1840] in [], has 3 STOP codons

Creating root sequence
{'to_nt': {'A': {'from_nt': {'A': None, 'C': {'category': {'mu1': {}, 'mu2': {}}}, 'G': {'category': {'mu1': {}, 'mu2': {}}}, 'T': {'category': {'mu1': {}, 'mu2': {}}}}}, 'C': {'from_nt': {'A': {'category': {'mu1': {}, 'mu2': {}}}, 'C': None, 'G': {'category': {'mu1': {}, 'mu2': {}}}, 'T': {'category': {'mu1': {}, 'mu2': {}}}}}, 'G': {'from_nt': {'A': {'category': {'mu1': {}, 'mu2': {}}}, 'C': {'category': {'mu1': {}, 'mu2': {}}}, 'G': None, 'T': {'category': {'mu1': {}, 'mu2': {}}}}}, 'T': {'from_nt': {'A': {'category': {'mu1': {}, 'mu2': {}}}, 'C': {'category': {'mu1': {}, 'mu2': {}}}, 'G': {'category': {'mu1': {}, 'mu2': {}}}, 'T': None}}}}
ArtPoon commented 2 years ago

Displaying probability tree prior to loading events:

        # Create probability tree with the probabilities for each branch
        self.probability_tree = self.create_probability_tree()
        print(self.probability_tree)

Output (truncated):

{ 'to_nt': { 
  'A': { 
    'from_nt': { 
      'A': None,
      'C': { 'cat': { 'mu1': { 'number_of_events': 0,
                               'omega': { ((0, 0, 1),): { 'number_of_events': 0,
                                                          'prob': 1},
                                          ((0, 0, 1), (0, 0, 1)): { 'number_of_events': 0,
                                                                    'prob': 1.2024609042706574e-58},
                                          ((0, 1, 0),): { 'number_of_events': 0,
                                                          'prob': 0.004280221602341541},
                                          ((0, 1, 0), (0, 1, 0)): { 'number_of_events': 0,
                                                                    'prob': 5.52077070687705e-21},
                                          ((0, 1, 0), (1, 0, 0)): { 'number_of_events': 0,
                                                                    'prob': 1.2024609042706574e-58},
                                          ((1, 0, 0),): { 'number_of_events': 0,
                                                          'prob': 0.06938332027600537},
                                          ((1, 0, 0), (0, 1, 0)): { 'number_of_events': 0,
                                                                    'prob': 5.521327258225851e-32},
                                          ((1, 0, 0), (1, 0, 0)): { 'number_of_events': 0,
                                                                    'prob': 6.436270260760223e-13}},
                               'prob': 0.15865525393132512},
                      'mu2': { 'number_of_events': 0,
                               'omega': { ((0, 0, 1),): { 'number_of_events': 0,
                                                          'prob': 0.001195682148927411},
                                          ((0, 0, 1), (0, 0, 1)): { 'number_of_events': 0,
                                                                    'prob': 0.001195682148927411},
                                          ((0, 1, 0),): { 'number_of_events': 0,
                                                          'prob': 0.24837368907275478},
                                          ((0, 1, 0), (0, 1, 0)): { 'number_of_events': 0,
                                                                    'prob': 1.6067401427553765e-56},
                                          ((0, 1, 0), (1, 0, 0)): { 'number_of_events': 0,
                                                                    'prob': 1.3779184773170804e-42},
                                          ((1, 0, 0),): { 'number_of_events': 0,
                                                          'prob': 0.001195682148927411},
                                          ((1, 0, 0), (0, 1, 0)): { 'number_of_events': 0,
                                                                    'prob': 1.5422301917552726e-21},
                                          ((1, 0, 0), (1, 0, 0)): { 'number_of_events': 0,
                                                                    'prob': 1.7979754721702588e-13}},
                               'prob': 0.8413447460686749}},
             'number_of_events': 0,
             'prob': 0.18749999999999997},

At the from_nt level, probabilities sum to one. At the cat level, probabilities also sum to one. However, something funny is going on at the omega level.

ArtPoon commented 2 years ago

Displaying omega keys in sequence_info.py:create_probability_tree():

                    # extract keys (one-hot tuples) for omega on this branch of event_tree
                    omegas = self.event_tree['to_nt'][to_nt]['from_nt'][from_nt]['category'][mu_cat].keys()
                    print(omegas)
                    sys.exit()
dict_keys([((0, 1, 0),), ((0, 0, 1),), ((1, 0, 0),), ((0, 1, 0), (0, 1, 0)), ((0, 0, 1), (0, 0, 1)), ((0, 1, 0), (1, 0, 0)), ((1, 0, 0), (1, 0, 0)), ((1, 0, 0), (0, 1, 0))])

Pretty:

[
  ((0, 1, 0), ),  # nucleotide is only in one codon
  ((0, 0, 1), ),  # last position represents synonymous substitution
  ((1, 0, 0), ), 
  ((0, 1, 0), (0, 1, 0)),  # nucleotide is in two codons (two reading frames)
  ((0, 0, 1), (0, 0, 1)),  # synonymous in both codons
  ((0, 1, 0), (1, 0, 0)), 
  ((1, 0, 0), (1, 0, 0)), 
  ((1, 0, 0), (0, 1, 0))
]
ArtPoon commented 2 years ago
print(self.total_omegas)
{
  ((1, 0, 0), ): 0.2969606107777804, 
  ((0, 1, 0), ): 1.0630393892187806, 
  ((0, 1, 0), (1, 0, 0)): 0.1691149716766538, 
  ((1, 0, 0), (0, 1, 0)): 0.7908850282996809, 
  ((0, 1, 0), (0, 1, 0)): 0.7908850282996809, 
  ((1, 0, 0), (1, 0, 0)): 0.1691149716766538
}

NOTE last position of tuple is always zero, because that indicates synonymous substitution!

Need to investigate how these values are being calculated. Also why don't we ever see ( , (1, 0, 0))?

ArtPoon commented 2 years ago

@GopiGugan

GopiGugan commented 2 years ago
diff --git a/hexse/sequence_info.py b/hexse/sequence_info.py
index 9457726..c39c89c 100644
--- a/hexse/sequence_info.py
+++ b/hexse/sequence_info.py
@@ -132,14 +132,19 @@ class Sequence:

                     # extract keys (one-hot tuples) for omega on this branch of event_tree
                     combos = self.event_tree['to_nt'][to_nt]['from_nt'][from_nt]['category'][mu_cat].keys()
-                    omega_p = 1
-                    nonsyn_values = []
+                    # nonsyn_values = []
+                    # omega_p = 1
+
                     for combo in combos:
                         # a combo is a tuple of length = maximum number of overlapping reading frames (?)
                         # each member of the tuple is a one-hot encoding of non-synonymous or synonymous categories
                         # (1, 0, 0) = non-synonymous, first omega of two categories
                         # (0, 1, 0) = non-synonymous, second omega of two categories
                         # (0, 0, 1) = synonymous (always last position)
+
+                        nonsyn_values = []
+                        omega_p = 1
+                        
                         for omega in combo:
                             denominator = 1 + sum(self.total_omegas.values())
                             nonsyn_values.append(omega[:-1])

https://github.com/PoonLab/HexSE/blob/bb66bb3cc6bcfcf23717a90af2909e298e2375ee/hexse/sequence_info.py#L143-L158

Noticed that omega_p doesn't get reinitialized for each combo: it gets multiplied repeatedly for each non-synonymous omega, resulting in a very low omega_p value, and the nonsyn_values list also continues to build. Does it make sense to reinitialize these values?
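A stripped-down illustration of the accumulation issue, with toy numbers rather than HexSE's actual values:

```python
# One accumulator shared across combos (buggy) versus resetting per combo (fixed).
combos = [(0.5,), (0.5, 0.5), (0.5,)]

buggy, omega_p = [], 1.0
for combo in combos:
    for w in combo:
        omega_p *= w
    buggy.append(omega_p)   # products leak across combos: 0.5, 0.125, 0.0625

fixed = []
for combo in combos:
    omega_p = 1.0           # reinitialized for each combo
    for w in combo:
        omega_p *= w
    fixed.append(omega_p)   # 0.5, 0.25, 0.5

print(buggy, fixed)
```

With many combos and small omegas, the shared accumulator shrinks toward the vanishing probabilities seen in the trees above.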

omega_p values are larger now, but there are some omega combos with a probability of 1:

((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)) = {dict: 2} {'prob': 1, 'number_of_events': 0}

Does this make sense?

@horaciobam

GopiGugan commented 2 years ago

omega level of the prob_tree

prob_tree['to_nt']['A']['from_nt']['C']['cat']['mu1']['omega']:

'omega': {
    ...
    ((1, 0, 0, 0, 0),): {'prob': 0.002421376104695584, 'number_of_events': 0}, 
    ((0, 0, 0, 0, 1),): {'prob': 1, 'number_of_events': 0}, 
    ((0, 1, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0)): {'prob': 5.258933837951714e-10, 'number_of_events': 0}, 
    ((0, 0, 0, 0, 1), (0, 0, 0, 0, 0, 0, 1)): {'prob': 1, 'number_of_events': 0}, 
    ((1, 0, 0, 0, 0), (0, 0, 0, 0, 1, 0, 0)): {'prob': 8.271325371426217e-07, 'number_of_events': 0}, 
    ...
    ((0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 0, 0)): {'prob': 8.271325371426217e-07, 'number_of_events': 0}, 
    ((0, 0, 1, 0),): {'prob': 0.016055891848977766, 'number_of_events': 0}, 
    ((0, 1, 0, 0),): {'prob': 0.006726857691680525, 'number_of_events': 0}, 
    ((0, 0, 0, 1),): {'prob': 1, 'number_of_events': 0}, 
    ((0, 1, 0, 0, 0), (0, 0, 0, 1)): {'prob': 0.006699143936855141, 'number_of_events': 0}, 
    ....
    ((0, 0, 1, 0, 0), (0, 0, 1, 0)): {'prob': 2.710295322273438e-06, 'number_of_events': 0}, 
    ((1, 0, 0, 0, 0), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0)): {'prob': 6.334914861424047e-15, 'number_of_events': 0}, 
    ((1, 0, 0, 0), (0, 0, 0, 0, 1, 0)): {'prob': 1.1984252503273628e-05, 'number_of_events': 0}, 
    .... 
    ((0, 1, 0, 0), (0, 0, 0, 0, 1, 0)): {'prob': 1.1984252503273628e-05, 'number_of_events': 0}, 
    ((0, 0, 0, 1), (0, 0, 0, 0, 0, 1)): {'prob': 1, 'number_of_events': 0}, 
    ((0, 1, 0, 0), (0, 1, 0, 0, 0, 0)): {'prob': 1.8568545300340008e-07, 'number_of_events': 0}, 
    ....
    ((0, 1, 0, 0), (0, 0, 0, 1, 0, 0)): {'prob': 2.246485052922519e-06, 'number_of_events': 0}, 
    ((0, 0, 1, 0), (1, 0, 0, 0, 0, 0)): {'prob': 7.959217839350828e-08, 'number_of_events': 0}, 
    ((0, 0, 1, 0), (0, 0, 0, 0, 1, 0)): {'prob': 1.1984252503273628e-05, 'number_of_events': 0}, 
    ((1, 0, 0, 0),): {'prob': 0.0023890147921022745, 'number_of_events': 0}
}
ArtPoon commented 2 years ago

As I noted during our dev meeting this week, it does not make sense that the probability decreases drastically with the number of reading frames. There also seem to be entries missing for multiple reading frames (when we were assessing the entire dictionary, not the abbreviated one above). In other words, ((1, 0, 0, 0, 0), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0)) represents sites that are non-synonymous in three reading frames, with the first omega value in all three. Then we expect to see every combination of omega values, e.g.,

((0, 1, 0, 0, 0), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0))
((0, 0, 1, 0, 0), (1, 0, 0, 0), (1, 0, 0, 0, 0, 0))
...
((1, 0, 0, 0, 0), (0, 1, 0, 0), (1, 0, 0, 0, 0, 0))

The probability entries should also sum to one for a given number of reading frames involved. We currently have probability one for entries that involve a synonymous substitution, which should not be the case: the "omega" value for a synonymous substitution is one, while for non-synonymous substitutions the omega values vary around a mean of one. We then use all of those values to calculate a probability:

image

where I is all omega values in the first reading frame, including 1 for synonymous substitutions, and so on for J and K.
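The equation image did not survive extraction; a plausible reconstruction from the description above, treating I, J and K as the sets of omega values available in each of the three reading frames, is:

```latex
P(\omega_i, \omega_j, \omega_k) =
  \frac{\omega_i}{\sum_{\omega \in I} \omega} \times
  \frac{\omega_j}{\sum_{\omega \in J} \omega} \times
  \frac{\omega_k}{\sum_{\omega \in K} \omega}
```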

ArtPoon commented 2 years ago

We need another level in the probability tree, above the "substitution" level (the lowest level), that determines which reading frame(s) the substitution will occur in. The probability associated with each outcome should be determined by the length of the sequence contained within that region, normalized by the total sequence length.
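A sketch of that proposed extra level: the probability that a substitution lands in a region is the region's length over the total sequence length. Coordinates below are taken from the 4-ORF HBV config later in this thread; the function itself is illustrative, not the HexSE implementation:

```python
# Sketch: region-level probabilities, proportional to region length and
# normalized by the total sequence length. Overlap handling is ignored.
def region_probabilities(regions, seq_length):
    """regions: dict of name -> (start, end), half-open coordinates."""
    return {name: (end - start) / seq_length
            for name, (start, end) in regions.items()}

probs = region_probabilities({"orf_2849_3182": (2849, 3182),
                              "orf_1902_2454": (1902, 2454)}, 3182)
```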

horaciobam commented 2 years ago

Since the commit that reinitialized values when iterating through combos, the resulting alignment now has mutations across nucleotides that are part of at least one reading frame:

image

# Simulate HBV genomes with only 4 ORFs
orfs: 
  2849,3182:
    omega_classes: 3
    omega_shape: 1.5
    omega_dist: gamma

  0,837:
    omega_classes: 4
    omega_shape: 1.7
    omega_dist: gamma

  156,837:
    omega_classes: 6
    omega_shape: 1.2
    omega_dist: gamma

  1902,2454:
    omega_classes: 5
    omega_shape: 2.5
    omega_dist: gamma

The reason nucleotides outside reading frames are not mutating is that they are never assigned to the Event Tree: all of their omega_keys are empty tuples:

https://github.com/PoonLab/HexSE/blob/7fc89565a3a97715c29f5f67059bf499eba5a3e5/hexse/sequence_info.py#L461-L495

As @GopiGugan mentioned: in the nt_in_event_tree function, since nt.omega_keys are all empty tuples, the code breaks out of the if statement and nt.omega_in_event_tree is set to an empty dictionary. The result is nucleotides that are never considered as candidates for mutation:

>>>>> Nucleotide, position C 100 (in ORF)
* Omega keys {'A': ((0, 1, 0, 0, 0),), 'C': None, 'G': ((0, 1, 0, 0, 0),), 'T': ((0, 0, 1, 0, 0),)} 
*Omega in event tree {'A': ((0, 1, 0, 0, 0),), 'G': ((0, 1, 0, 0, 0),), 'T': ((0, 0, 1, 0, 0),)} 

 >>>>> Nucleotide, position C 1550 (Not in ORF)
* Omega keys {'A': (), 'C': None, 'G': (), 'T': ()} 
*Omega in event tree {} 

 >>>>> Nucleotide, position T 2200 (in ORF)
* Omega keys {'A': ((0, 1, 0, 0, 0, 0),), 'C': ((0, 0, 0, 1, 0, 0),), 'G': ((0, 0, 0, 0, 1, 0),), 'T': None} 
*Omega in event tree {'A': ((0, 1, 0, 0, 0, 0),), 'C': ((0, 0, 0, 1, 0, 0),), 'G': ((0, 0, 0, 0, 1, 0),)} 

 >>>>> Nucleotide, position C 2600 (Not in ORF)
* Omega keys {'A': (), 'C': None, 'G': (), 'T': ()} 
*Omega in event tree {} 
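The debug output above can be reproduced with a minimal stand-in for the omega-key filtering (a simplification for illustration, not the actual nt_in_event_tree code):

```python
# Sketch: keys that are None (the nucleotide's own base) or empty tuples
# (no reading frame covers this site) are skipped, so a nucleotide outside
# every ORF ends up with an empty dict and can never be drawn to mutate.
def omega_in_event_tree(omega_keys):
    return {to_nt: key for to_nt, key in omega_keys.items()
            if key not in (None, ())}

# In an ORF (like position C 100): entries survive
print(omega_in_event_tree({'A': ((0, 1, 0, 0, 0),), 'C': None,
                           'G': ((0, 1, 0, 0, 0),), 'T': ((0, 0, 1, 0, 0),)}))
# Outside any ORF (like position C 1550): the result is {}
print(omega_in_event_tree({'A': (), 'C': None, 'G': (), 'T': ()}))
```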
horaciobam commented 2 years ago

Fixed on commit https://github.com/PoonLab/HexSE/commit/57e6fac17c271b4694ec51bed0dd4c3e959a179f