blab / nextflu

Real-time tracking of influenza evolution
http://nextflu.org
GNU Affero General Public License v3.0

Problem with pivot and frequency array lengths #123

Closed trvrb closed 7 years ago

trvrb commented 8 years ago

I've been using interp1d to interpolate frequencies between pivots in fitness_model, like so:

https://github.com/blab/nextflu/blob/continuous_prediction/augur/src/fitness_model.py#L86

This works most of the time. However, for the current 3y time window I'm getting an error because the length of self.rootnode.pivots is not equal to the length of node.freq['global']. For the 3y resolution, pivots is 38 in length and freq is 37 in length.
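The mismatch can be reproduced outside augur. scipy's interp1d (used in fitness_model) refuses x and y arrays of different lengths; a numpy-only sketch of the same failure, with hypothetical array sizes mirroring the 38-vs-37 mismatch above:

```python
import numpy as np

# Hypothetical pivot grid and frequency trajectory, one entry short,
# mirroring the 38 pivots vs. 37 frequencies described above.
pivots = np.linspace(2012.0, 2015.0, 38)
freqs = np.zeros(37)

try:
    np.interp(2013.5, pivots, freqs)
except ValueError as err:
    # numpy (like scipy's interp1d) rejects mismatched array lengths
    print("interpolation failed:", err)
```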

If I use a different time window, say --interval 2012.15 2015.15, this works fine and both arrays are 38 in length.

I've been working through bernoulli_frequency, but I can't find where the issue might be.

To be clear:

run src/H3N2_process.py -v 60 --interval 2012.15 2015.15 --prefix H3N2_ --resolution Feb-2015 --start fitness --stop fitness

works, while

run src/H3N2_process.py -v 60 -y 3 --pref H3N2_ --resolution 3y --start HI --stop fitness --estimate_fitness_model

does not.

It would be possible to hack a fix to shorten pivots by one if the arrays are mismatched, but I don't know which end to slice.

trvrb commented 8 years ago

Rerunning now, everything works just fine. I don't know if this was a Jan 1 issue or what. I'll decrease priority.

huddlej commented 7 years ago

I have experienced the same "array length mismatch" bug, caused by too few pivots compared to frequency entries. It seems the issue occurs when the requested interval yields a different number of pivots than the interval used on the first run of augur, because calc_node_frequencies memoizes the global frequencies (see code).

If I modify the H3N2 process module by adding the following code before the call to self.annotate_fitness(), I no longer get mismatches between the number of pivots and the number of frequencies, because frequencies are recalculated every time I run the prediction pipeline. As expected, each run takes much longer.

for node in self.tree.preorder_node_iter():
    if hasattr(node, "freq"):
        del node.freq  # drop memoized frequencies so they are recalculated

Would it make more sense to remove the memoization of the global frequencies from the prediction pipeline, so frequencies are always recalculated to match the given time interval, or to provide an optional keyword argument like "force_recalculate" to the prediction method?
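The keyword-argument option could look roughly like the following. This is a hypothetical, self-contained sketch: Node, Tree, and annotate_fitness here are stand-ins, not augur's actual API, and the only real behavior shown is clearing memoized freq attributes before the pipeline runs.

```python
# Hypothetical sketch of a "force_recalculate" opt-out for frequency
# memoization. Node/Tree/annotate_fitness are stand-in names, not augur's.
class Node:
    def __init__(self, freq=None):
        if freq is not None:
            self.freq = freq

class Tree:
    def __init__(self, nodes):
        self.nodes = nodes
    def preorder_node_iter(self):
        return iter(self.nodes)

def annotate_fitness(tree, force_recalculate=False):
    if force_recalculate:
        # Drop memoized frequencies so downstream code rebuilds them
        # against the currently requested pivot grid.
        for node in tree.preorder_node_iter():
            if hasattr(node, "freq"):
                del node.freq
    # Return how many nodes still carry a (possibly stale) freq attribute.
    return sum(hasattr(n, "freq") for n in tree.nodes)

tree = Tree([Node({"global": [0.1, 0.2]}), Node()])
print(annotate_fitness(tree, force_recalculate=True))  # no stale freqs remain
```

The default (force_recalculate=False) would keep the current fast, memoized behavior, while prediction runs over a different interval could opt in to a full recalculation.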

trvrb commented 7 years ago

Closing in favor of development at nextstrain/augur.