Open mdhall272 opened 1 day ago
Hi, Matthew,
Excited to see you're putting Delphy through its paces!
When I ran this, the problem seemed to come from the coalescent prior. Delphy currently only supports an exponentially growing population of the form $N(t) = N_0 \exp(g t)$. The currently hard-coded moves for the population growth rate $g$ propose $g + \Delta g$, where $\Delta g$ is uniformly distributed in the range $\pm 1/(1\,{\rm year})$ (we're targeting emerging outbreaks, where doubling times on the order of weeks or months are reasonable). However, since your tree seems to go back about 40 years, exponential growth over those timescales is explosive even with tiny rates, which seems to be triggering numerical issues (e.g., if $g = 1/(1\,{\rm year})$, then $\exp(g \times 40\,{\rm years}) \approx 2.4 \times 10^{17}$). We currently have no way of specifying a different step size or bounds on $g$.
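To make the scale of the problem concrete, here is a minimal Python sketch of the arithmetic above. It is not Delphy's code: the function names `population_ratio` and `propose_growth_rate` are hypothetical, and the proposal only mimics the uniform $\pm 1/(1\,{\rm year})$ step described above.

```python
import math
import random


def population_ratio(g_per_year: float, span_years: float) -> float:
    """Return exp(g * span), the factor by which N(t) = N_0 exp(g t)
    changes over a time span of `span_years`."""
    return math.exp(g_per_year * span_years)


def propose_growth_rate(g_per_year: float, rng: random.Random) -> float:
    """Hypothetical random-walk proposal: g + dg, with dg ~ Uniform(-1, +1) per year.
    This only illustrates the step size described in the text."""
    return g_per_year + rng.uniform(-1.0, 1.0)


if __name__ == "__main__":
    # Outbreak-scale span (0.5 years) versus a ~40-year tree
    for span in (0.5, 40.0):
        for g in (0.1, 0.5, 1.0):
            ratio = population_ratio(g, span)
            print(f"span = {span:5.1f} y, g = {g:.1f}/y -> exp(g*span) = {ratio:.3e}")

    # A single proposal step can move g by up to 1/year in either direction
    rng = random.Random(0)
    print("proposed g from g = 0:", propose_growth_rate(0.0, rng))
```

Over a six-month span the factor stays of order one for all three rates, whereas over 40 years a single proposal step of size 1/year is already enough to push it to about $10^{17}$.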
If you go to Advanced Options, you can fix the population growth rate to 0, which gives you a constant-population coalescent. That seems to behave reasonably well for this dataset. The convergence is a bit slow because you're leaving the realm of genomic epi datasets (your tree seems to have ~10k mutations for 100 tips, but in my mind, these two numbers are comparable in "genomic epi" datasets). But if you suspect that, on convergence, the results are outright wrong, please let us know! I will not be surprised if you find ways of breaking Delphy in its current state by using inputs that are quite different from what we've been using for development and testing.
Thanks also for the note on CSV vs TSV: we'll fix this!
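As a rough illustration of why $g = 0$ is the well-behaved case, the sketch below computes the integral of $1/N(t)$ over a time interval, which is the quantity that enters a coalescent prior for $N(t) = N_0 \exp(g t)$. The time convention ($t$ in years, $t = 0$ at the most recent tip, negative in the past) and the function name are assumptions made for this example, not Delphy's actual parameterization or code.

```python
import math


def integrated_inverse_population(t_start: float, t_end: float,
                                  n0: float, g: float) -> float:
    """Integral of 1/N(t) dt over [t_start, t_end], with N(t) = n0 * exp(g * t).

    Assumed convention: t in years, t = 0 at the most recent tip, t < 0 in the past.
    """
    if t_end < t_start:
        raise ValueError("t_end must not be earlier than t_start")
    if g == 0.0:
        # Constant population: the integrand is simply 1/n0
        return (t_end - t_start) / n0
    # (exp(-g*t_start) - exp(-g*t_end)) / (g*n0), written with expm1 so that
    # small |g| does not suffer from catastrophic cancellation
    return math.exp(-g * t_end) * math.expm1(g * (t_end - t_start)) / (g * n0)


if __name__ == "__main__":
    n0 = 10.0
    for g in (0.0, 1e-9, 0.1, 1.0):
        value = integrated_inverse_population(-40.0, 0.0, n0, g)
        print(f"g = {g:g}/y -> integral over 40 years = {value:.6e}")
```

With $g = 0$ the integral is just the interval length divided by $N_0$, while for $g = 1/(1\,{\rm year})$ over 40 years it is of order $10^{16}$ for the same $N_0$.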
(As a side note, your data gives me an idea for improving the mdSPR moves to work well even when all the branches have large numbers of mutations. Our SPR moves currently fix the mutational history everywhere on the tree except on the branch that is moving. When you're trying to attach onto a branch with lots of mutations, you have to be quite lucky for the mutations to be ordered just so, with common mutations tending to come early and mutations unique to the two children of the attachment point coming afterwards. That's the main reason convergence slows way down when branches have lots of mutations. However, you could instead allow all three branches impinging on the attachment point to have their mutations rewired (we had toyed around with this in 2023, but missing data makes that technically complicated, so we simplified to the current scheme). When mdSPR uses parsimony/Jukes-Cantor to estimate the cost of attaching a subtree at any given point on the remaining tree, allowing rewiring of the mutations on those three branches should still result in a simple expression for the proposal probability, and that should open the door to mdSPR moves that are effective for datasets like yours. Food for thought...)
Thanks - that makes sense and I did note that the tree prior was exponential growth, which obviously isn't right for HIV (which this is). I thought I would give it a go anyway. Where are "Advanced Options"?
Good to hear!
For Advanced Options, look here:
Then here:
As an aside, at the moment, the "Advanced Options" are really the absolute minimal complement of model configuration we can get away with while still being useful to people in public health responding to an emerging outbreak (honestly, we still need to add a few more, notably a flexible population model!).
When I try to run the attached file, the log posterior trace rapidly heads for the stratosphere (2.5E87!) and then seems to stick on a single value without changing. I don't think this is actually what is happening, because when I download the trace files the log posterior is instead of the order of -200000.
Also, by the by, the button says "Export Traces to CSV" but the downloaded file is a TSV.
github_example.fasta.zip