bodkan / slendr

Population genetic simulations in R 🌍
https://bodkan.net/slendr
Other
54 stars 5 forks source link

Change the plot_model layout algorithm #135

Closed bodkan closed 1 year ago

bodkan commented 1 year ago

As soon as more complex demographic models were implemented using slendr, a clear and very annoying issue popped up.

Consider the following random model:

o <- population("outgroup", time = 1, N = 1000)
pop_ancA <- population("pop_ancA", time = 100, parent = o, N = 3000, remove = 200)
popA1 <- population("popA1", time = 200, parent = pop_ancA, N = 400)
popA2 <- population("popA2", time = 200, parent = pop_ancA, N = 10000)
pop_ancB <- population("pop_ancB", time = 500, parent = o, N = 7000, remove = 800)
popB1 <- population("popB1", time = 800, parent = pop_ancB, N = 200)
popB2 <- population("popB2", time = 800, parent = pop_ancB, N = 2000)
popX <- population("popX", time = 801, parent = popB1, N = 1500)
popY <- population("popY", time = 802, parent = popX, N = 8500)
pop_ancC <- population("pop_ancC", time = 700, parent = o, N = 2000, remove = 1000)
popC1 <- population("popC1", time = 901, parent = pop_ancC, N = 550)
popC2 <- population("popC2", time = 901, parent = pop_ancC, N = 7500)

model <- compile_model(
  list(o, pop_ancA, popA1, popA2, pop_ancB, popB1, popB2, popX, popY, pop_ancC, popC1, popC2),
  generation_time = 1, simulation_length = 1000
)

When this is visualized, the following silliness pops up:

plot_model(model)
image

The issue is especially prominent if many split and gene flow events happen close in time.

In my defense, when plot_model() was written, I only used fairly compact models for testing slendr. With more complex models being programmed these days -- either imported via demes->slendr importer written by @IsabelMarleen, models developed and tested as part of the work on our new ABC package, or randomized models -- this is getting more and more annoying.

This PR is an attempt to fix this using a relatively straightforward solution while keeping most of the plot_model code intact (that took a lot of ggplot2 black magic and I'd rather not touch that unless I absolutely have to).

The main idea is to solve most of the most apparent issues which are hindering the role of plot_model() for debugging and inspection of models, not more (a graph can always be made prettier but that's not the goal here).

bodkan commented 1 year ago

The issue comes down to the ordering of populations along the x-axis. The current slendr plot_model algorithm simply sorts populations according to the split time, and arranges the population polygons in sort of a cascading layout going from left to right (picture above).

My first idea was to switch this ordering algorithm to one which orders population not according to the split time, but according to the in-order traversal principle. Implementation is here.

When ran on the example model above, it seems to be working OK?

plot_model(model)
image
plot_model(model, sizes = FALSE)
image
bodkan commented 1 year ago

A couple of more tests using models that @IsabelMarleen implemented for her thesis. All of those were completely unreadable with the old plot_model() but look quite nice now. (Except for the gene flow arrows which... I'm not entirely sure how to fix for extremely complex migration models and probably won't be tackling here.)

image

Previous version of this was next to useless:

image image image
bodkan commented 1 year ago

Hmm, another test -- what if there are multiple ancestral populations (i.e. populations without a parental population?).

o1 <- population("outgroup1", time = 1, N = 5000)
pop_ancA <- population("pop_ancA", time = 100, parent = o1, N = 3000, remove = 200)
popA1 <- population("popA1", time = 200, parent = pop_ancA, N = 400)
popA2 <- population("popA2", time = 200, parent = pop_ancA, N = 10000)
o2 <- population("outgroup2", time = 1, N = 7000, remove = 800)
popB1 <- population("popB1", time = 800, parent = o2, N = 200)
popB2 <- population("popB2", time = 800, parent = o2, N = 2000)
popX <- population("popX", time = 801, parent = popB1, N = 1500)
popY <- population("popY", time = 802, parent = popX, N = 8500)
pop_ancC <- population("pop_ancC", time = 700, parent = o1, N = 2000, remove = 1000)
popC1 <- population("popC1", time = 901, parent = pop_ancC, N = 550)
popC2 <- population("popC2", time = 901, parent = pop_ancC, N = 7500)

model <- compile_model(
  list(o1, o2, pop_ancA, popA1, popA2, popB1, popB2, popX, popY, pop_ancC, popC1, popC2),
  generation_time = 1, simulation_length = 1000
)

plot_model(model)

Old plot_model():

image

New plot_model() from this PR:

image

Definitely more useful.

I am kind of bothered by the overlapping population labels which can also obscure the splitting lineages (observe the purple "popB1" overlapping with darkish blue "popX", and also obscuring where is the split happening from-to and when).

I wonder if labels of populations which are not themselves parents of something could be plotted on the bottom? The overlap will always be there, unfortunately, but at least the split of lineages will be visible?

bodkan commented 1 year ago

It was also appropriate to fix #131 here.

bodkan commented 1 year ago

I am kind of bothered by the overlapping population labels which can also obscure the splitting lineages (observe the purple "popB1" overlapping with darkish blue "popX", and also obscuring where is the split happening from-to and when).

I wonder if labels of populations which are not themselves parents of something could be plotted on the bottom? The overlap will always be there, unfortunately, but at least the split of lineages will be visible?

OK, I ended up placing labels of populations which:

along the bottom of the output.

image

I think this looks much nicer than the first draft version:

In particular in the model which has many split events happening quickly in sequence, the splits themselves are no longer obscured by the labels (image below). Not super clear but much clearer. Maybe distributing colors of lineages not in a gradient with colors closer the closer the populations are along the x-axis would help? Although significant jumps in color wouldn't look that nice. I think I'll keep things this way, for now.

image

I have a feeling something might be still off with plotting on a log10-scale y-axis. I will think about that tomorrow.

That said, I think we're there, @IsabelMarleen! I will ping you again once I merge the PR (tomorrow, I think). You'll then be able to update figures in your thesis.

codecov-commenter commented 1 year ago

Codecov Report

Merging #135 (03444e5) into main (109732f) will increase coverage by 0.10%. The diff coverage is 98.18%.

@@            Coverage Diff             @@
##             main     #135      +/-   ##
==========================================
+ Coverage   83.77%   83.88%   +0.10%     
==========================================
  Files           6        6              
  Lines        3076     3103      +27     
==========================================
+ Hits         2577     2603      +26     
- Misses        499      500       +1     
Impacted Files Coverage Δ
R/visualization.R 64.12% <98.18%> (+2.37%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

bodkan commented 1 year ago

A few minor updates and fixes (in particular, y-axis log10 time scale turned out to be broken for forward-time models). I also changed the color of the "population split lines" to match the population that is diverging from its parent population, rather than the other way around.

Screenshot 2023-04-25 at 13 46 54

I could imagine tweaking and aligning pixels here and there but... I think two days was more than enough time to spend on this, so I will merge the PR as soon as all CI tests pass.

I will likely decorate this with a v0.6.0 version too as there have been reasonably useful fixes and changes implemented since the last version that would be nice to have on CRAN for everyone out there.