NicolaDM / phastSim

fast sequence evolution simulation for SARS-CoV-2 and similar
GNU General Public License v3.0

"Vanilla" approach (--noHierarchy) took a longer runtime than “Hierarchical” approach (without --noHierarchy) in simulations with discrete rates (--categoryRates) #17

Closed: trongnhanuit closed this issue 2 years ago

trongnhanuit commented 2 years ago

Dear authors, I'm using phastSim v0.0.4. In my simulations with discrete rates (--categoryRates), the "Vanilla" approach (--noHierarchy) took about 4x longer to run than the "Hierarchical" approach, although the "Vanilla" approach was expected to be the faster one in these settings. Could you please help me check this? Many thanks.

1. Execution commands:

2. Input files:

Could someone help me to check it? Many thanks.

Cheers, Nhan

NicolaDM commented 2 years ago

Dear Nhan,

The relative performance of the hierarchical vs the "vanilla" approach depends on several factors. The performance of the vanilla approach worsens with more complex models (more rate variation categories and more complex matrices, e.g. GTR instead of the JC substitution model). On the other hand, the performance of the vanilla approach is usually not affected much by genome size, while the hierarchical approach can be more affected by it. Finally, I don't have access to the tree file you linked, but I assume it contains 100 tips; in this scenario, creating the initial genome search tree might take a sizeable chunk of time compared to actually running the simulations, making the hierarchical approach relatively slower than the vanilla one, which doesn't require this step.

So, in summary, the relative performance of the vanilla approach should improve if you consider a simpler model (e.g. no rate variation), a longer root genome, and a tree with more tips. In general, the vanilla approach is not guaranteed to be faster, which is the main reason why we created the hierarchical approach in the first place!

Cheers, Nicola

trongnhanuit commented 2 years ago

Dear Nicola, Thank you very much for the useful information. In fact, I need to simulate alignments from large trees (with >10,000 tips) with discrete rate variation. Therefore, I'll use the hierarchical approach because it is faster. Many thanks, Cheers, Nhan

NicolaDM commented 2 years ago

Dear Nhan,

Just to be sure: unless the tree you used before already had 10,000 samples, and if you want to make sure you have the fastest approach for your scenario, I would suggest trying both options on a tree with 10,000 samples first. The relative performance of the two algorithms might change when very different numbers of samples are considered.

Cheers, Nicola

trongnhanuit commented 2 years ago

Dear Nicola, In fact, I had already tried with 10K and 100K samples before testing with 100 samples, and found that the hierarchical approach was faster than the vanilla approach in my simulation settings (discrete rate variation, with branch lengths that were not too short). I reported it here to make sure I was using the right approach for my simulations. Many thanks for your support.

Cheers, Nhan

NicolaDM commented 2 years ago

Great! Cheers, Nicola