CartwrightLab / dawg

Simulating Sequence Evolution
GNU General Public License v2.0

Dawg outputs sequences containing only gaps when simulating large MSAs with Indels #63

Closed: trongnhanuit closed this issue 2 years ago

trongnhanuit commented 2 years ago

I'm using Dawg (version 2.0.1) to simulate large MSAs with indels, but the output contains only gaps (no nucleotides), and the output sequences are much shorter than those in the MSA simulated by INDELible with the same settings. The details are as follows.

To make sure the issue was not caused by a high deletion rate, I changed the indel rates to 0.15 for insertions and 0.05 for deletions. Dawg returned sequences of 37235 sites containing only gaps.

The issue also occurred when I replicated this simulation on a larger tree (with 1,000,000 tips). Dawg's output sequences contain 35207 sites, all of them gaps. In the same simulation, INDELible outputs sequences with more than 150,000 sites containing both gaps and nucleotides.

However, when I tested smaller trees (e.g., with 10,000 or 1,000 tips), Dawg output sequences containing both gaps and nucleotides, with lengths close to those of the MSAs simulated by INDELible.
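For anyone who wants to check their own output for the same symptom, here is a minimal sketch. It assumes the alignment is written in FASTA format with `-` as the gap character. The function name `summarize_alignment` is hypothetical and not part of Dawg or INDELible.

```python
def summarize_alignment(fasta_text):
    """Summarize FASTA-formatted alignment text.

    Returns (number of sequences, names of gap-only sequences,
    set of distinct sequence lengths).
    Assumes '-' is the gap character (an assumption, not a Dawg guarantee).
    """
    records = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence line: may be wrapped across several lines.
            records[name].append(line)

    seqs = {k: "".join(parts) for k, parts in records.items()}
    # A sequence is "gap-only" if it is non-empty and every character is '-'.
    gap_only = [k for k, s in seqs.items() if s and set(s) <= {"-"}]
    lengths = {len(s) for s in seqs.values()}
    return len(seqs), gap_only, lengths
```

To use it on a file, read the contents into a string and pass it in. A non-empty list of gap-only names, or a length far below what another simulator produces with the same settings, would match the behavior reported here.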

Therefore, I think there is a bug in Dawg when simulating very large MSAs with indels. Could you please take a look? Many thanks,

Cheers, Nhan

jgarciamesa commented 2 years ago

Hello Nhan,

Thank you for your detailed description of the issue and for providing an input file. I wish all issues had that much information.

I believe we have found and fixed the bug responsible for this behavior. Please pull the latest change and let me know if you still encounter any issues.

Best, Juan

trongnhanuit commented 2 years ago

Dear @jgarciamesa, @reedacartwright,
Thank you very much for promptly resolving this issue. Best wishes,

Nhan