CompEvol / beast2

Bayesian Evolutionary Analysis by Sampling Trees
www.beast2.org
GNU Lesser General Public License v2.1

*BEAST auto insert empty sequences option #165

Open rbouckaert opened 10 years ago

rbouckaert commented 10 years ago

If not all gene trees have sequences for all species, for each such species a single empty sequence could be added to the alignment, and thus the gene tree, to create a valid *BEAST analysis.
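
A minimal sketch of the idea, not BEAST or BEAUti code. It assumes each gene alignment is a plain dict from taxon name to sequence, and that a hypothetical `taxon_to_species` map records which species each taxon belongs to. The function name, the `_dummy` suffix, and the use of `?` as the missing-data character are assumptions made for illustration.

```python
def add_dummy_sequences(alignment, taxon_to_species, all_species, missing="?"):
    """Return a copy of `alignment` with one all-missing sequence per absent species.

    alignment: dict mapping taxon name -> aligned sequence string.
    taxon_to_species: dict mapping taxon name -> species name.
    all_species: iterable of every species in the species tree.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences in an alignment must have equal length")
    site_count = lengths.pop()

    # Species that already have at least one sequence in this gene.
    present = {taxon_to_species[taxon] for taxon in alignment}
    padded = dict(alignment)
    for species in sorted(set(all_species) - present):
        # Hypothetical naming scheme for the placeholder taxon.
        padded[f"{species}_dummy"] = missing * site_count
    return padded


gene1 = {"human_1": "ACGT", "chimp_1": "ACGA"}
species_of = {"human_1": "human", "chimp_1": "chimp"}
padded = add_dummy_sequences(gene1, species_of, ["human", "chimp", "gorilla"])
print(padded["gorilla_dummy"])  # ????
```

In a real analysis the placeholder taxon would also have to be registered in the taxon-to-species mapping so that the gene tree gets a tip for that species; the sketch leaves that step out.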

alexeid commented 10 years ago

Great idea! As a general rule, such modifications to data/model should be transparent to the user and made with the user’s consent. So a dialog box listing the gene trees to be modified and the species for which dummy sequences will be added, with (Okay) and (Cancel) options, would be good.
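
The content of such a dialog could be computed before anything is changed. The following is a hedged sketch with plain dicts standing in for alignments; `find_missing_species` and `consent_message` are hypothetical helper names, not part of BEAUti.

```python
def find_missing_species(alignments, taxon_to_species, all_species):
    """Map each gene name to the sorted list of species it has no sequence for."""
    missing = {}
    for gene, alignment in alignments.items():
        present = {taxon_to_species[taxon] for taxon in alignment}
        absent = sorted(set(all_species) - present)
        if absent:
            missing[gene] = absent
    return missing


def consent_message(missing):
    """Build the text a confirmation dialog could show before any change is made."""
    lines = ["Dummy sequences would be added to the following gene trees:"]
    for gene in sorted(missing):
        lines.append(f"  {gene}: {', '.join(missing[gene])}")
    return "\n".join(lines)


alignments = {
    "gene1": {"human_1": "ACGT", "chimp_1": "ACGA"},
    "gene2": {"human_2": "GGCC", "chimp_2": "GGCA", "gorilla_2": "GGCT"},
}
species_of = {
    "human_1": "human", "chimp_1": "chimp",
    "human_2": "human", "chimp_2": "chimp", "gorilla_2": "gorilla",
}
missing = find_missing_species(alignments, species_of, ["human", "chimp", "gorilla"])
print(consent_message(missing))
```

Only if the user accepts would the alignments actually be modified; on cancel, the `missing` report is simply discarded.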

rbouckaert commented 10 years ago

Not sure whether this should be a BEAUti option or a BEAST option.

The case for BEAUti is that it can get consent from the user. However, if the taxonset is changed afterwards, the dummy sequences still linger, so we would also need a mechanism for removing dummy sequences.

On the other hand, adding sequences in BEAST means we have to reinitialise the alignment and the corresponding tree as well, which requires a bit more bookkeeping.

Remco
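
One way the removal mechanism could look, as a sketch only: it assumes the tool keeps a record (`dummy_taxa`, a hypothetical set) of which taxa were added as dummies, and treats a dummy as stale once its species has left the analysis or has gained a real sequence.

```python
def remove_stale_dummies(alignment, dummy_taxa, taxon_to_species, all_species):
    """Return a copy of `alignment` without dummy sequences that are no longer needed."""
    # Species represented by at least one real (non-dummy) sequence.
    real_species = {
        taxon_to_species[taxon] for taxon in alignment if taxon not in dummy_taxa
    }
    kept = {}
    for taxon, seq in alignment.items():
        if taxon in dummy_taxa:
            species = taxon_to_species[taxon]
            if species not in all_species or species in real_species:
                continue  # stale dummy: leave it out of the result
        kept[taxon] = seq
    return kept


alignment = {
    "human_1": "ACGT",
    "gorilla_1": "ACGG",        # real sequence added after the dummy was created
    "gorilla_dummy": "????",
    "chimp_dummy": "????",
}
dummy_taxa = {"gorilla_dummy", "chimp_dummy"}
species_of = {
    "human_1": "human", "gorilla_1": "gorilla",
    "gorilla_dummy": "gorilla", "chimp_dummy": "chimp",
}
cleaned = remove_stale_dummies(
    alignment, dummy_taxa, species_of, {"human", "chimp", "gorilla"}
)
print(sorted(cleaned))  # ['chimp_dummy', 'gorilla_1', 'human_1']
```

Tracking dummies explicitly, rather than guessing from all-missing sequences, avoids deleting a genuine sequence that happens to be entirely missing data.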

alexeid commented 10 years ago

I would advocate strongly for BEAUti. BEAST should never change anything about model/data.

The BEAST input XML is the definitive description of the analysis and it should be totally clear from reading the XML exactly what the analysis will be.

BEAST should do nothing but follow the instructions in that XML :)

BEAUti on the other hand is a tool to help construct sensible input XMLs.

An alternative solution to this problem would be to change the *BEAST implementation to directly handle gene trees that are missing some species. This is of course technically achievable, but when I spoke to Joseph about it he said it would involve some large changes to the implementation. It sounds like it would be a serious piece of research and would undoubtedly require substantial tests and simulations to verify.

rbouckaert commented 10 years ago

I see where you are coming from: alignments are data, thus should be treated as sacred.

However, there is a bit of a grey area here: BEAST does initialise a number of state nodes, for example the tree when initialised as random-tree, rate indicators in relaxed clocks, etc. The StarBeastStartState even sets values for birth rate and pop sizes. This means that data as specified in the XML can be changed by BEAST.

Setting a flag in BEAUti that says "add dummy sequences, if required" seems a viable option to me, and to some extent it follows what you say, "BEAST should do nothing but follow the instructions in that XML", since the XML will tell BEAST to add dummy sequences.

What worries me is that letting BEAUti do this may not be as robust as letting BEAST sort out the dummy sequences.

Perhaps the addition of dummy sequences should only happen when saving the BEAST specification in BEAUti?

alexeid commented 10 years ago

Perhaps this would be sufficient:

(1) an explicit option in the XML, and (2) an informative message in the BEAST standard output detailing exactly which dummy sequences were added.
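
A sketch of how these two points could combine, under stated assumptions: `enabled` stands in for the explicit option (its real name and location in the XML are not specified in this thread), and `added` is a hypothetical record of which species received a dummy sequence in which gene.

```python
def report_dummy_sequences(added, enabled):
    """Return the lines to write to standard output for a run.

    added: dict mapping gene name -> list of species that received a dummy sequence.
    enabled: value of the (hypothetical) "add dummy sequences" option.
    """
    if not enabled:
        return []  # option off: nothing was added, so nothing is reported
    if not added:
        return ["Dummy sequences enabled, none required."]
    lines = []
    for gene in sorted(added):
        for species in added[gene]:
            lines.append(
                f"Added dummy sequence for species '{species}' to gene '{gene}'."
            )
    return lines


for line in report_dummy_sequences({"gene1": ["gorilla"]}, enabled=True):
    print(line)
```

Returning the lines instead of printing inside the function keeps the message easy to check in a test and to redirect to a log file.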

In general, maybe we need a mechanism to allow "Are you sure?" type popups (with a specific detail message) in BEAUti that can be configured to be triggered when checkboxes are selected or deselected?

Then when, for example, the user checks the "add dummy sequences when necessary" box in BEAUti, a dialog box can be displayed with some details of what this entails and why it is done.
