abacus-gene / hhsd

HHSD is a python pipeline for hierarchical heuristic species delimitation using genomic sequence data under the multispecies coalescent model. It drives data analysis using the program bpp.
GNU General Public License v3.0
2 stars 0 forks source link

Resolving discrepancies between gdi1 and gdi2 #1

Closed redgcko7 closed 4 months ago

redgcko7 commented 6 months ago

First of all, thank you for developing this useful program for species delimitation. It has been fun playing around on it with my dataset these past few weeks.

I have a more conceptual question that could potentially lead to an improvement to the program. As the merge and split algorithms are dependent on the thresholds set by the user, I've come across the "awkward solution" (per Leaché et al. 2019) where gdi1 fulfills a threshold to split (or merge), but gdi2 does not, in which case the program decides to split (or merge) the taxa and continue with another iteration. This is due to differing values of theta for each of the populations being compared when calculating the gdi index. I guess my question is, should both gdi1 and gdi2 need to meet the split (or merge) thresholds in order to split (or merge) for the next iteration? I believe this would make more conceptual sense, as the threshold values are meant to differentiate between when we are certain about merging/splitting and when there is ambiguity. When gd1 and gdi2 are on different sides of these thresholds, this suggests a level of ambiguity that is overlooked by the action of the algorithm that, as implemented, only requires one of the two resulting gdi values to meet the threshold.

I'd be interested to hear your thoughts about this.

dkornai commented 5 months ago

Hi, sorry for the late reply.

First, thanks for taking the time to think about the conceptual problems related to the merge and split algorithms. Indeed, ambiguity between gdi values in a taxon pair is an inherent consequence of the way it is defined.

Currently, the program operates in the following way:

def node_pair_decision(
        node_1:         TreeNode,
        node_2:         TreeNode,
        cf_dict:        CfileParam
        ) ->            None: # in-place modification of node attributes

    '''
    Decide to accept or reject a merge/split proposal
    '''

    mode:AlgoMode = cf_dict['mode']
    gdi_threshold = cf_dict['gdi_threshold']

    if   mode == 'merge':
        # if at least one of the 2 populations has low gdi indicating non-species status, merge the nodes
        if gdi_threshold == None or (node_1.gdi <= gdi_threshold) or (node_2.gdi <= gdi_threshold): 
            node_1.species = False; node_1.modified = True
            node_2.species = False; node_2.modified = True
        else:
            node_1.modified = False; node_2.modified = False

    elif mode == 'split':
        # if both populations have high gdi indiciating disctinct species status, split them into 2 species
        if gdi_threshold == None or ((node_1.gdi >= gdi_threshold) and (node_2.gdi >= 0.5)) or ((node_2.gdi >= gdi_threshold) and (node_1.gdi >= 0.5)):
            node_1.species = True;  node_1.modified = True
            node_2.species = True;  node_2.modified = True
        else:
            node_1.modified = False; node_2.modified = False

Previous versions of the program [many years ago, when it was called HMDelimit instead of hhsd ] included an option to specify if one or both of the gdi values should be within the threshold. If this is a feature you think would be useful, we can add it back.

dkornai commented 5 months ago

I added some new features today relating to confidence intervals on the gdi. For example, the output is now:

> Proposal results:

node 1  GDI 1  2.5% HPD  97.5% HPD node 2  GDI 2  2.5% HPD  97.5% HPD  merge accepted?
     A   0.04      0.01       0.13      B   0.03      0.01       0.11             True
     C   0.03      0.01       0.11      D   0.02      0.01       0.09             True

If you have any ideas for how this can be integrated into the process of resolving conflicts, feel free to share!

dkornai commented 4 months ago

The new version of hhsd changes the syntax of gdi_threshold. Users must now supply two threshold values. For example, in a split analysis, we can use gdi_threshold = gt 0.7, gt 0.7, which only accepts a merge if both gdi values are greater than (gt) 0.7. Alternatively, for a merge analysis, gdi_threshold = leq 0.2, leq 1.0 will accept any merge where at least one of the populations is not a unique species, with a gdi less than or equal (leq) to 0.2.