abacus-gene / hhsd

HHSD is a python pipeline for hierarchical heuristic species delimitation using genomic sequence data under the multispecies coalescent model. It drives data analysis using the program bpp.
GNU General Public License v3.0
2 stars 0 forks source link

no split accepted #11

Closed JimStarrett closed 1 month ago

JimStarrett commented 1 month ago

Hello, I'm trying to understand some preliminary results with my dataset where the merge analysis is retaining 13 species (ie not merging any groups), but the split analysis is not accepting even the deepest split in the tree. The dataset is for a highly structured species complex, so no surprise about not merging any groups, but I would expect the lower bound on number of species to be more than 1. The data has >100 phased loci for >100 individuals. Post-burnin values look reasonable in Tracer.

It's not clear to me what about the data is influencing the gdis for the initial split to not meet the threshold. Any suggestions of what to look for in the results files?

thank you

dkornai commented 1 month ago

Hi, can you please supply your data, IMAP, and contro file? It is very difficult to say anything definitive without being able to reproduce your analysis.

As a debug step, I recommend you run the program in gdi estimation mode [where all merge or split proposals are accepted] by setting gdi_threshold = None in the control file, and seeing the resulting gdi values.

JimStarrett commented 1 month ago

Aty_hhsd_60lump_IMAP.txt

cf_Aty_split.txt

Atypoides_hhsd_935p_phased_renamed.phy.zip

dkornai commented 1 month ago

Thanks for the files, I modified your control file to lower the execution time, first by manually providing tau and theta priors (this sidesteps the automatic inference of these priors, which is quite costly on such a large alignment file), and also by lowering the number of loci being used in the calculations to 8. I also lowered the number of samples, and removed the lines corresponding to migration, since these were commented out by you.

I set the thresholds such that all splits will be accepted just to see what the ballpark gdi values are.

# output folder
output_directory = results

# input files
Imapfile = Aty_hhsd_60lump_IMAP.txt
seqfile  = Atypoides_hhsd_935p_phased_renamed.phy

# guide tree
guide_tree = (((((G,H),(I,J)),(K,(M,L))),(F,(E,D))),(A,(B,C)));

# hierarchical algorithm settings
mode = split
gdi_threshold = >=0.0, >=0.0

# BPP MCMC settings
threads = 8
burnin =  500
nsample = 1000
nloci = 8
tauprior = 3 0.01
thetaprior = 3 0.01 e

If you run with this control file, it will give you gdi values for all groups eventually, and you can check if the rest of the values make sense.

I got the following results for the first split:

< Analysis started >

> Starting state of split analysis

Number of species in starting delimitation:  1
'GHIJKMLFEDABC'

--GHIJKMLFEDABC

*** Iteration 1 ***

> Estimated tau and theta parameters:

                  theta  2.5% HPD  97.5% HPD       tau  2.5% HPD  97.5% HPD
ABC            0.025282  0.022110   0.029684                               
GHIJKMLFED     0.102492  0.093182   0.110447                               
GHIJKMLFEDABC  0.018354  0.012818   0.026323  0.010411  0.009639   0.011213

> Proposal results:

    node 1  GDI 1  2.5% HPD  97.5% HPD node 2  GDI 2  2.5% HPD  97.5% HPD  split accepted?
GHIJKMLFED   0.18      0.17        0.2    ABC   0.56      0.51       0.63             True

Number of species after iteration 1:  2
'GHIJKMLFED', 'ABC'

   /-GHIJKMLFED
--|
   \-ABC
...

The gdi for population A in the case two populations A and B with no migration is calculated using the following formula:

$$ gdi = 1 - e^{-2\tau_{AB} / \theta_A} $$

so low gdi values at the first split could be due to either the tau or theta parameters. Do you have any idea if the inferred values match previous literature results about the age and genetic diversity of the group you are working on?

In fact, if you have some literature derived ballpark values for these figures, feel free to set them manually as priors. The manual has a section on how to do this.

Finally, the "reference" gdi thresholds given in previous papers or our paper are somewhat arbitrary (even though they were calibrated on a significant number of cases). It may be the case that they are not appropriate for the spiders you are working on. Perhaps try the program on some closely related groups where species status is known with high confidence, and check if the inferred gdi values match those biologically grounded expectations.

However, since there does not seem to be an issue related to the hhsd software itself, I will close this thread.