avaughn271 / AdmixtureBayes


error with runMCMC.py: SVD did not converge #2

Open bunop opened 1 year ago

bunop commented 1 year ago

Dear @avaughn271,

We are running AdmixtureBayes on 3 different chains, continuing from a previous simulation, with a command line like this:

python admixturebayes/runMCMC.py \
    --input_file vik_sheeps.treemix.frq.fix.nomiss.gz \
    --outgroup MIR --bootstrap_blocksize 500 --n 100000 \
    --continue_samples mcmc_samples1.csv \
    --result_file chain1.csv

but 2 of the 3 simulations failed with the following error:

total number of SNPs: 34865
Process Process-5:
Traceback (most recent call last):
  File "/home/manunzaa/.conda/envs/admixturebayes/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/manunzaa/.conda/envs/admixturebayes/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/core/software/AdmixtureBayes/admixturebayes/MCMC.py", line 30, in __call__
    self.response_queue.put(self.run_chain(input))
                            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/core/software/AdmixtureBayes/admixturebayes/MCMC.py", line 34, in run_chain
    return basic_chain(start_tree,  self.summaries,  self.posterior_function, self.proposal,  post,  N,  sample_verbose_scheme, i_start_from, temperature, proposal_update, multiplier)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/core/software/AdmixtureBayes/admixturebayes/MCMC.py", line 96, in basic_chain
    new_x,new_post=one_jump(x, post, temperature, posterior_function, proposal, proposal_knowledge_scraper)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/core/software/AdmixtureBayes/admixturebayes/MCMC.py", line 61, in one_jump
    newx,g1,g2,Jh,j1,j2=proposal(x,pks)
                        ^^^^^^^^^^^^^^^
  File "/home/core/software/AdmixtureBayes/admixturebayes/meta_proposal.py", line 95, in __call__
    new_x, forward, backward = self.props[index](x, *args, pks=pks)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/core/software/AdmixtureBayes/admixturebayes/mcmc_proposals.py", line 178, in __call__
    return rescale_constrained(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/core/software/AdmixtureBayes/admixturebayes/mcmc_proposals.py", line 146, in rescale_constrained
    orgs, bi= get_orthogonal_branch_space(new_tree, add_one_column=update_add)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/core/software/AdmixtureBayes/admixturebayes/mcmc_proposals.py", line 77, in get_orthogonal_branch_space
    ad=nullspace(cof)
       ^^^^^^^^^^^^^^
  File "/home/core/software/AdmixtureBayes/admixturebayes/mcmc_proposals.py", line 68, in nullspace
    u, s, vh = svd(A)
               ^^^^^^
  File "/home/manunzaa/.conda/envs/admixturebayes/lib/python3.11/site-packages/scipy/linalg/_decomp_svd.py", line 131, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge

This affects only one of the eight subprocesses of the simulation we launched; however, all the other subprocesses then block their execution and no more results are produced.

Is this a known behavior? Does it depend on our data or our starting conditions? I also see that we are not running your latest version; we are currently using a540033. Is this problem addressed in a newer AdmixtureBayes release?
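
Looking at the traceback, the failure comes from the scipy svd call inside the nullspace helper in mcmc_proposals.py. Purely as an illustration (we have not tested this, and it is not meant to describe your actual implementation), scipy.linalg.svd accepts a lapack_driver argument, and retrying with the slower but more robust 'gesvd' driver is a common fallback when the default 'gesdd' fails to converge; something along these lines:

# Hypothetical sketch only: a retry wrapper around scipy's SVD, following the
# standard SVD-based nullspace recipe. Names and tolerances are illustrative
# and not taken from the actual AdmixtureBayes source.
import numpy as np
from scipy.linalg import svd, LinAlgError

def nullspace_with_fallback(A, atol=1e-13, rtol=0.0):
    A = np.atleast_2d(A)
    try:
        u, s, vh = svd(A)                         # default 'gesdd' driver, fast
    except LinAlgError:
        u, s, vh = svd(A, lapack_driver='gesvd')  # slower, but more robust
    tol = max(atol, rtol * s[0])
    rank = int((s >= tol).sum())
    # rows of vh beyond the numerical rank span the (right) null space of A
    return vh[rank:].conj().T

This is only to show where a fallback could live; we have not tried patching the code ourselves.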

Here is the conda environment we are using.

Thank you in advance for your help.

avaughn271 commented 1 year ago

I have actually never seen this behavior before and cannot replicate it on my test datasets. I'm wondering if it might depend on the dataset being used. If you are comfortable sharing the dataset you are running (vik_sheeps.treemix.frq.fix.nomiss.gz) with me, that might shed some light on the issue. Sorry I can't be of more help. Do you only notice this issue when using the "--continue_samples" option, or do you notice it at other times as well? Similarly, do you notice it if you set "--bootstrap_blocksize 1000", or only when you use "--bootstrap_blocksize 500"?

bunop commented 1 year ago

Thank you for your reply. Regarding your questions: we encountered this behavior with --continue_samples. Essentially, we set up the simulation like you did in your article: we started three chains from scratch, continued these chains with no problems, and then, when we tried to continue the chains again, we hit this problem on chains 1 and 3. We also think it could be related to our data and that the problem may only occur sometimes. For example, we resubmitted chains 1 and 3 (same parameters), and now 3 has completed while 1 has not. We have not tried a different bootstrap block size. As for sharing the data, I need to ask my colleagues for permission.

By the way, I'm testing chain 1 with more threads (16) and with the latest version of your code, and it is still running. I'm also testing your admixturebayes/findOptimalTemps.py script on the same dataset. It is not clear to me when this script will finish, but in the meantime I see that it suggests temperatures that give higher mixing rates between chains than the ones I saw in our previous simulations; for example, after completing one chain in a previous run we had: MCMCMC chain swap rates: ['0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%', '0.0%'].

Is this a signal that we are far from convergence and that we have to tune the temperatures in our simulations? Could the SVD convergence problem arise from the fact that there are no swaps between chains?

Thanks again for your help

avaughn271 commented 1 year ago

Yes, this is a sign that there are no swaps between the chains, which is a little concerning, although I don't know how that could affect the SVD convergence. Of course I understand that you would need to check in with colleagues before sharing data, but it is hard for me to figure out what could be going wrong if I can't replicate this error myself.
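
For intuition on why the swap rates matter: in a textbook Metropolis-coupled MCMC setup (stated here only for illustration, not as a line-by-line description of what AdmixtureBayes does), a proposed swap between chains at temperatures T_i and T_j with current log-posteriors logpost_i and logpost_j is accepted with probability min(1, exp((1/T_i - 1/T_j) * (logpost_j - logpost_i))). If the temperature ladder is too coarse for the data, that exponent is almost always a large negative number and the swap rates sit at 0%:

# Textbook Metropolis-coupled MCMC swap rule, for illustration only
# (not necessarily identical to the AdmixtureBayes implementation).
import math
import random

def accept_swap(log_post_i, log_post_j, temp_i, temp_j):
    """Return True if a proposed swap between chain i (temperature temp_i)
    and chain j (temperature temp_j) is accepted."""
    beta_i, beta_j = 1.0 / temp_i, 1.0 / temp_j
    log_ratio = (beta_i - beta_j) * (log_post_j - log_post_i)
    return random.random() < math.exp(min(0.0, log_ratio))

# Example: adjacent chains with a large posterior gap and a coarse
# temperature ladder essentially never swap.
print(accept_swap(log_post_i=-1000.0, log_post_j=-5000.0, temp_i=1.0, temp_j=10.0))

So all-zero swap rates usually point to temperature spacing that is too wide for your data, which is consistent with findOptimalTemps.py suggesting different temperatures.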