ekirving / qpbrute

Heuristic search algorithm for fitting qpGraph models
MIT License
9 stars 3 forks source link

qpBrute RuntimeError #9

Closed dbssyck closed 3 years ago

dbssyck commented 3 years ago

Dear Evan,

I am encountering an error when running qpBrute which cause I can't quite figure out. I have conducted multiple runs of qpBrute with the same platform with no issues prior to this. With my most recent dataset, qpBrute was running fine for ~4 days and froze half way through, only to show a RuntimeError after about 10 hours of being stuck. I reran qpbrute on the same dataset and the same behaviour happened (again no problems running for couple of days, then stuck and RuntimeError again).

Below is the error message together with the last few lines of the qpBrute log.

(Out,(a1,(a2,((a3,a4),(Gul,(Sib,(Gri,((Ran)a3,(Dau,(Fer)a4)a2))a1)))))); nodes=7 admix=4 outliers=124 worst=-19.680 4240a89 (Out,(a1,((a2,a3),(a4,(Gul,(Sib,(Gri,((Ran)a4,(Dau,(Fer)a3)a2))a1)))))); nodes=7 admix=4 outliers=108 worst=-24.342 21853a0 (Out,(a1,((a2,a3),((a4,(Fer)a3),(Gul,(Sib,(Gri,((Ran)a4,(Dau)a2))a1)))))); nodes=7 admix=4 outliers=111 worst=-13.417 330cbf6 Traceback (most recent call last): File "qpbrute.py", line 711, in <module> print_offset=argv.print_offset) File "qpbrute.py", line 663, in permute_qpgraph pq.find_graph() File "qpbrute.py", line 615, in find_graph self.recurse_tree(root_tree, self.nodes[1], self.nodes[2:]) File "qpbrute.py", line 104, in recurse_tree node_placed = self.check_results(results, remaining, depth) File "qpbrute.py", line 196, in check_results self.recurse_tree(new_tree, remaining[0], remaining[1:], depth + 1) File "qpbrute.py", line 143, in recurse_tree node_placed = self.check_results(results, remaining, depth) File "qpbrute.py", line 196, in check_results self.recurse_tree(new_tree, remaining[0], remaining[1:], depth + 1) File "qpbrute.py", line 143, in recurse_tree node_placed = self.check_results(results, remaining, depth) File "qpbrute.py", line 196, in check_results self.recurse_tree(new_tree, remaining[0], remaining[1:], depth + 1) File "qpbrute.py", line 143, in recurse_tree node_placed = self.check_results(results, remaining, depth) File "qpbrute.py", line 196, in check_results self.recurse_tree(new_tree, remaining[0], remaining[1:], depth + 1) File "qpbrute.py", line 140, in recurse_tree results = self.test_trees(admix_trees, depth) File "qpbrute.py", line 169, in test_trees results = pool.map(self.run_qpgraph, itertools.izip(new_trees, itertools.repeat(depth))) File "/home/user/anaconda3/envs/qpbrute/lib/python2.7/site-packages/pathos/multiprocessing.py", line 137, in map return _pool.map(star(f), zip(*args)) # chunksize File "/home/user/anaconda3/envs/qpbrute/lib/python2.7/site-packages/multiprocess/pool.py", line 253, in map return self.map_async(func, iterable, chunksize).get() File "/home/user/anaconda3/envs/qpbrute/lib/python2.7/site-packages/multiprocess/pool.py", line 572, in get raise self._value RuntimeError

The commands I used to run qpbrute was as follows.

`cd /path/to/qpbrute

python qpbrute.py \ --par admixgraph.angsd_10.par \ --prefix ANGSD_10_GuSiFeDaRaGr_outpopNULL \ --pops Gul Sib Fer Dau Ran Gri \ --out Out \ --threads 8`

I would be happy to share the .par and .ind files if they would help as well.

Thank you!

ekirving commented 3 years ago

Can you please try running with --treads 1 and post the error message again? A quirk of the way multithreading has been implemented is that the location of the errors are sometimes obscured.

You may also wish to update qpBrute to the latest version.

However, in both cases, the existing cache is invalidated and you will need to start the graph search from the beginning.

dbssyck commented 3 years ago

Thanks for the advice.

I ran the same command with --threads 1 on the same dataset (with the same qpbrute version, without upgrading) and the error message, together with the last 2 lines of the qpbrute log, was as follows.

(Out,((a1,(Fer)a2),(a3,((a4,a2),(Gul,(Sib,(Gri,((Ran)a4,(Dau)a3))a1)))))); nodes=7 admix=4 outliers=95 worst=-13.417 4d4d19f (Out,((a1,(Fer)a2),((a3,a2),(a4,(Gul,(Sib,(Gri,((Ran)a4,(Dau)a3))a1)))))); nodes=7 admix=4 outliers=87 worst=14.228 059ae73 Traceback (most recent call last): File "qpbrute.py", line 711, in <module> print_offset=argv.print_offset) File "qpbrute.py", line 663, in permute_qpgraph pq.find_graph() File "qpbrute.py", line 615, in find_graph self.recurse_tree(root_tree, self.nodes[1], self.nodes[2:]) File "qpbrute.py", line 104, in recurse_tree node_placed = self.check_results(results, remaining, depth) File "qpbrute.py", line 196, in check_results self.recurse_tree(new_tree, remaining[0], remaining[1:], depth + 1) File "qpbrute.py", line 143, in recurse_tree node_placed = self.check_results(results, remaining, depth) File "qpbrute.py", line 196, in check_results self.recurse_tree(new_tree, remaining[0], remaining[1:], depth + 1) File "qpbrute.py", line 143, in recurse_tree node_placed = self.check_results(results, remaining, depth) File "qpbrute.py", line 196, in check_results self.recurse_tree(new_tree, remaining[0], remaining[1:], depth + 1) File "qpbrute.py", line 143, in recurse_tree node_placed = self.check_results(results, remaining, depth) File "qpbrute.py", line 196, in check_results self.recurse_tree(new_tree, remaining[0], remaining[1:], depth + 1) File "qpbrute.py", line 140, in recurse_tree results = self.test_trees(admix_trees, depth) File "qpbrute.py", line 174, in test_trees result = self.run_qpgraph((new_tree, depth)) File "qpbrute.py", line 289, in run_qpgraph env={'OMP_NUM_THREADS': '1'}) File "/path/to/qpbrute-master/utils.py", line 52, in run_cmd raise RuntimeError(err) RuntimeError

ekirving commented 3 years ago

Thanks for the updated stack trace.

The error is coming from the external call to qpGraph and not from inside qpBrute itself. However, no error message is being return by qpGraph, which makes it hard to debug.

Can you please attach the log file for the last graph that was run?

You should be able to find this log file by running the following command ls -ltr graphs/*.log. It should be last log file.

dbssyck commented 3 years ago

Attached is the .log file from the last run, 059ae73. There strangely seems to be no error message. This was the last .log file when I ls -ltr in /graphs.

ANGSD_10_GuSiFeDaRaGr_outpopNULL_t1-059ae73.log

ekirving commented 3 years ago

It looks like that is the the log file from the last successful model (as 059ae73 appears just before the error) and that the error your are experiencing was not logged anywhere.

I've pushed a small change to the github repository which should make debugging these issues easier in the future, but because you're running an older version this will not help solve your problem.

However, if you make a small change to your local copy of the code then we should be able to discover why qpGraph is crashing.

In utils.py please change the following line:

def run_cmd(cmd, shell=False, stdout=None, stderr=None, env=None, verbose=False):

to this:

def run_cmd(cmd, shell=False, stdout=None, stderr=None, env=None, verbose=True):

and then restart the graph search.

When it returns to the problematic model, it will print the external qpGraph command that is being run. From this command, you should be able replicate the error and (hopefully) see why qpGraph is crashing.

ekirving commented 3 years ago

Not sure what happened to your earlier comment (looks like it was deleted), but I'll respond briefly just in case it's still useful.

The convertf software is not a dependency for running qpBrute so it's not included in the conda environment. If you want to use convertf to prepare your data then I recommend installing it via conda install -c bioconda eigensoft

Regarding the error RuntimeError: ERROR: 'b'fatalx:\nzero popsize: Gul\n''; RETCODE:-6, this seems to indicate that the population Gul has no samples present in your input data. Perhaps there is a typo in the population name?

dbssyck commented 3 years ago

My bad, the latest comment was indeed because of a typo. Realised my mistake immediately after commenting and hence deleted it. I'm currently in the process of running the same dataset on release 0.2 and will keep you updated on how it goes (fingers crossed!).

Thanks for the clarification regarding convertf as well.

ekirving commented 3 years ago

Okay, great. I'll close this issue for now, but feel free to reopen if necessary.

dbssyck commented 3 years ago

ANGSD_10_GuSiFeDaRaGr_error.log ANGSD_10_GuSiFeDaRaGr_outpopNULL.log

Hi Evan,

Sorry, it seems I have to indeed reopen this thread. This was ran on release 0.2 and again, the error doesn't appear immediately. It runs for several graphs, then stops halfway through. (shouldn't be an installation issue, the example file ran properly and with expected results). Attached is the qpbrute log and the error message.

ekirving commented 3 years ago

Thanks for reporting this new issue!

I've had a look at the logs you supplied and it seems this is caused by a race condition in which the pdf is being printed twice, which causes the cleanup step to fail because two separate threads are trying to delete the same temporary file.

I've pushed a bug fix for this issue to github. The simplest way to update is via pip, from inside you conda environment: pip install --upgrade --no-deps --force-reinstall git+git://github.com/ekirving/qpbrute.git

ekirving commented 3 years ago

Did that update fix the issue you were having?

dbssyck commented 3 years ago

Hi Evan,

Thanks so much for checking in and sorry I haven't been able to update you. My qpBrute run has been going on without problems since the update 6 days ago (and is still actively running as I type), so I think my answer is "Yes", but I'm not fully sure yet. I will inform you once the run is completed.

dbssyck commented 3 years ago

Hi Evan,

Still not getting there...I've attempted to run my dataset from my University cluster to speed up the process (the example file ran with the expected result - don't think it's an installation issue).

Unfortunately my own dataset is still facing problems. The errors logs and output log are attached.

Thanks again for your help.

error.log ANGSD_10_GuSiFeDaRaGr_outpopNULL.log

ekirving commented 3 years ago

This is rather odd, as the error log indicates that model 3da94d1b3f58 is the cause of the problem, but the main log file indicates that that model has been run successfully before.

Can you please attach the .graph file (and the .log file if present) for 3da94d1b3f58?

dbssyck commented 3 years ago

3da94d1b3f58.log

3da94d1b3f58_graph.txt

Attached are the respective files.

ekirving commented 3 years ago

Thanks for sending through those two files. They both look fine, which prompted me to re-examine the original log file. What I've noticed is that it appears that the graph search is being run twice simultaneously.

INFO: There are 720 possible starting orders for the given nodes.
INFO: Performing an exhaustive search.
INFO: Starting list ['Gul', 'Sib', 'Fer', 'Dau', 'Ran', 'Gri']
INFO: There are 720 possible starting orders for the given nodes.
INFO: Performing an exhaustive search.
INFO: Starting list ['Gul', 'Sib', 'Fer', 'Dau', 'Ran', 'Gri']
(Out,(Gul,Sib));                                                                    nodes=3  admix=0     outliers=0  worst=0.013     3eaff1860734
(Out,(Gul,Sib));                                                                    nodes=3  admix=0     outliers=0  worst=0.013     3eaff1860734
  (Out,(Fer,(Gul,Sib)));                                                            nodes=4  admix=0     outliers=7  worst=9.251     bc3749f895e6
  (Out,(Gul,(Fer,Sib)));                                                            nodes=4  admix=0     outliers=18     worst=-60.008   c3147c466046
  (Out,(Sib,(Fer,Gul)));                                                            nodes=4  admix=0     outliers=18     worst=-60.008   5e5cc8f99073
  (Out,(Gul,(Fer,Sib)));                                                            nodes=4  admix=0     outliers=18     worst=-60.008   c3147c466046
  (Out,(Sib,(Fer,Gul)));                                                            nodes=4  admix=0     outliers=18     worst=-60.008   5e5cc8f99073

Notice how the lines are repeated. Obviously this shouldn't be happening. Both the current and previous errors appear to stem from this issue.

Can you please check your script to see if you're accidentally calling qpBrute twice? If that's not the case, can you try running the test command and see if the output is being duplicated there also?

dbssyck commented 3 years ago

sim1.log

Yes, it seems that the duplicate issue is happening in the example file (attached) as well. I just checked the log from the example ran on my own console (and not the cluster) and it likewise showed duplicates. I don't think qpBrute is being called twice...

dbssyck commented 3 years ago

Sorry for the continual message. I just ran qpbrute on a dataset that previously worked on the first release. The program ran to completion with no errors, but upon inspection of the log (attached), with the same issue where the are are repeated lines. ANGSD_9_AFMSGC.600000_outpopNULL.log

ekirving commented 3 years ago

This is rather odd, as I cannot reproduce the behaviour on my machine.

Nevertheless, I made a small change which might make a difference.

Please update to the latest version and try the sim1 test again.

pip install --upgrade --no-deps --force-reinstall git+git://github.com/ekirving/qpbrute.git

dbssyck commented 3 years ago

I've updated the program and reran it, and I'm so sorry, I rechecked my previous scripts and it seems I indeed called the program twice unknowingly. The example file looks alright (attached).

Will keep you updated once again on how it goes, thanks so much for your help.

sim1.log

ekirving commented 3 years ago

Okay, great to hear that we've discovered the source of the problem.

I'll close this issue for now, but do let me know if you run into any further issues.