gemini 0.16.3 with lsf scheduler and multi-core fail a merge step.

mnhan1 commented 9 years ago

hi, running gemini with the --cores 2 --scheduler lsf --queue que1 option. I also tried with --tempdir /var/tmp, just in case /tmp was the problem. The job continually fails at the merge chunks step. The error from the sqlite db is: no such table: toMerge.variants. On the lsf node doing the merge i notice there was an empty chunk.db that was in the /tmp or /var/tmp of the node, depending on where tempdir was. i notice that the chunk.db were never being copied back to the shared storage. I modified the gemini_merge_chunks.py to print the chunk_db that it was loading and also get and os.stats of the file. The result was: /var/tmp/mnhan20.vcf.chunk1.db Traceback (most recent call last): File "/gapp/x64linux/opt/pythonbrew/venvs/Python-2.7.9/gemini-dev/bin/gemini", line 5, in pkg_resources.run_script('gemini==0.16.3', 'gemini') File "/gapp/x64linux/opt/pythonbrew/venvs/Python-2.7.9/gemini-dev/lib/python2.7/site-packages/setuptools-0.6c11 -py2.7.egg/pkg_resources.py", line 489, in run_script File "/gapp/x64linux/opt/pythonbrew/venvs/Python-2.7.9/gemini-dev/lib/python2.7/site-packages/setuptools-0.6c11 -py2.7.egg/pkg_resources.py", line 1207, in run_script File "/gapp/x64linux/opt/pythonbrew/venvs/Python-2.7.9/gemini-dev/lib/python2.7/site-packages/gemini-0.16.3-py2 .7.egg/EGG-INFO/scripts/gemini", line 6, in gemini.gemini_main.main() File "/gapp/x64linux/opt/pythonbrew/venvs/Python-2.7.9/gemini-dev/lib/python2.7/site-packages/gemini-0.16.3-py2 .7.egg/gemini/gemini_main.py", line 1141, in main args.func(parser, args) File "/gapp/x64linux/opt/pythonbrew/venvs/Python-2.7.9/gemini-dev/lib/python2.7/site-packages/gemini-0.16.3-py2 .7.egg/gemini/gemini_main.py", line 317, in mergechunk_fn gemini_merge_chunks.merge_chunks(parser, args) File "/gapp/x64linux/opt/pythonbrew/venvs/Python-2.7.9/gemini-dev/lib/python2.7/site-packages/gemini-0.16.3-py2 .7.egg/gemini/gemini_merge_chunks.py", line 236, in merge_chunks merge_db_chunks(args) File "/gapp/x64linux/opt/pythonbrew/venvs/Python-2.7.9/gemini-dev/lib/python2.7/site-packages/gemini-0.16.3-py2 .7.egg/gemini/gemini_merge_chunks.py", line 201, in merge_db_chunks append_variant_info(main_curr, db) File "/gapp/x64linux/opt/pythonbrew/venvs/Python-2.7.9/gemini-dev/lib/python2.7/site-packages/gemini-0.16.3-py2 .7.egg/gemini/gemini_merge_chunks.py", line 18, in append_variant_info statinfo = os.stat(chunk_db) OSError: [Errno 2] No such file or directory: '/var/tmp/mnhan20.vcf.chunk1.db'

So it looks like its not able to locate the other chunk and sqlite creates a new mnhan20.vcf.chunk1.db zero size and thus gets the error " no such table: toMerge.variants" on the newly created empty db. I think the magic copy from tempdir isn't working correctly for distributed loading.

mnhan1 commented 9 years ago

my changes to gemini_merge_chunks.py

def append_variant_info(main_curr, chunk_db):
    """
    Append the variant and variant_info data from a chunk_db
    to the main database.
    """
+
+  print chunk_db
+  statinfo = os.stat(chunk_db)
+   print statinfo

brentp commented 9 years ago

can you have a look at the comments here: https://github.com/arq5x/gemini/commit/0ae48ab974b43822dd4bceb591c24034671135cd and see if they are relevant to your case?

mnhan1 commented 9 years ago

Yes i do believe its relevant to what we are seeing. When we run it via ipython, with the tempdir not shared its failing. But if tempdir is shared, then it defeats the purpose of it protecting against the nfs locking issue of sqlite. Is the fix some new magic copy to/from tempdir when ipython is used via the scheduler for distributed load? currently as it stands a nfs tempdir is needed for ipython when used via distribute loading.

chapmanb commented 9 years ago

Mike -- thanks for raising the issue here. See also the mailing list thread (https://groups.google.com/forum/#!topic/gemini-variation/O6a16V6afX8) where I tried unsuccessfully to debug. @bw2 -- do you have thoughts on this? I'm not totally sure how the --tempdir code you wrote is expected to behave for distributed IPython runs.

brentp commented 9 years ago

for now, the tempdir must be an NFS'd dir.

bw2 commented 9 years ago

@chapmanb , sorry - I didn't see this thread till now. From last time I looked at this, I think if the https://github.com/arq5x/gemini/commit/0ae48ab974b43822dd4bceb591c24034671135cd changes are reverted and --tempdir is set to a non-NFS'd dir, then ipython loading should work. The copy to/from tempdir logic (https://github.com/arq5x/gemini/blob/master/gemini/gemini_merge_chunks.py#L210) is based on the assumption that when the input dbs are copied to a tempdir they are being copied to a non-NFS'd dir (https://github.com/arq5x/gemini/blob/master/gemini/gemini_merge_chunks.py#L218) where all SQLite operations will work, and when the resulting merged db is moved back (https://github.com/arq5x/gemini/blob/master/gemini/gemini_merge_chunks.py#L227) it's assumed that it's getting put back on an NFS'd dir, so that subsequent merges will see the merged file regardless of what node they run on.

bw2 commented 9 years ago

Ps. If you guys agree this is the right idea, I can submit a pull request that undoes the https://github.com/arq5x/gemini/commit/0ae48ab974b43822dd4bceb591c24034671135cd changes and add better docs to relevant code and command line args.

chapmanb commented 9 years ago

Apologies, I'd totally forgotten I created this problem when trying to debug it. I hate to totally undo the change I put in because that avoids all of the temporary files being written to the output directory instead of the temporary directory. For multicore runs it defeats the purpose of having a temporary directory if we don't use it to write the output files.

To get it working again, I pushed a fix that undoes this for IPython but not for multicore runs. Does that resolve the issue? If so it would be awesome to have better docs on how best to set this for IPython runs. Thanks much for this and for bearing with me not knowing what is going on.

mnhan1 commented 9 years ago

the tempdir's purpose was to prevent nfs locking issue with sqlite. if the use of tempdir (local directory) is removed from ipython, is the nfs sqlite issue going to be a problem? Is the nfs sqlite locking issue more prevalent in a multicore scenario versus multi-host scenario?

chapmanb commented 9 years ago

If we'd like to work around that issue with IPython, the code would need to be expanded a bit. It would need to write the temporary files to --tempdir, a non-NFS directory, then copy them to the NFS mount so all of the temporary files are available across the cluster, then copy and merge them in --tempdir (again local). So it gets a bit more complex in the logic.

Practically I use multicore with tempdir, which can be set to local scratch in cases where NFS or other shared filesystems have issues so don't have a ton of experience with the IPython approach.

brentp commented 9 years ago

I think we should leave it as simple as possible for now, sacrificing compatibility with NFS mounts that don't work with sqlite. The new loader we're working should be > 10X as fast so even systems that can't use cluster via ipython will have very fast loading.

mnhan1 commented 9 years ago

was anything new done for this in 17.0? I ran a test with 17.0 without specifying --tempdir , with lsf and 2 cores and still fails. Didn't dig to much into it but i can if needed. Seem from looking at it, it started to merge in the share directory but then fails with: gemini merge_chunks --chunkdb /tmp/test.vcf.chunk0.db --chunkdb /tmp/test.vcf.chunk1.db --tempdir /tmp --db /mnt/sharedir/06e93e8f-90be-4d58-b6a9-675f3bf4b444.db' returned non-zero exit status 1

chapmanb commented 9 years ago

Sorry about the issues. Are you able to find the error messages you're getting from merge_chunks? This should work, with and without --tempdir on 0.17.0, so I'm not sure what is happening. There was an issue with VEP extra annotation columns in 0.17.0 that will be fixed in the soon to be released 0.17.1, so that could be a problem is you're using VEP annotated inputs. Happy to help more if you can find details about what caused the failure.

mnhan1 commented 9 years ago

Not sure if it related. I'm using --cores 2 and lsf. Its bsubing to 2 host with one host handling 2 ipython process. It reports its breaking my test.vcf.gz file into 2 chunks. I can see test.vcf.chunk1.db in /tmp of one node being created, but chunk0 is never created on the other node or any node. So chunk1 exists and chunk0 does not. So when it gets to the merge step there is nothing to merge as chunk0 never exists and it gets the "no such table: toMerge.variants" from sqlite in the merge_db_chunk routine. Running this locally and using multicores (--cores 2) shows both chunks being created and exists, and thus works fine. Still needs to set --tempdir to a shared location for things to work when using a scheduler.

chapmanb commented 9 years ago

Thanks for the detailed report. It seems like the best thing to do would specify in the docs that --tempdir needs to be specified as a shared location for IPython loading. The other avenues take a lot more logistics that don't current exist in the code because we'd need to copy back and forth from temp directories. If we added the tempdir requirement here:

http://gemini.readthedocs.org/en/latest/content/preprocessing.html#using-lsf-sge-slurm-and-torque-schedulers

would that help avoid confusion? Thanks much.

mnhan1 commented 9 years ago

I think my folks would be fine with having to specify --tempdir with lsf/ipython. I think as long as its documented/enforced that its needed so one don't forget to specify it, similar to the requirement made with --queue, then all is fine. I think sqlite with nfs works well enough the majority of the time that specifying --tempdir with a shared tempdir won't hurt.

thanks,

Michael

On Fri, 18 Sep 2015, Brad Chapman wrote:

Date: Fri, 18 Sep 2015 13:10:32 -0700 From: Brad Chapman notifications@github.com Reply-To: arq5x/gemini <reply+00ac17345117f17cb4fb4edb4b86be377f701840077783b092cf00000001121431b 892a169ce0602da0d@reply.github.com> To: arq5x/gemini gemini@noreply.github.com Cc: mnhan1 mnhan@genome.wustl.edu Subject: Re: [gemini] gemini 0.16.3 with lsf scheduler and multi-core fail a merge step. (#522)

Thanks for the detailed report. It seems like the best thing to do would specify in the docs that --tempdir needs to be specified as a shared location for IPython loading. The other avenues take a lot more logistics that don't current exist in the code because we'd need to copy back and forth from temp directories. If we added the tempdir requirement here:

http://gemini.readthedocs.org/en/latest/content/preprocessing.html#using-lsf-sge-slurm-and-torque-s chedulers

would that help avoid confusion? Thanks much.

— Reply to this email directly or view it on GitHub.[AKwXNAGZieRX2qeD6LpZRnM6crtg1PV-ks5ozGc4gaJpZM4FrNV2.gif]

    ---//---

Time flies like the wind. Fruit flies like bananas. --- Groucho Marx

Either write something worth reading or do something worth writing. --- Benjamin Franklin

A meeting is an event at which the minutes are kept and the hours are lost

This email message is a private communication. The information transmitted, including attachments, is intended only for the person or entity to which it is addressed and may contain confidential, privileged, and/or proprietary material. Any review, duplication, retransmission, distribution, or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is unauthorized by the sender and is prohibited. If you have received this message in error, please contact the sender immediately by return email and delete the original message from all computer systems. Thank you.

arq5x / gemini

gemini 0.16.3 with lsf scheduler and multi-core fail a merge step. #522