PoonLab / covizu

Rapid analysis and visualization of coronavirus genome variation
https://filogeneti.ca/CoVizu/
MIT License
46 stars 20 forks source link

Relocate from Langley to BEVi (cluster)? #66

Closed ArtPoon closed 3 years ago

ArtPoon commented 4 years ago
ArtPoon commented 3 years ago

Note the cluster is running on CentOS so there may be some incompatibilities to look out for!

ArtPoon commented 3 years ago

Successful run on head node:

[covizu@BEVi covizu]$ mpirun -np 3 python3 covizu/clustering.py data/by_lineage.json B --nboot 3 --outdir data
🦠 [0:00:00.000017] loading JSON
🦠 [0:00:00.000015] loading JSON
🦠 [0:00:00.000015] loading JSON
🦠 [0:00:05.165614] Reduced to 1571 variants; generating feature set union
[...]
🦠 [0:00:11.954229]   row 1500 of 1571
🦠 [0:00:11.963294]   row 1500 of 1571

but running on other nodes failed:

[covizu@BEVi covizu]$ mpirun --machinefile mfile python3 covizu/clustering.py data/by_lineage.json B --nboot 10 --outdir data
Fatal Python error: Py_Initialize: Unable to get the locale encoding
Traceback (most recent call last):
  File "/usr/lib64/python3.6/encodings/__init__.py", line 31, in <module>
Fatal Python error: Py_Initialize: Unable to get the locale encoding
  File "/usr/lib64/python3.6/encodings/__init__.py", line 31, in <module>
ModuleNotFoundError: No module named 'codecs'
ArtPoon commented 3 years ago

Yeah it's an fstab issue:

[covizu@BEVi covizu]$ python3 -c "import codecs"
[covizu@BEVi covizu]$ bpsh 1 python3 -c "import codecs"
Fatal Python error: Py_Initialize: Unable to get the locale encoding
Traceback (most recent call last):
  File "/usr/lib64/python3.6/encodings/__init__.py", line 31, in <module>
ModuleNotFoundError: No module named 'codecs'
bpsh: Child process exited abnormally.
ArtPoon commented 3 years ago

Usual process for reconfiguring NFS on Scyld cluster:

Process failed:

[art@BEVi skycross]$ sudo bpsh -a mkdir -p /usr/lib64/python3.6
[art@BEVi skycross]$ sudo bpsh -a mount -t nfs -o nolock master:/usr/lib64/python3.6 /usr/lib64/python3.6
mount.nfs: access denied by server while mounting master:/usr/lib64/python3.6
ArtPoon commented 3 years ago
drwxr-xr-x. 26 root root    20480 Jun 11  2019 python2.7
drwxr-xr-x  32 root root     8192 Jun 11  2019 python3.6

Permissions are the same - apparently the . has something to do with SELinux, but that is disabled.

ArtPoon commented 3 years ago

The above means for now, I can only launch processes on the head node. I really need to get this working for the compute nodes as well, so we can spin up at least 50 processes and do 100 bootstraps in two rounds!

ArtPoon commented 3 years ago

Run in verbose mode to try to get more info:

[art@BEVi ~]$ bpsh 0 python3 -v -c "a=1"
import _frozen_importlib # frozen
import _imp # builtin
import sys # builtin
import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
import '_frozen_importlib_external' # <class '_frozen_importlib.FrozenImporter'>
import '_io' # <class '_frozen_importlib.BuiltinImporter'>
import 'marshal' # <class '_frozen_importlib.BuiltinImporter'>
import 'posix' # <class '_frozen_importlib.BuiltinImporter'>
import _thread # previously loaded ('_thread')
import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
import _weakref # previously loaded ('_weakref')
import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
# installing zipimport hook
import 'zipimport' # <class '_frozen_importlib.BuiltinImporter'>
# installed zipimport hook
# /usr/lib64/python3.6/encodings/__pycache__/__init__.cpython-36.pyc matches /usr/lib64/python3.6/encodings/__init__.py
# code object from '/usr/lib64/python3.6/encodings/__pycache__/__init__.cpython-36.pyc'
Fatal Python error: Py_Initialize: Unable to get the locale encoding
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/lib64/python3.6/encodings/__init__.py", line 31, in <module>
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'codecs'
# destroy encodings
bpsh: Child process exited abnormally.

Using strace to get even more info:

[art@BEVi ~]$ bpsh 0 strace python3
[... omitting heaps of stuff ...]
stat("/usr/lib64/python3.6/lib-dynload", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0
openat(AT_FDCWD, "/usr/lib64/python3.6/lib-dynload", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, /* 2 entries */, 32768)     = 48
getdents(3, /* 0 entries */, 32768)     = 0
close(3)                                = 0
write(2, "Fatal Python error: Py_Initializ"..., 69Fatal Python error: Py_Initialize: Unable to get the locale encoding

Yeah, it's an NFS issue alright:

[art@BEVi ~]$ bpsh 1 ls /usr/lib64/python3.6/lib-dynload
[art@BEVi ~]$ ls /usr/lib64/python3.6/lib-dynload
array.cpython-36m-x86_64-linux-gnu.so
_asyncio.cpython-36m-x86_64-linux-gnu.so
audioop.cpython-36m-x86_64-linux-gnu.so
binascii.cpython-36m-x86_64-linux-gnu.so
_bisect.cpython-36m-x86_64-linux-gnu.so
_blake2.cpython-36m-x86_64-linux-gnu.so
_bz2.cpython-36m-x86_64-linux-gnu.so
cmath.cpython-36m-x86_64-linux-gnu.so
_codecs_cn.cpython-36m-x86_64-linux-gnu.so
_codecs_hk.cpython-36m-x86_64-linux-gnu.so
_codecs_iso2022.cpython-36m-x86_64-linux-gnu.so
_codecs_jp.cpython-36m-x86_64-linux-gnu.so
_codecs_kr.cpython-36m-x86_64-linux-gnu.so
_codecs_tw.cpython-36m-x86_64-linux-gnu.so
[...]
ArtPoon commented 3 years ago
[art@BEVi ~]$ sudo bpsh -a mount -t nfs -o nolock -vvvv master:/usr/lib64/python3.6 /usr/lib64/python3.6
mount.nfs: mount(2): No such file or directory
mount.nfs: portmap query failed: RPC: Authentication error
mount.nfs: access denied by server while mounting master:/usr/lib64/python3.6
mount.nfs: timeout set for Mon Dec  7 23:28:00 2020
mount.nfs: trying text-based options 'nolock,vers=4.1,addr=172.16.128.8,clientaddr=172.16.128.11'
mount.nfs: trying text-based options 'nolock,addr=172.16.128.8'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: mount(2): No such file or directory
mount.nfs: portmap query failed: RPC: Authentication error
mount.nfs: access denied by server while mounting master:/usr/lib64/python3.6
mount.nfs: timeout set for Mon Dec  7 23:28:01 2020
mount.nfs: trying text-based options 'nolock,vers=4.1,addr=172.16.128.8,clientaddr=172.16.128.10'
mount.nfs: trying text-based options 'nolock,addr=172.16.128.8'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: mount(2): No such file or directory
mount.nfs: portmap query failed: RPC: Authentication error
mount.nfs: timeout set for Mon Dec  7 23:28:01 2020
mount.nfs: trying text-based options 'nolock,vers=4.1,addr=172.16.128.8,clientaddr=172.16.128.12'
mount.nfs: trying text-based options 'nolock,addr=172.16.128.8'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: access denied by server while mounting master:/usr/lib64/python3.6
mount.nfs: mount(2): No such file or directory
mount.nfs: portmap query failed: RPC: Authentication error
mount.nfs: timeout set for Mon Dec  7 23:28:01 2020
mount.nfs: trying text-based options 'nolock,vers=4.1,addr=172.16.128.8,clientaddr=172.16.128.13'
mount.nfs: trying text-based options 'nolock,addr=172.16.128.8'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: access denied by server while mounting master:/usr/lib64/python3.6
ArtPoon commented 3 years ago

Added the following lines to hosts.allow:

rpcbind : ALL : ALLOW
mountd : ALL : ALLOW

https://stackoverflow.com/questions/13111910/rpc-authentication-error

Success!

ArtPoon commented 3 years ago

Ah, we're not done - need to get batch.py tested.

ArtPoon commented 3 years ago

Installed dependencies:

ArtPoon commented 3 years ago

Exception thrown by TreeTime:

Inferred sequence evolution model (saved as data/molecular_clock.txt):
Root-Tip-Regression:
 --rate:        8.000e-04
 --r^2:         0.53

--- saved tree as 
         data/timetree.pdf

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/matplotlib-3.3.3-py3.6-linux-x86_64.egg/matplotlib/mathtext.py", line 2609, in parse
    result = self._expression.parseString(s)
  File "/usr/local/lib/python3.6/site-packages/pyparsing-3.0.0b1-py3.6.egg/pyparsing/core.py", line 869, in parseString
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseFatalException: Unknown symbol: \beta, found '\'  (at char 11), (line:1, col:12)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/treetime", line 4, in <module>
    __import__('pkg_resources').run_script('phylo-treetime==0.8.0', 'treetime')
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 654, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1441, in run_script
    exec(script_code, namespace, namespace)
ArtPoon commented 3 years ago

I modified the TreeTime code to skip writing the RTT plot if the user passes the argument --plot-rtt none. Analysis runs to completion on same temporary data files now:

[covizu@BEVi covizu]$ treetime --tree /tmp/cvz_tt_nlqfuh2j --aln /tmp/cvz_tt_30987v78 --dates /tmp/cvz_tt_db7v7b9z --outdir data/ --verbose 0 --clock-rate 0.0008 --plot-rtt none
Attempting to parse dates...
    Using column 'name' as name. This needs match the taxon names in the tree!!
    Using column 'date' as date.

0.00    -TreeAnc: set-up

4.49    WARNING: Previous versions of TreeTime (<0.7.0) RECONSTRUCTED sequences of
        tips at positions with AMBIGUOUS bases. This resulted in unexpected
        behavior is some cases and is no longer done by default. If you want to
        replace those ambiguous sites with their most likely state, rerun with
        `reconstruct_tip_states=True` or `--reconstruct-tip-states`.

71.63   TreeTime: the following tips have been marked as outliers. Their date
        constraints were not used. Please remove them from the tree. Their dates
        have been reset:

71.63   B.1.129, input date: 2020.4986338797814, apparent date: 2021.05

71.63   B.1.130, input date: 2020.6516393442623, apparent date: 2022.39

Inferred sequence evolution model (saved as data/sequence_evolution_model.txt):
Substitution rate (mu): 1.0

Equilibrium frequencies (pi_i):
  A: 0.2976
  C: 0.1706
  G: 0.1876
  T: 0.3342
  -: 0.01

Symmetrized rates from j->i (W_ij):
    A   C   G   T   -
  A 0   0.2795  1.0439  0.1494  21.7918
  C 0.2795  0   0.2646  3.5697  22.1119
  G 1.0439  0.2646  0   1.3013  21.9549
  T 0.1494  3.5697  1.3013  0   25.5326
  - 21.7918 22.1119 21.9549 25.5326 0

Actual rates from j->i (Q_ij):
    A   C   G   T   -
  A 0   0.0832  0.3107  0.0445  6.4862
  C 0.0477  0   0.0451  0.6089  3.7715
  G 0.1959  0.0496  0   0.2442  4.1193
  T 0.0499  1.193   0.4349  0   8.5331
  - 0.2171  0.2203  0.2187  0.2544  0

Inferred sequence evolution model (saved as data/molecular_clock.txt):
Root-Tip-Regression:
 --rate:    8.000e-04
 --r^2:     0.46

--- saved tree as 
     data/timetree.pdf

--- alignment including ancestral nodes saved as  
     data/ancestral_sequences.fasta

--- saved divergence times in 
     data/dates.tsv

--- tree saved in nexus format as  
     data/timetree.nexus

--- divergence tree saved in nexus format as  
     data/divergence_tree.nexus
ArtPoon commented 3 years ago

Running default --mincount at 5000 was too slow because processing lineages in serial mode is slow. Trying another run at 500.

ArtPoon commented 3 years ago

Geh, forgot to load the openmpi module:

🦠 [0:11:01.700436] start B, 2841 entries
Traceback (most recent call last):
  File "batch.py", line 212, in <module>
    result = make_beadplots(by_lineage, args, cb.callback)
  File "batch.py", line 180, in make_beadplots
    args.bylineage, "--nboot", str(args.nboot), "--outdir", "data"]
  File "/usr/lib64/python3.6/subprocess.py", line 306, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python3.6/subprocess.py", line 287, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun': 'mpirun'
[covizu@BEVi covizu]$ module load openmpi/gnu
[covizu@BEVi covizu]$ which mpirun
/opt/scyld/openmpi/2.0.4/gnu/bin/mpirun
[covizu@BEVi covizu]$ nohup python3 batch.py data/provision.json.xz data/clusters.json --mmthreads 8 --mincount 500 &
[1] 92525
[covizu@BEVi covizu]$ nohup: ignoring input and appending output to ‘nohup.out’
ArtPoon commented 3 years ago

Still no:

🦠 [0:11:19.897082] start B, 2841 entries
Traceback (most recent call last):
  File "batch.py", line 212, in <module>
    result = make_beadplots(by_lineage, args, cb.callback)
  File "batch.py", line 180, in make_beadplots
    args.bylineage, "--nboot", str(args.nboot), "--outdir", "data"]
  File "/usr/lib64/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '--machinefile', 'mfile', 'python3', 'covizu/clustering.py', 'data/by_lineage.json', '--nboot', '100', '--outdir', 'data']' returned non-zero exit status 2.

Looks like a missing argument.

ArtPoon commented 3 years ago
🦠 [0:11:13.019709] start B, 2841 entries
Traceback (most recent call last):
  File "batch.py", line 213, in <module>
    result = make_beadplots(by_lineage, args, cb.callback)
  File "batch.py", line 193, in make_beadplots
    ctree = clustering.consensus(trees, cutoff=args.cutoff)
  File "/home/covizu/covizu/covizu/clustering.py", line 137, in consensus
    trees = list(trees)
  File "/home/covizu/.local/lib/python3.6/site-packages/Bio/Phylo/_io.py", line 49, in parse
    yield from getattr(supported_formats[format], "parse")(fp, **kwargs)
  File "/home/covizu/.local/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 52, in parse
    return Parser(handle).parse(**kwargs)
  File "/home/covizu/.local/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 104, in __init__
    if handle.read(0) != "":
ValueError: I/O operation on closed file.
ArtPoon commented 3 years ago

Running over night, nohup.out contained the following error:

<U+1F9A0> [0:31:15.370835]   row 20100 of 20347
Traceback (most recent call last):
  File "covizu/clustering.py", line 266, in <module>
    phy = bootstrap(union, indexed, args.binpath, callback=cb.callback)
  File "covizu/clustering.py", line 100, in bootstrap
    stdout = subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
  File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib64/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.6/subprocess.py", line 1295, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

I'm reducing the number of processes distributed across the nodes from 50 to around 33, so we do 100 replicates in three rounds, not two. Doing a trial run on lineage B.1:

[covizu@BEVi covizu]$ mpirun --machinefile mfile python3 covizu/clustering.py data/by_lineage.json B.1 --nboot 100 -o working/
🦠 [0:00:00.000024] loading JSON
[...]
🦠 [0:28:33.654588]   row 20000 of 20029

RapidNJ is running on all cores right now, should take roughly half an hour. RAM consumption seems OK (about 2.5GB per process).

ArtPoon commented 3 years ago

It took a bit under 5 hours to process all 100 trees for lineage B.1.1 (20347 variants). Generating the consensus for these trees took up a lot of RAM (over 50% of about 64GB), need to keep an eye on this.

ArtPoon commented 3 years ago

Successfully transitioned to the next big lineage!

ArtPoon commented 3 years ago

Crap, that one tip tree problem again!

🧿 [11:23:23.441752] start B.1.1.62, 28 entries
🧿 [11:23:46.936157] start B.1.1.27, 125 entries
🧿 [11:24:12.279722] start B.1.1.8, 279 entries
🧿 [11:24:55.271488] start B.1.5.24, 200 entries
🧿 [11:25:27.534760] start B.1.165, 84 entries
🧿 [11:25:51.793485] start B.2.5, 1 entries
Traceback (most recent call last):
  File "batch.py", line 221, in <module>
    result = make_beadplots(by_lineage, args, cb.callback, t0=cb.t0.timestamp())
  File "batch.py", line 176, in make_beadplots
    beaddict = beadplot_serial(lineage, features, args)
  File "batch.py", line 114, in beadplot_serial
    trees, labels = clustering.build_trees(features, args, callback=callback)
  File "/home/covizu/covizu/covizu/clustering.py", line 208, in build_trees
    for _ in range(args.nboot)]
  File "/home/covizu/covizu/covizu/clustering.py", line 208, in <listcomp>
    for _ in range(args.nboot)]
  File "/home/covizu/covizu/covizu/clustering.py", line 102, in bootstrap
    stdout = subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
  File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib64/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['rapidnj', '/tmp/cvz_boot_n7tdp3fo', '-i', 'pd', '--no-negative-length']' returned non-zero exit status 1.
ArtPoon commented 3 years ago
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND  
 75805 covizu    20   0   17.4g  14.5g  14152 R 100.3 23.2  15:18.81 python3 
ArtPoon commented 3 years ago

BWAHAHAHA I forgot to add the filtering step into the new pipeline in the dev branch (integrating GISAID feed, #186) - this makes a HUGE difference - lineage B.1.1 reduces down to 13073 variants, memory consumption is way down and speed will be way up.

ArtPoon commented 3 years ago

fwiw I re-optimized the consensus tree generation function to reduce memory consumption, see commit de1d986fefa8552ce69b81e6542c65fded777701 and onward

ArtPoon commented 3 years ago

So close...

🏄 [3:14:16.095184] start B.1.1.54, 365 entries
🏄 [3:15:10.230928] start B.1.148, 2 entries
ERROR: mismatch detected between tree and label file:
{'hCoV-19/England/LIVE-9A117/2020|EPI_ISL_439848|2020-04-17'}
ArtPoon commented 3 years ago

Closing with successful run on cluster