ArtPoon closed this issue 3 years ago.
Note the cluster is running on CentOS, so there may be some incompatibilities to look out for!
- created covizu user on BEVi
- pulled covizu from repo
- python3 setup.py install --user
- module load openmpi/gnu
- attempted mpirun -np 3 python3 covizu/clustering.py; failed with a missing scipy error
- sudo yum install python3-pip
- pip3 install scipy --user (with my user account)
  Successfully installed numpy-1.19.4 scipy-1.5.4
- again failed with a missing biopython error
- pip3 install biopython --user
  Successfully installed biopython-1.78
another error:
[covizu@BEVi covizu]$ mpirun -np 3 python3 covizu/clustering.py
Script requires mpi4py - https://pypi.org/project/mpi4py/
pip3 install mpi4py --user
Successfully installed mpi4py-3.0.3
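To avoid this install-one, run, fail-on-the-next loop, a quick preflight check could report every missing dependency at once. A minimal sketch; the module list is just the dependencies encountered so far (note biopython's import name is Bio):

```python
import importlib.util

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# the dependencies hit so far on this cluster
deps = ["scipy", "Bio", "mpi4py"]
missing = missing_modules(deps)
if missing:
    print("pip3 install --user " + " ".join(missing))
```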
first real bug:
[covizu@BEVi covizu]$ mpirun -np 3 python3 covizu/clustering.py data/by_lineage.json B
🦠 [0:00:00.000014] loading JSON
Traceback (most recent call last):
File "covizu/clustering.py", line 243, in <module>
with open(args.infile) as handle:
AttributeError: 'Namespace' object has no attribute 'infile'
Successful run on head node:
[covizu@BEVi covizu]$ mpirun -np 3 python3 covizu/clustering.py data/by_lineage.json B --nboot 3 --outdir data
🦠 [0:00:00.000017] loading JSON
🦠 [0:00:00.000015] loading JSON
🦠 [0:00:00.000015] loading JSON
🦠 [0:00:05.165614] Reduced to 1571 variants; generating feature set union
[...]
🦠 [0:00:11.954229] row 1500 of 1571
🦠 [0:00:11.963294] row 1500 of 1571
but running on other nodes failed:
[covizu@BEVi covizu]$ mpirun --machinefile mfile python3 covizu/clustering.py data/by_lineage.json B --nboot 10 --outdir data
Fatal Python error: Py_Initialize: Unable to get the locale encoding
Traceback (most recent call last):
File "/usr/lib64/python3.6/encodings/__init__.py", line 31, in <module>
Fatal Python error: Py_Initialize: Unable to get the locale encoding
File "/usr/lib64/python3.6/encodings/__init__.py", line 31, in <module>
ModuleNotFoundError: No module named 'codecs'
Yeah, it's an fstab issue:
[covizu@BEVi covizu]$ python3 -c "import codecs"
[covizu@BEVi covizu]$ bpsh 1 python3 -c "import codecs"
Fatal Python error: Py_Initialize: Unable to get the locale encoding
Traceback (most recent call last):
File "/usr/lib64/python3.6/encodings/__init__.py", line 31, in <module>
ModuleNotFoundError: No module named 'codecs'
bpsh: Child process exited abnormally.
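The manual check above can be scripted across nodes. A small sketch that wraps the same test; the command prefix is injected so it runs locally with no prefix, or per-node with something like ("bpsh", "1"):

```python
import subprocess
import sys

def can_import(module, prefix=()):
    """Return True if `<prefix> python3 -c "import <module>"` succeeds.
    prefix would be e.g. ("bpsh", "1") to test compute node 1."""
    cmd = list(prefix) + [sys.executable, "-c", "import " + module]
    return subprocess.run(cmd, stderr=subprocess.DEVNULL).returncode == 0

# e.g. for node in range(4): print(node, can_import("codecs", ("bpsh", str(node))))
print(can_import("codecs"))  # local check on the head node
```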
Usual process for reconfiguring NFS on a Scyld cluster:

/etc/exports:
/usr/lib64/python3.6 @cluster(ro,sync,no_root_squash)

sudo exportfs -a

/etc/beowulf/fstab:
$MASTER:/usr/lib64/python3.6 /usr/lib64/python3.6 nfs nolock,nonfatal 0 0
Process failed:
[art@BEVi skycross]$ sudo bpsh -a mkdir -p /usr/lib64/python3.6
[art@BEVi skycross]$ sudo bpsh -a mount -t nfs -o nolock master:/usr/lib64/python3.6 /usr/lib64/python3.6
mount.nfs: access denied by server while mounting master:/usr/lib64/python3.6
drwxr-xr-x. 26 root root 20480 Jun 11 2019 python2.7
drwxr-xr-x 32 root root 8192 Jun 11 2019 python3.6
Permissions are the same; apparently the trailing '.' has something to do with SELinux, but that is disabled.
The above means for now, I can only launch processes on the head node. I really need to get this working for the compute nodes as well, so we can spin up at least 50 processes and do 100 bootstraps in two rounds!
Run in verbose mode to try to get more info:
[art@BEVi ~]$ bpsh 0 python3 -v -c "a=1"
import _frozen_importlib # frozen
import _imp # builtin
import sys # builtin
import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
import '_frozen_importlib_external' # <class '_frozen_importlib.FrozenImporter'>
import '_io' # <class '_frozen_importlib.BuiltinImporter'>
import 'marshal' # <class '_frozen_importlib.BuiltinImporter'>
import 'posix' # <class '_frozen_importlib.BuiltinImporter'>
import _thread # previously loaded ('_thread')
import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
import _weakref # previously loaded ('_weakref')
import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
# installing zipimport hook
import 'zipimport' # <class '_frozen_importlib.BuiltinImporter'>
# installed zipimport hook
# /usr/lib64/python3.6/encodings/__pycache__/__init__.cpython-36.pyc matches /usr/lib64/python3.6/encodings/__init__.py
# code object from '/usr/lib64/python3.6/encodings/__pycache__/__init__.cpython-36.pyc'
Fatal Python error: Py_Initialize: Unable to get the locale encoding
Traceback (most recent call last):
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/usr/lib64/python3.6/encodings/__init__.py", line 31, in <module>
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'codecs'
# destroy encodings
bpsh: Child process exited abnormally.
Using strace to get even more info:
[art@BEVi ~]$ bpsh 0 strace python3
[... omitting heaps of stuff ...]
stat("/usr/lib64/python3.6/lib-dynload", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0
openat(AT_FDCWD, "/usr/lib64/python3.6/lib-dynload", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, /* 2 entries */, 32768) = 48
getdents(3, /* 0 entries */, 32768) = 0
close(3) = 0
write(2, "Fatal Python error: Py_Initializ"..., 69Fatal Python error: Py_Initialize: Unable to get the locale encoding
Yeah, it's an NFS issue alright:
[art@BEVi ~]$ bpsh 1 ls /usr/lib64/python3.6/lib-dynload
[art@BEVi ~]$ ls /usr/lib64/python3.6/lib-dynload
array.cpython-36m-x86_64-linux-gnu.so
_asyncio.cpython-36m-x86_64-linux-gnu.so
audioop.cpython-36m-x86_64-linux-gnu.so
binascii.cpython-36m-x86_64-linux-gnu.so
_bisect.cpython-36m-x86_64-linux-gnu.so
_blake2.cpython-36m-x86_64-linux-gnu.so
_bz2.cpython-36m-x86_64-linux-gnu.so
cmath.cpython-36m-x86_64-linux-gnu.so
_codecs_cn.cpython-36m-x86_64-linux-gnu.so
_codecs_hk.cpython-36m-x86_64-linux-gnu.so
_codecs_iso2022.cpython-36m-x86_64-linux-gnu.so
_codecs_jp.cpython-36m-x86_64-linux-gnu.so
_codecs_kr.cpython-36m-x86_64-linux-gnu.so
_codecs_tw.cpython-36m-x86_64-linux-gnu.so
[...]
[art@BEVi ~]$ sudo bpsh -a mount -t nfs -o nolock -vvvv master:/usr/lib64/python3.6 /usr/lib64/python3.6
mount.nfs: mount(2): No such file or directory
mount.nfs: portmap query failed: RPC: Authentication error
mount.nfs: access denied by server while mounting master:/usr/lib64/python3.6
mount.nfs: timeout set for Mon Dec 7 23:28:00 2020
mount.nfs: trying text-based options 'nolock,vers=4.1,addr=172.16.128.8,clientaddr=172.16.128.11'
mount.nfs: trying text-based options 'nolock,addr=172.16.128.8'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: mount(2): No such file or directory
mount.nfs: portmap query failed: RPC: Authentication error
mount.nfs: access denied by server while mounting master:/usr/lib64/python3.6
mount.nfs: timeout set for Mon Dec 7 23:28:01 2020
mount.nfs: trying text-based options 'nolock,vers=4.1,addr=172.16.128.8,clientaddr=172.16.128.10'
mount.nfs: trying text-based options 'nolock,addr=172.16.128.8'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: mount(2): No such file or directory
mount.nfs: portmap query failed: RPC: Authentication error
mount.nfs: timeout set for Mon Dec 7 23:28:01 2020
mount.nfs: trying text-based options 'nolock,vers=4.1,addr=172.16.128.8,clientaddr=172.16.128.12'
mount.nfs: trying text-based options 'nolock,addr=172.16.128.8'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: access denied by server while mounting master:/usr/lib64/python3.6
mount.nfs: mount(2): No such file or directory
mount.nfs: portmap query failed: RPC: Authentication error
mount.nfs: timeout set for Mon Dec 7 23:28:01 2020
mount.nfs: trying text-based options 'nolock,vers=4.1,addr=172.16.128.8,clientaddr=172.16.128.13'
mount.nfs: trying text-based options 'nolock,addr=172.16.128.8'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: access denied by server while mounting master:/usr/lib64/python3.6
Added the following lines to hosts.allow:
rpcbind : ALL : ALLOW
mountd : ALL : ALLOW
https://stackoverflow.com/questions/13111910/rpc-authentication-error
Success!
Ah, we're not done yet: we need to get batch.py tested.
Installed dependencies: minimap2 version 2.17, and fasttree2-2.1.10 at /usr/local/bin.
Exception thrown by TreeTime:
Inferred sequence evolution model (saved as data/molecular_clock.txt):
Root-Tip-Regression:
--rate: 8.000e-04
--r^2: 0.53
--- saved tree as
data/timetree.pdf
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/matplotlib-3.3.3-py3.6-linux-x86_64.egg/matplotlib/mathtext.py", line 2609, in parse
result = self._expression.parseString(s)
File "/usr/local/lib/python3.6/site-packages/pyparsing-3.0.0b1-py3.6.egg/pyparsing/core.py", line 869, in parseString
raise exc.with_traceback(None)
pyparsing.exceptions.ParseFatalException: Unknown symbol: \beta, found '\' (at char 11), (line:1, col:12)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/treetime", line 4, in <module>
__import__('pkg_resources').run_script('phylo-treetime==0.8.0', 'treetime')
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 654, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1441, in run_script
exec(script_code, namespace, namespace)
I modified the TreeTime code to skip writing the RTT plot if the user passes the argument --plot-rtt none. The analysis now runs to completion on the same temporary data files:
[covizu@BEVi covizu]$ treetime --tree /tmp/cvz_tt_nlqfuh2j --aln /tmp/cvz_tt_30987v78 --dates /tmp/cvz_tt_db7v7b9z --outdir data/ --verbose 0 --clock-rate 0.0008 --plot-rtt none
Attempting to parse dates...
Using column 'name' as name. This needs match the taxon names in the tree!!
Using column 'date' as date.
0.00 -TreeAnc: set-up
4.49 WARNING: Previous versions of TreeTime (<0.7.0) RECONSTRUCTED sequences of
tips at positions with AMBIGUOUS bases. This resulted in unexpected
behavior is some cases and is no longer done by default. If you want to
replace those ambiguous sites with their most likely state, rerun with
`reconstruct_tip_states=True` or `--reconstruct-tip-states`.
71.63 TreeTime: the following tips have been marked as outliers. Their date
constraints were not used. Please remove them from the tree. Their dates
have been reset:
71.63 B.1.129, input date: 2020.4986338797814, apparent date: 2021.05
71.63 B.1.130, input date: 2020.6516393442623, apparent date: 2022.39
Inferred sequence evolution model (saved as data/sequence_evolution_model.txt):
Substitution rate (mu): 1.0
Equilibrium frequencies (pi_i):
A: 0.2976
C: 0.1706
G: 0.1876
T: 0.3342
-: 0.01
Symmetrized rates from j->i (W_ij):
A C G T -
A 0 0.2795 1.0439 0.1494 21.7918
C 0.2795 0 0.2646 3.5697 22.1119
G 1.0439 0.2646 0 1.3013 21.9549
T 0.1494 3.5697 1.3013 0 25.5326
- 21.7918 22.1119 21.9549 25.5326 0
Actual rates from j->i (Q_ij):
A C G T -
A 0 0.0832 0.3107 0.0445 6.4862
C 0.0477 0 0.0451 0.6089 3.7715
G 0.1959 0.0496 0 0.2442 4.1193
T 0.0499 1.193 0.4349 0 8.5331
- 0.2171 0.2203 0.2187 0.2544 0
Inferred sequence evolution model (saved as data/molecular_clock.txt):
Root-Tip-Regression:
--rate: 8.000e-04
--r^2: 0.46
--- saved tree as
data/timetree.pdf
--- alignment including ancestral nodes saved as
data/ancestral_sequences.fasta
--- saved divergence times in
data/dates.tsv
--- tree saved in nexus format as
data/timetree.nexus
--- divergence tree saved in nexus format as
data/divergence_tree.nexus
Running with the default --mincount of 5000 was too slow, because processing lineages in serial mode is slow. Trying another run at 500.
Geh, forgot to load the openmpi module:
🦠 [0:11:01.700436] start B, 2841 entries
Traceback (most recent call last):
File "batch.py", line 212, in <module>
result = make_beadplots(by_lineage, args, cb.callback)
File "batch.py", line 180, in make_beadplots
args.bylineage, "--nboot", str(args.nboot), "--outdir", "data"]
File "/usr/lib64/python3.6/subprocess.py", line 306, in check_call
retcode = call(*popenargs, **kwargs)
File "/usr/lib64/python3.6/subprocess.py", line 287, in call
with Popen(*popenargs, **kwargs) as p:
File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun': 'mpirun'
[covizu@BEVi covizu]$ module load openmpi/gnu
[covizu@BEVi covizu]$ which mpirun
/opt/scyld/openmpi/2.0.4/gnu/bin/mpirun
[covizu@BEVi covizu]$ nohup python3 batch.py data/provision.json.xz data/clusters.json --mmthreads 8 --mincount 500 &
[1] 92525
[covizu@BEVi covizu]$ nohup: ignoring input and appending output to ‘nohup.out’
Still no luck:
🦠 [0:11:19.897082] start B, 2841 entries
Traceback (most recent call last):
File "batch.py", line 212, in <module>
result = make_beadplots(by_lineage, args, cb.callback)
File "batch.py", line 180, in make_beadplots
args.bylineage, "--nboot", str(args.nboot), "--outdir", "data"]
File "/usr/lib64/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '--machinefile', 'mfile', 'python3', 'covizu/clustering.py', 'data/by_lineage.json', '--nboot', '100', '--outdir', 'data']' returned non-zero exit status 2.
Looks like a missing argument.
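Exit status 2 is consistent with argparse bailing on a missing required argument: the command assembled by make_beadplots has no lineage positional, unlike the manual runs above. A hedged sketch of the corrected command assembly (the helper name and argument order are assumptions, not the actual batch.py code):

```python
def build_cluster_cmd(bylineage, lineage, nboot, outdir, machinefile="mfile"):
    """Assemble the mpirun command, including the lineage positional
    that was missing from the failing run."""
    return ["mpirun", "--machinefile", machinefile,
            "python3", "covizu/clustering.py",
            bylineage, lineage,  # lineage must follow the by-lineage JSON
            "--nboot", str(nboot), "--outdir", outdir]

cmd = build_cluster_cmd("data/by_lineage.json", "B.1.1", 100, "data")
```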
🦠 [0:11:13.019709] start B, 2841 entries
Traceback (most recent call last):
File "batch.py", line 213, in <module>
result = make_beadplots(by_lineage, args, cb.callback)
File "batch.py", line 193, in make_beadplots
ctree = clustering.consensus(trees, cutoff=args.cutoff)
File "/home/covizu/covizu/covizu/clustering.py", line 137, in consensus
trees = list(trees)
File "/home/covizu/.local/lib/python3.6/site-packages/Bio/Phylo/_io.py", line 49, in parse
yield from getattr(supported_formats[format], "parse")(fp, **kwargs)
File "/home/covizu/.local/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 52, in parse
return Parser(handle).parse(**kwargs)
File "/home/covizu/.local/lib/python3.6/site-packages/Bio/Phylo/NewickIO.py", line 104, in __init__
if handle.read(0) != "":
ValueError: I/O operation on closed file.
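This ValueError looks like the classic lazy-generator trap: Bio.Phylo.parse yields trees lazily, so if the file handle is closed before list(trees) runs, parsing hits a closed file. The same failure mode and the fix, shown with a plain generator so no Biopython is needed:

```python
import os
import tempfile

def read_lines_lazy(handle):
    # stand-in for a lazy parser like Bio.Phylo.parse
    for line in handle:
        yield line.strip()

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("(A,B);\n(A,C);\n")

# broken: the generator escapes the with-block, so the file is closed
# by the time it is consumed
with open(path) as handle:
    lazy = read_lines_lazy(handle)
try:
    trees = list(lazy)  # ValueError: I/O operation on closed file
except ValueError:
    trees = None

# fix: materialize the generator while the handle is still open
with open(path) as handle:
    trees = list(read_lines_lazy(handle))
os.remove(path)
```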
Running overnight, nohup.out contained the following error:
🦠 [0:31:15.370835] row 20100 of 20347
Traceback (most recent call last):
File "covizu/clustering.py", line 266, in <module>
phy = bootstrap(union, indexed, args.binpath, callback=cb.callback)
File "covizu/clustering.py", line 100, in bootstrap
stdout = subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
**kwargs).stdout
File "/usr/lib64/python3.6/subprocess.py", line 423, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib64/python3.6/subprocess.py", line 1295, in _execute_child
restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory
I'm reducing the number of processes distributed across the nodes from 50 to around 33, so we do 100 replicates in three rounds, not two. Doing a trial run on lineage B.1:
[covizu@BEVi covizu]$ mpirun --machinefile mfile python3 covizu/clustering.py data/by_lineage.json B.1 --nboot 100 -o working/
🦠 [0:00:00.000024] loading JSON
[...]
🦠 [0:28:33.654588] row 20000 of 20029
RapidNJ is running on all cores right now; it should take roughly half an hour. RAM consumption seems OK (about 2.5GB per process).
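For planning these runs, the round count is just a ceiling division of replicates by concurrent worker processes. A quick check of the numbers mentioned here; note that exactly 33 workers would actually need a fourth round, so "around 33" needs to mean 34 or more to stay at three:

```python
import math

def rounds(nboot, nprocs):
    """Number of sequential rounds needed to run nboot bootstrap
    replicates with nprocs concurrent worker processes."""
    return math.ceil(nboot / nprocs)

print(rounds(100, 50))  # 2 rounds at 50 processes
print(rounds(100, 34))  # 3 rounds
print(rounds(100, 33))  # 4 rounds: 33 x 3 = 99 falls one replicate short
```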
It took a bit under 5 hours to process all 100 trees for lineage B.1.1 (20347 variants). Generating the consensus for these trees took up a lot of RAM (over 50% of about 64GB), need to keep an eye on this.
Successfully transitioned to the next big lineage!
Crap, that one-tip tree problem again!
🧿 [11:23:23.441752] start B.1.1.62, 28 entries
🧿 [11:23:46.936157] start B.1.1.27, 125 entries
🧿 [11:24:12.279722] start B.1.1.8, 279 entries
🧿 [11:24:55.271488] start B.1.5.24, 200 entries
🧿 [11:25:27.534760] start B.1.165, 84 entries
🧿 [11:25:51.793485] start B.2.5, 1 entries
Traceback (most recent call last):
File "batch.py", line 221, in <module>
result = make_beadplots(by_lineage, args, cb.callback, t0=cb.t0.timestamp())
File "batch.py", line 176, in make_beadplots
beaddict = beadplot_serial(lineage, features, args)
File "batch.py", line 114, in beadplot_serial
trees, labels = clustering.build_trees(features, args, callback=callback)
File "/home/covizu/covizu/covizu/clustering.py", line 208, in build_trees
for _ in range(args.nboot)]
File "/home/covizu/covizu/covizu/clustering.py", line 208, in <listcomp>
for _ in range(args.nboot)]
File "/home/covizu/covizu/covizu/clustering.py", line 102, in bootstrap
stdout = subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
**kwargs).stdout
File "/usr/lib64/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['rapidnj', '/tmp/cvz_boot_n7tdp3fo', '-i', 'pd', '--no-negative-length']' returned non-zero exit status 1.
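Neighbor joining needs at least three taxa to build a meaningful tree, so a lineage with a single variant will always crash rapidnj. One defensive option is to short-circuit before calling it; a sketch only, where the placeholder return shape is an assumption rather than what beadplot_serial actually produces:

```python
def safe_build_trees(variants, nboot, build_trees):
    """Skip tree-building for lineages too small for neighbor joining.
    `build_trees` is the real bootstrap routine, injected here for clarity."""
    if len(variants) < 3:
        # rapidnj cannot build a tree from 1-2 taxa; return trivial placeholders
        return [None] * nboot
    return build_trees(variants, nboot)

# a 1-variant lineage (like B.2.5 above) no longer reaches rapidnj
trees = safe_build_trees(["var1"], 10, build_trees=lambda v, n: ["tree"] * n)
```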
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
75805 covizu 20 0 17.4g 14.5g 14152 R 100.3 23.2 15:18.81 python3
BWAHAHAHA I forgot to add the filtering step into the new pipeline in the dev branch (integrating GISAID feed, #186). This makes a HUGE difference: lineage B.1.1 reduces down to 13073 variants, memory consumption is way down, and speed will be way up.
fwiw I re-optimized the consensus tree generation function to reduce memory consumption, see commit de1d986fefa8552ce69b81e6542c65fded777701 and onward
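The general idea behind that kind of memory reduction can be illustrated generically: rather than holding all bootstrap trees in RAM before counting clades, stream the trees and keep only a counter of bipartitions (clades represented as frozensets of tip labels). This is an illustrative sketch, not the committed code:

```python
from collections import Counter

def count_clades(tree_clades_iter):
    """Stream one tree's clade sets at a time; only the Counter stays in memory."""
    counts = Counter()
    ntrees = 0
    for clades in tree_clades_iter:
        counts.update(frozenset(c) for c in clades)
        ntrees += 1
    return counts, ntrees

def consensus_clades(counts, ntrees, cutoff=0.5):
    """Keep clades appearing in more than `cutoff` of the trees."""
    return {clade for clade, n in counts.items() if n / ntrees > cutoff}

# toy input: three bootstrap trees, each given as a list of clades
trees = [
    [{"A", "B"}, {"A", "B", "C"}],
    [{"A", "B"}, {"B", "C"}],
    [{"A", "B"}, {"A", "B", "C"}],
]
counts, n = count_clades(iter(trees))
majority = consensus_clades(counts, n)  # {A,B} in 3/3, {A,B,C} in 2/3
```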
So close...
🏄 [3:14:16.095184] start B.1.1.54, 365 entries
🏄 [3:15:10.230928] start B.1.148, 2 entries
ERROR: mismatch detected between tree and label file:
{'hCoV-19/England/LIVE-9A117/2020|EPI_ISL_439848|2020-04-17'}
Closing with a successful run on the cluster.