PoonLab / covizu

Rapid analysis and visualization of coronavirus genome variation
https://filogeneti.ca/CoVizu/
MIT License

Move backend processing back from Paphlagon to BEVi #499

Closed ArtPoon closed 3 months ago

ArtPoon commented 8 months ago

Now that I've done the RAM upgrade, we should restore the data processing workflow on the cluster.

ArtPoon commented 8 months ago

We'll need to install R package dependencies to handle the number of infections model, which hasn't been run on BEVi before.

GopiGugan commented 8 months ago
[gopigugan@BEVi ~]$ python3 --version
Python 3.6.8
[gopigugan@BEVi ~]$ R --version
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

We are currently running older versions of Python and R. Should these be updated on BEVi?

ArtPoon commented 7 months ago

My subscription for the clusterware system has expired, so I cannot use the package manager to update Python on BEVi. The easiest workaround would probably be to do a local installation (i.e. into /usr/local/bin) of a newer version of Python and make sure that it is in the $PATH. The same approach should work to get R up to version 4.0.

ArtPoon commented 7 months ago

@GopiGugan to run some tests on BEVi before we switch back over

GopiGugan commented 7 months ago

Running into issues installing the tidyquant R package on BEVi:

[gopigugan@BEVi ~]$ R -e "install.packages('tidyquant',dependencies=TRUE, repos='http://cran.rstudio.com/')"
...
ERROR: dependency ‘textshaping’ is not available for package ‘ragg’
...
ERROR: dependency ‘ragg’ is not available for package ‘tidyverse’
...
ERROR: dependency ‘tidyverse’ is not available for package ‘tidyquant’

It looks like some package dependencies are not available under R version 4.3.2. We are currently using version 4.2.2 on Paphlagon.

Downgrading R from version 4.3.2 to 4.2.2 on BEVi.

GopiGugan commented 7 months ago

Successfully installed the R packages. Now running into an error when installing the rpy2 package:

# pip3 install .
Processing /home/gopigugan/rpy2-RELEASE_3_5_14
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [45 lines of output]
      R was not built as a library
      /home/gopigugan/rpy2-RELEASE_3_5_14/./rpy2/situation.py:335: UserWarning: No libraries as -l arguments to the compiler.
        warnings.warn('No libraries as -l arguments to the compiler.')
      R was not built as a library
      /home/gopigugan/rpy2-RELEASE_3_5_14/./rpy2/situation.py:322: UserWarning: No include specified
        warnings.warn('No include specified')
      /tmp/tmp_pw_r_7nwyffu6/test_pw_r.c:1:10: fatal error: Rinterface.h: No such file or directory
          1 | #include <Rinterface.h>

GopiGugan commented 7 months ago

Issue seems to be the following: R was not built as a library

Reinstalling R version 4.2.2 with the --enable-R-shlib option:

make clean
./configure --prefix=/usr/local --enable-R-shlib
make
make install
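
A quick way to confirm that the rebuild produced a shared library is to look for libR.so under the R home directory. The sketch below is only an illustration: the helper name `r_shared_library_path` is made up here, and it assumes the usual Linux layout in which the library sits at `<R home>/lib/libR.so` (the R home directory itself is printed by `R RHOME`).

```python
import os
import tempfile

def r_shared_library_path(r_home):
    """Return the path to libR.so under r_home, or None if it is absent."""
    candidate = os.path.join(r_home, "lib", "libR.so")
    return candidate if os.path.isfile(candidate) else None

# Demonstrate with a throwaway directory so no real R installation is needed.
with tempfile.TemporaryDirectory() as fake_home:
    # Before the "build": no lib/libR.so, so the helper reports None.
    missing = r_shared_library_path(fake_home)
    # Simulate a build with --enable-R-shlib by creating an empty libR.so.
    os.makedirs(os.path.join(fake_home, "lib"))
    open(os.path.join(fake_home, "lib", "libR.so"), "w").close()
    found = r_shared_library_path(fake_home)
```

On a real system the argument would be the output of `R RHOME` rather than a temporary directory.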

GopiGugan commented 7 months ago

rpy2 installed successfully, but importing it raises an error:

# python3
Python 3.11.3 (main, Jan 16 2024, 01:12:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rpy2.robjects import pandas2ri
Error in glue(.Internal(R.home()), "library", "base", "R", "base", sep = .Platform$file.sep) :
  4 arguments passed to .Internal(paste) which requires 3
Error: could not find function "attach"
Error: object '.ArgsEnv' not found
Fatal error: unable to initialize the JIT

Had to set the following variable to resolve the error:

export LD_LIBRARY_PATH="$(python3 -m rpy2.situation LD_LIBRARY_PATH)":${LD_LIBRARY_PATH}
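
One detail of the export above: if LD_LIBRARY_PATH was previously unset, the result ends in a trailing colon, and the dynamic loader treats an empty entry as the current working directory. The sketch below shows one way to build the value without empty or duplicate entries. The function name `prepend_library_path` and the example directories are hypothetical, and the variable still has to be exported in the shell before Python starts for the loader to see it.

```python
def prepend_library_path(new_dirs, current=""):
    """Join new_dirs ahead of current, dropping empty and duplicate entries."""
    merged = []
    for entry in new_dirs.split(":") + current.split(":"):
        # Skip empty entries (which the loader would read as ".") and repeats.
        if entry and entry not in merged:
            merged.append(entry)
    return ":".join(merged)

# Hypothetical directories, for illustration only.
fresh = prepend_library_path("/usr/local/lib64/R/lib", "")
combined = prepend_library_path("/a:/b", "/b:/c")
```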

GopiGugan commented 7 months ago

Pipeline ran successfully with the test data file:

[covizu@BEVi covizu]$ python3 batch.py --dry-run --infile dev.2000.json.xz
🏄 [0:00:01.038814] Processing GISAID feed data
🏄 [0:00:03.346096] aligned 0 records
🏄 [0:00:03.430148] filtered 1066 problematic features
🏄 [0:00:03.430193]          671 genomes with excess missing sites
🏄 [0:00:03.430204]          163 genomes with excess divergence
🏄 [0:00:03.430838] Parsing Pango lineage designations
🏄 [0:00:05.122239] Identifying lineage representative genomes
🏄 [0:00:05.185900] Reconstructing tree with fasttree2
FastTree Version 2.1.11 Double precision (No SSE3)
...
🏄 [0:01:58.282415][5/56] starting BA.2.1
🏄 [0:02:04.622877][0/56] starting BF.7.5
🏄 [0:02:04.949943][0/56] starting BA.5.1.3
🏄 [0:02:05.022366] Parsing output files
R[write to console]: In addition:
R[write to console]: There were 50 or more warnings (use warnings() to see the first 50)
R[write to console]:

R[write to console]: In addition:
R[write to console]: There were 50 or more warnings (use warnings() to see the first 50)
R[write to console]:

🏄 [0:03:30.864626] All done!

Initiated a dry run to verify there are no issues: nohup python3 batch.py --dry-run > ~/iss499.log &

ArtPoon commented 6 months ago

@GopiGugan reports a successful run

ArtPoon commented 6 months ago

Obviously this is on hold until we can get the damn cluster back online (#516)

GopiGugan commented 5 months ago

Currently building database on BEVi (#493, #485)

GopiGugan commented 5 months ago

Investigating a KeyError while building the database:

Traceback (most recent call last):
  File "/home/covizu/covizu/batch.py", line 250, in <module>
    by_lineage = process_feed(args, cur, cb.callback)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/covizu/covizu/batch.py", line 179, in process_feed
    return gisaid_utils.sort_by_lineage(filtered, callback=callback)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 277, in sort_by_lineage
    for i, record in enumerate(records):
  File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 220, in filter_problematic
    for record in records:
  File "/home/covizu/covizu/covizu/utils/gisaid_utils.py", line 179, in extract_features
    record = new_records[qname]
             ~~~~~~~~~~~^^^^^^^
KeyError: 'hCoV-19/South'

GopiGugan commented 5 months ago

The issue is that when the virus name (qname) contains a space, e.g. hCoV-19/South Africa/....., the name gets cut off at the space in the minimap2 output:

https://github.com/PoonLab/covizu/blob/db11b2f0d0776064a63a87ff27c11fcf3a01d552/covizu/minimap2.py#L67-L72

So the pipeline was failing when trying to retrieve a record by the truncated qname: https://github.com/PoonLab/covizu/blob/db11b2f0d0776064a63a87ff27c11fcf3a01d552/covizu/utils/gisaid_utils.py#L177-L180
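
The truncation can be reproduced in a few lines. This is a minimal sketch, not the code in the repository: the record name, the accession and the helper names `first_token` and `whitespace_safe` are made up for illustration, and replacing whitespace with a placeholder is only one possible workaround, not necessarily the fix that was adopted.

```python
def first_token(name):
    """Return the part of a sequence name before the first whitespace."""
    return name.split(None, 1)[0]

def whitespace_safe(name, placeholder="_"):
    """Replace each run of whitespace so the name stays a single token."""
    return placeholder.join(name.split())

full_name = "hCoV-19/South Africa/example/2021"  # made-up record name
records = {full_name: {"accession": "EPI_ISL_0000000"}}  # made-up accession

# Mimic the aligner output: only the first token of the name survives,
# so looking it up in records raises a KeyError.
truncated = first_token(full_name)

# Possible workaround: sanitize names before alignment and key on that form.
safe_name = whitespace_safe(full_name)
safe_records = {whitespace_safe(k): v for k, v in records.items()}
```

Here `truncated` is `'hCoV-19/South'`, the same key reported in the traceback, while `safe_name` passes through `first_token` unchanged.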

The pipeline is also failing because we retrieve records and insert them into the database by qname instead of by accession ID, and qname is not unique:

https://github.com/PoonLab/covizu/blob/db11b2f0d0776064a63a87ff27c11fcf3a01d552/covizu/utils/gisaid_utils.py#L122-L124

https://github.com/PoonLab/covizu/blob/db11b2f0d0776064a63a87ff27c11fcf3a01d552/covizu/utils/gisaid_utils.py#L179-L187
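
To illustrate why a non-unique qname is a problem as a key, the sketch below indexes two made-up records both ways. The field names `accession` and `qname` and the values are hypothetical and are not taken from the covizu code or from GISAID data.

```python
# Two made-up records that share a virus name but have distinct accessions.
records = [
    {"accession": "EPI_ISL_0000001", "qname": "hCoV-19/example/1/2021"},
    {"accession": "EPI_ISL_0000002", "qname": "hCoV-19/example/1/2021"},
]

# Keyed by qname, the second record silently overwrites the first.
by_qname = {r["qname"]: r for r in records}

# Keyed by accession, both records are kept.
by_accession = {r["accession"]: r for r in records}

lost = len(records) - len(by_qname)  # number of records dropped by the collision
```

With these inputs `by_qname` holds a single entry and `by_accession` holds two, so any retrieval or database insert keyed on qname would drop or mismatch one genome.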

ArtPoon commented 4 months ago

Let's write database dumps to the filesystem on the following basis:

ArtPoon commented 4 months ago

@GopiGugan testing out script for clearing out expired logs

ArtPoon commented 4 months ago

@GopiGugan to push the clean up script to repo and close
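
For readers wondering what such a clean-up script might look like, here is a minimal sketch. It is not the script from the repository: the function name `find_expired`, the age-based rule and the use of file modification times are all assumptions made for illustration.

```python
import os
import time

def find_expired(directory, max_age_days, now=None):
    """Return sorted paths of regular files in directory older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    expired = []
    for entry in os.scandir(directory):
        # Compare each file's modification time against the cutoff.
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            expired.append(entry.path)
    return sorted(expired)

def remove_expired(directory, max_age_days, dry_run=True):
    """Delete expired files; with dry_run=True only report what would go."""
    paths = find_expired(directory, max_age_days)
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

Defaulting to a dry run keeps the first invocation non-destructive, so the list of candidate files can be inspected before anything is deleted.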