databio / pepatac

A modular, containerized pipeline for ATAC-seq data processing
http://pepatac.databio.org
BSD 2-Clause "Simplified" License

checkinstall pipe from web #189

Closed nsheff closed 2 years ago

nsheff commented 3 years ago

I just pushed a few updates to checkinstall on the dev branch.

https://github.com/databio/pepatac/commit/5bd07e1086dc541693a0efbde3c93dfba634f288

I would like to be able to do this:

curl -sSL https://raw.githubusercontent.com/databio/pepatac/dev/checkinstall | bash

This way, the person doesn't have to have cloned the repository or whatever, and can just see if the pipeline can go. Right now this works fine, except for 2 issues:

  1. it's using requirements.txt for Python requirements. That's great because it means you don't have to maintain 2 lists of python package requirements... but unfortunately it makes it so the checkinstall script is not standalone, and requires the repository to function, so it can't be run remotely like this.

I think it would be worth duplicating the python package requirement list into the checkinstall script to make it self-contained. Then, there could perhaps be a little sed one-liner that would update that list from the requirements.txt file, if you want to make it easier to maintain.
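A sketch of what that sed one-liner could look like (the `PYTHON_REQS=` variable name and the demo files are assumptions, not actual checkinstall contents): flatten requirements.txt into one line, dropping commented-out packages, and splice it over the embedded list.

```shell
# Demo setup with a toy requirements.txt and a checkinstall stub
# containing the assumed embedded-list line:
cd "$(mktemp -d)"
printf 'numpy>=1.17\n#fseq2>=2.0.2\npandas>=0.20.2\n' > requirements.txt
printf 'PYTHON_REQS="stale list"\n' > checkinstall

# Flatten requirements.txt (skipping comment lines) and splice it in:
REQS=$(grep -v '^#' requirements.txt | tr '\n' ' ')
sed -i "s|^PYTHON_REQS=.*|PYTHON_REQS=\"${REQS% }\"|" checkinstall
cat checkinstall
```

Running the sed step whenever requirements.txt changes would keep the duplicated list from drifting.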

  2. I'm getting this error during the bulker test:
Checking bulker installation...                            
✔ SUCCESS: bulker
grep: /home/nsheff/code/pepatac_paper/sample_pipeline_interface.yaml: No such file or directory
crate: 
usage: bulker load [-h] [-c CONFIG] [-m MANIFEST] [-p PATH] [-b] [-f] [-r] crate-registry-paths
bulker load: error: the following arguments are required: crate-registry-paths
Bulker config: /home/nsheff/Dropbox/env/bulker_config/zither.yaml
Activating crate: /home/nsheff/code/pepatac_paper/pipelines/pepatac.py

Traceback (most recent call last):
  File "/home/nsheff/.local/bin/bulker", line 8, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.8/site-packages/bulker/bulker.py", line 1349, in main
    bulker_run(bulker_config, cratelist, args.cmd, strict=args.strict)
  File "/home/nsheff/.local/lib/python3.8/site-packages/bulker/bulker.py", line 824, in bulker_run
    newpath = get_new_PATH(bulker_config, cratelist, strict)
  File "/home/nsheff/.local/lib/python3.8/site-packages/bulker/bulker.py", line 812, in get_new_PATH
    cratepaths += get_local_path(bulker_config, cratevars) + os.pathsep
  File "/home/nsheff/.local/lib/python3.8/site-packages/bulker/bulker.py", line 797, in get_local_path
    _LOGGER.debug(bulker_config.bulker.crates[cratevars["namespace"]][cratevars["crate"]].to_dict())
TypeError: 'NoneType' object is not subscriptable
✔ SUCCESS: bulker run 

thoughts?

jpsmith5 commented 3 years ago

Okay. Try the one on dev again. 1) It will use requirements.txt if present; otherwise it falls back to a provided list that we'll have to keep updated. 2) This section was also expecting to be in the pepatac/ repo, and was thus using files that should be there; addressed similarly to 1). That NoneType error came from trying to grab information from a file that wasn't there, which left a blank in the test command.

nsheff commented 3 years ago

Ok, it worked for me. When I run the checkinstall script, it is pulling a docker container. Is that intentional? I'm guessing somewhere in the bulker check it's trying to run the commands, and if they aren't there, then it pulls them? Do we want the checkinstall script to pull all of the docker containers?

jpsmith5 commented 3 years ago

Hmm, yeah, I see. Right now, if you run it in the folder, it tries to run the pipeline itself using bulker run. To allow it to be run from the web, I have an alternate check that just confirms one of the crate functions is actually runnable, e.g. bulker run bowtie2 --help. We could try to avoid that and confirm some other way, or is it better to go ahead and pull all the required containers and confirm they are all runnable?
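The lightweight route described above could be wrapped in a small helper that reports success or failure of a single command without pulling every container up front (a sketch; the bulker invocation and crate path in the comment are assumptions):

```shell
# Run a command silently and report whether it succeeded.
check_tool() {
    if "$@" > /dev/null 2>&1; then
        echo "SUCCESS: $1"
    else
        echo "FAIL: $1"
    fi
}

# In checkinstall this might be invoked as something like:
#   check_tool bulker run databio/pepatac bowtie2 --help
# Illustrated here with stand-in commands:
check_tool true
check_tool false
```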

nsheff commented 3 years ago

> Okay. Try the one on dev again. 1) It will use requirements.txt if present; otherwise it falls back to a provided list that we'll have to keep updated. 2) This section was also expecting to be in the pepatac/ repo, and was thus using files that should be there; addressed similarly to 1). That NoneType error came from trying to grab information from a file that wasn't there, which left a blank in the test command.

Just had an idea -- couldn't it just curl the list so you don't have to duplicate it? In other words, if I do ./checkinstall locally, it finds the local requirements.txt file, no issue. If I do curl -sSL https://raw.githubusercontent.com/databio/pepatac/dev/checkinstall | bash, then it realizes it doesn't have it locally... it just curls https://raw.githubusercontent.com/databio/pepatac/master/requirements.txt

Just do this in the shell script: REQS=$(curl https://raw.githubusercontent.com/databio/pepatac/master/requirements.txt). Now there's no need to maintain 2 lists. The only disadvantage is that it only works on an internet-connected system. But if you're already running it via curl, then you've shown that you can curl stuff.

nsheff@zither:~$ REQS=$(curl https://raw.githubusercontent.com/databio/pepatac/master/requirements.txt)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   390  100   390    0     0   3786      0 --:--:-- --:--:-- --:--:--  3786
nsheff@zither:~$ echo $REQS
attmap>=0.12.9 bio>=0.2.4 codecov>=2.0 colorama>=0.3.9 Cython>=0.29 cykhash>=1.0.2 divvy>=0.5.0 eido>=0.1.3 #fseq2>=2.0.2 hypothesis==4.38.0 jinja2 jsonschema>=3.0.1 logmuse>=0.2.5 looper>=1.2.1 MACS2>=2.2.7.1 numpy>=1.17 oyaml pararead pandas>=0.20.2 peppy>=0.31.0 piper psutil pysam>=0.13 python-Levenshtein>=0.12.0 pyyaml>=3.13 refgenconf>=0.7.0 refgenie ubiquerg>=0.6.1 yacman>=0.6.7
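The two cases described above could be combined in the script as a simple fallback (a sketch; the demo file here stands in for the repo's requirements.txt):

```shell
# Demo stand-in for a local clone's requirements.txt:
cd "$(mktemp -d)"
printf 'numpy>=1.17\npandas>=0.20.2\n' > requirements.txt

if [ -f requirements.txt ]; then
    # Running from a clone: use the file on disk.
    REQS=$(cat requirements.txt)
else
    # Running via curl | bash: fetch the same file from GitHub.
    REQS=$(curl -sSL https://raw.githubusercontent.com/databio/pepatac/master/requirements.txt)
fi
echo "$REQS"
```

Either branch leaves REQS holding the same list, so the rest of the script doesn't care which path was taken.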
jpsmith5 commented 3 years ago

Yeah, that's a good thought. I had considered that and I'm not sure what stopped me, maybe the internet requirement? But your point that anyone using the curl approach already has an internet connection is valid, duh. This would also work for the other file checks I use in the bulker checks.

jpsmith5 commented 2 years ago

It will use local versions when run inside the repo, or it can be run remotely via curl, though the remote route does then require an internet connection.