Arcadia-Science / ProteinCartography

a pipeline to build similarity maps of protein space
MIT License
30 stars, 10 forks

Initial PR #2

Closed: mezarque closed this 1 year ago

mezarque commented 1 year ago

This project focuses on building "maps" of similar proteins for exploratory analysis.

About this repo

This repo uses Snakemake to orchestrate BLAST and Foldseek queries for user-provided protein FASTA and PDB files of interest. The rule graph currently looks like this:

[Image: rulegraph]

The goal of the repo is to build out a pipeline that looks like this flowchart.

Most of the rules call modular Python scripts that accept input and output arguments via argparse. The core functions underlying each script will be made available through an importable package in the future, so that users can run each underlying command interactively. The first iteration of this repo will hopefully work something like gget, which allows calls both from the command line and from within a Python script.
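As a rough illustration of that script layout (not code from this repo), here is a minimal sketch in which one core function is importable and a thin argparse wrapper exposes it on the command line. The function name `merge_hit_ids`, the `--input`/`--output` flag names, and the one-ID-per-line file format are all assumptions made for the example.

```python
import argparse
from pathlib import Path


def merge_hit_ids(*hit_lists):
    """Combine lists of hit IDs, dropping duplicates and keeping first-seen order.

    Hypothetical core function: importable and usable without the CLI.
    """
    seen = set()
    merged = []
    for hits in hit_lists:
        for hit in hits:
            hit = hit.strip()
            if hit and hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged


def parse_args(argv=None):
    """Parse command-line arguments; flag names are illustrative assumptions."""
    parser = argparse.ArgumentParser(description="Merge hit ID lists (illustrative).")
    parser.add_argument("-i", "--input", nargs="+", required=True,
                        help="one or more text files with one hit ID per line")
    parser.add_argument("-o", "--output", required=True,
                        help="path of the merged ID list to write")
    return parser.parse_args(argv)


def main(argv=None):
    """Thin CLI wrapper: read files, call the core function, write the result."""
    args = parse_args(argv)
    hit_lists = [Path(p).read_text().splitlines() for p in args.input]
    merged = merge_hit_ids(*hit_lists)
    Path(args.output).write_text("\n".join(merged) + "\n")


# A real script would end with the usual guard:
#     if __name__ == "__main__":
#         main()
```

With this split, a Snakemake rule could invoke the script with input and output paths, while a notebook user could import `merge_hit_ids` and call it directly on in-memory lists, which is roughly the dual command-line and Python usage that gget offers.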

Some of these rules can probably be compressed to make the rule graph less complicated.

PR Details

This PR sets up the first step of the workflow, which performs BLAST and Foldseek queries via API.

All of the important code is in the Snakefile and the utils/ directory. Feedback on this code would be super appreciated!

The notebooks/ directory contains several files that show, in a Jupyter notebook rather than in Snakemake, what the repo will ultimately do.

This folder will probably be completely reworked in the future to hold tutorial notebooks for interactively running the scripts, so there is no need to review it. I've left it in the repo in case anyone wants to check out what this repo will eventually do.

Known issues

mertcelebi commented 1 year ago

I just got trigger-happy and hit submit on my review, but I wanted to give kudos to @mezarque! This is a good PR, and I think you did a good job with the individual Python files. Because of that, it would be fairly trivial to convert this to Nextflow if we needed to. Good job!