PCALF stand for Python CALcyanin Finder.
Calcyanin have been described in the article :
Benzerara, K., Duprat, E., Bitard-Feildel, T., Caumes, G., Cassier-Chauvat, C., Chauvat, F., ... & Callebaut, I. (2022). A new gene family diagnostic for intracellular biomineralization of amorphous Ca carbonates by cyanobacteria. Genome Biology and Evolution, 14(3), evac026.
The ccyA gene is searched at the protein level following a simple decision tree based on the specific composition of the C-ter extremity of this protein. We use a weighted HMM profile specific of the glycine zipper triplication (aka GlyX3) to detect sequences with a putative glycine triplication. Sequences with at least one hit against this profile are annotated with three HMM profiles specific of each Glycine zipper : Gly1, Gly2 and Gly3. A set of known calcyanins is used to infere the type of the N-ter extremity (X, Z, Y, CoBaHMA or ?). Finally, a flag is assign to each sequence depending on its N-terminus type and its C-terminus modular organization.
Conda and pip installation are not currently working for various reasons (the main one being : I don't know how to update conda package, and learning this is currently beneath "finding a job" in my to-do list)
However what you can do is create a conda env
conda create -n pcalf;
Then update said env with the the env.yml file that you can find on this repo.
conda activate pcalf
conda env update --file env.yml
Which should work.
Next download manually pcalf, go into pcalf directory (while still in pcalf environment created with conda) and
pip install .
python==3.9
pyhmmer==0.7.1
biopython==1.81
numpy>=1.21
pyyaml>=6
snakemake==7.22
pandas==1.5.3
tqdm==4.64.1
plotly==5.11.0
python-igraph==0.10.4
blast
If you want a standalone instalation, see BLAST manual for instructions on how to download it on Unix (Linux and Mac), or on Windows. BLAST can also be downloaded within a conda environment
conda install -c bioconda blast
Pcalf is composed of several commands :
This command can be used to look quickly for the presence of calcyanin in a set of amino acid sequences. It take one or more fasta files as input and output several files including a summary, a list of features and a list of raw hits produced during the search. It also produce updated HMMs and updated MSAs with newly detected calcyanin tagged as Calcyanin with known N-ter.
pcalf -i proteins.fasta
-o pcalf_results
--iterative-search
--iterative-update
--gly1-msa custom_gly1_msa.fasta
--iterative-search : If True, the search will be performed with profiles produced by the previous iteration until there is no new sequence detected or if --max-iteration is reached.
--iterative-update : True calcyanins (if any) will be added to the HMMs profiles iteratively starting with the best sequence (based on feature' score).
--gly1-msa : Use another MSA instead of the default one. A HMM profile will be built from the given MSA and Glycine weight will be increased.
By default, coverage and E-value threshold are infered from the HMM profiles that will be used for the search. To define the thresholds, a MSA is converted into a simple fasta file by deleting all gaps and aligned against its own HMM profile. A soft filtering is made using a coverage threshold of 0.5. The maximum E-values and the minimum coverage values are used to define the thresholds as follows : max(E-value)*10, min(coverage)-0.1
pcalf-datasets-workflow can be used to retrieve genomes from NCBI databases such as RefSeq and GenBank based on accession (e.g GCA_012769535.1) or TaxID.
Genomes and annotations (genes and CDS) will be downloaded using the new command line tools from NCBI. If annotations does not exists for a genome, then genes and CDS will be predicted using Prodigal.
A yaml file is also produced and can be used as input for pcalf-annotate-workflow.
GCA_012769535.1:
genome : /path/to/genome.gz
cds_faa: /path/to/cds_faa.gz
cds_fna: /path/to/cds_fna.gz
GCF_012769535.1:
...
example :
pcalf-datasets-workflow -t 1117
-o pcalf_datasets_results
-a file_with_accession.txt
-e file_with_accession_to_ignore.txt
The command above will download all cyanobacteria genome (taxid : 1117) and genomes specified in file_with_accession.txt. If any of them are specified in file_with_accession_to_ignore.txt, then they will be ignored.
pcalf-annotate-workflow is actually the workflow that will help you recover calcyanin proteins and ccyA genes from a set of genomes. This workflow is composed of multiple steps :
Note, that GTDB-TK and checkM requires external databases, respectively GTDB (~84 GB) and CheckM datas. As explained in GTDB-Tk documentation GTDB can be downloaded and unarchived with the following commands :
wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_data.tar.gz
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_data.tar.gz (or, mirror)
tar xvzf gtdbtk_data.tar.gz
In addition, it's advised to run GTDB-TK and CheckM on a computer cluster. Because pcalf-annotate-workflow rely on snakemake you can easily provide a snakemake profile through the --snakargs option to run it on your favorite cluster. On the other hand, you can skip the checkm step by not calling the flag --checkm or failing to provide a path to the --checkm flag. The same can be done for the gtdtk steps with the --gtdb flag.
pcalf-annotate-workflow take as input a yaml file with a specific format, see pcalf-datasets-workflow for details. For a regular pcalf-datasets-workflow run, this file is called genomes.yaml.
example :
pcalf-annotate-workflow -i input_file.yaml
-o pcalf_annotate_results
--db sqlite3_file_from_another_run.db
--snakargs "--profile my_slurm_profile --use-conda --jobs 50"
--gtdb my_gtdb_directory
--checkm my_checkm_datas_directory
The command above will process all the genome specified in input_file.yaml through the pcalf-annotate-workflow including checkm and gtdb-tk steps. The sqlite3 file produced will be merged with sqlite3_file_from_another_run.db. The workflow will be ran on your computer cluster with 50 jobs at a time. See snakemake documentation for details about cluster execution.
Several output files for each step will be produced but the final output is a sqlite3 database storing multiple tables:
- genome # NCBI metadatas
- gtdbtk # GTDB-TK classification results
- checkm # Checkm Results
- summary # PCALF summary table
- features # PCALF features table
- hits # PCALF hits table
- ccyA # ccyA table
- gly1 # Gly1 MSA
- gly2 # Gly2 MSA
- gly3 # Gly3 MSA
- glyx3 # Glyx3 MSA
- nterdb # N-ter table
You can use the sqlite3 database from another run as a basis for a new one. In this case, MSAs stored in the sqlite3 file will be used to generate HMM profiles for pcalf.
This command produce an HTML report from a sqlite3 database given by pcalf annotate workflow.
example :
pcalf-report --db sqlite3_file.db --out report.html