K2SOHIGH / pcalf

MIT License
1 stars 0 forks source link

PCALF - Retrieve Calcyanin protein and ccyA gene from genomes.

PCALF stand for Python CALcyanin Finder.

Calcyanin have been described in the article :

Benzerara, K., Duprat, E., Bitard-Feildel, T., Caumes, G., Cassier-Chauvat, C., Chauvat, F., ... & Callebaut, I. (2022). A new gene family diagnostic for intracellular biomineralization of amorphous Ca carbonates by cyanobacteria. Genome Biology and Evolution, 14(3), evac026.



Table of Contents :



Calcyanin detection :

The ccyA gene is searched at the protein level following a simple decision tree based on the specific composition of the C-ter extremity of this protein. We use a weighted HMM profile specific of the glycine zipper triplication (aka GlyX3) to detect sequences with a putative glycine triplication. Sequences with at least one hit against this profile are annotated with three HMM profiles specific of each Glycine zipper : Gly1, Gly2 and Gly3. A set of known calcyanins is used to infere the type of the N-ter extremity (X, Z, Y, CoBaHMA or ?). Finally, a flag is assign to each sequence depending on its N-terminus type and its C-terminus modular organization.



Decision tree :

decision_tree

Installation :

Conda and pip installation are not currently working for various reasons (the main one being : I don't know how to update conda package, and learning this is currently beneath "finding a job" in my to-do list)

However what you can do is create a conda env

conda create -n pcalf;

Then update said env with the the env.yml file that you can find on this repo.

conda activate pcalf
conda env update --file env.yml

Which should work.

Next download manually pcalf, go into pcalf directory (while still in pcalf environment created with conda) and

pip install .

Dependencies :

python==3.9
pyhmmer==0.7.1
biopython==1.81
numpy>=1.21
pyyaml>=6
snakemake==7.22
pandas==1.5.3
tqdm==4.64.1
plotly==5.11.0
python-igraph==0.10.4

External dependency :

blast

If you want a standalone instalation, see BLAST manual for instructions on how to download it on Unix (Linux and Mac), or on Windows. BLAST can also be downloaded within a conda environment

conda install -c bioconda blast



Usage :

Pcalf is composed of several commands :


pcalf :

This command can be used to look quickly for the presence of calcyanin in a set of amino acid sequences. It take one or more fasta files as input and output several files including a summary, a list of features and a list of raw hits produced during the search. It also produce updated HMMs and updated MSAs with newly detected calcyanin tagged as Calcyanin with known N-ter.

pcalf -i proteins.fasta 
      -o pcalf_results 
      --iterative-search
      --iterative-update 
      --gly1-msa custom_gly1_msa.fasta

--iterative-search : If True, the search will be performed with profiles produced by the previous iteration until there is no new sequence detected or if --max-iteration is reached.

--iterative-update : True calcyanins (if any) will be added to the HMMs profiles iteratively starting with the best sequence (based on feature' score).

--gly1-msa : Use another MSA instead of the default one. A HMM profile will be built from the given MSA and Glycine weight will be increased.

Thresholds:

By default, coverage and E-value threshold are infered from the HMM profiles that will be used for the search. To define the thresholds, a MSA is converted into a simple fasta file by deleting all gaps and aligned against its own HMM profile. A soft filtering is made using a coverage threshold of 0.5. The maximum E-values ​​and the minimum coverage values ​​are used to define the thresholds as follows : max(E-value)*10, min(coverage)-0.1


pcalf-datasets-workflow :

pcalf-datasets-workflow can be used to retrieve genomes from NCBI databases such as RefSeq and GenBank based on accession (e.g GCA_012769535.1) or TaxID.

Genomes and annotations (genes and CDS) will be downloaded using the new command line tools from NCBI. If annotations does not exists for a genome, then genes and CDS will be predicted using Prodigal.

A yaml file is also produced and can be used as input for pcalf-annotate-workflow.

GCA_012769535.1:
  genome : /path/to/genome.gz
  cds_faa: /path/to/cds_faa.gz
  cds_fna: /path/to/cds_fna.gz
GCF_012769535.1:
...

example :

pcalf-datasets-workflow -t 1117 
                        -o pcalf_datasets_results 
                        -a file_with_accession.txt 
                        -e file_with_accession_to_ignore.txt

The command above will download all cyanobacteria genome (taxid : 1117) and genomes specified in file_with_accession.txt. If any of them are specified in file_with_accession_to_ignore.txt, then they will be ignored.


pcalf-annotate-workflow :

pcalf-annotate-workflow is actually the workflow that will help you recover calcyanin proteins and ccyA genes from a set of genomes. This workflow is composed of multiple steps :

Note, that GTDB-TK and checkM requires external databases, respectively GTDB (~84 GB) and CheckM datas. As explained in GTDB-Tk documentation GTDB can be downloaded and unarchived with the following commands :

wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_data.tar.gz
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_data.tar.gz  (or, mirror)
tar xvzf gtdbtk_data.tar.gz

In addition, it's advised to run GTDB-TK and CheckM on a computer cluster. Because pcalf-annotate-workflow rely on snakemake you can easily provide a snakemake profile through the --snakargs option to run it on your favorite cluster. On the other hand, you can skip the checkm step by not calling the flag --checkm or failing to provide a path to the --checkm flag. The same can be done for the gtdtk steps with the --gtdb flag.

pcalf-annotate-workflow take as input a yaml file with a specific format, see pcalf-datasets-workflow for details. For a regular pcalf-datasets-workflow run, this file is called genomes.yaml.

example :

pcalf-annotate-workflow -i input_file.yaml 
                        -o pcalf_annotate_results 
                        --db sqlite3_file_from_another_run.db
                        --snakargs "--profile my_slurm_profile --use-conda --jobs 50" 
                        --gtdb my_gtdb_directory 
                        --checkm my_checkm_datas_directory

The command above will process all the genome specified in input_file.yaml through the pcalf-annotate-workflow including checkm and gtdb-tk steps. The sqlite3 file produced will be merged with sqlite3_file_from_another_run.db. The workflow will be ran on your computer cluster with 50 jobs at a time. See snakemake documentation for details about cluster execution.

Several output files for each step will be produced but the final output is a sqlite3 database storing multiple tables:

- genome          # NCBI metadatas
- gtdbtk          # GTDB-TK classification results
- checkm          # Checkm Results 
- summary         # PCALF summary table
- features        # PCALF features table
- hits            # PCALF hits table
- ccyA            # ccyA table
- gly1            # Gly1 MSA
- gly2            # Gly2 MSA
- gly3            # Gly3 MSA
- glyx3           # Glyx3 MSA
- nterdb          # N-ter table

You can use the sqlite3 database from another run as a basis for a new one. In this case, MSAs stored in the sqlite3 file will be used to generate HMM profiles for pcalf.


pcalf-report :

This command produce an HTML report from a sqlite3 database given by pcalf annotate workflow.

example :

pcalf-report --db sqlite3_file.db --out report.html



Workflow :

Workflow description