bigbio / py-pgatk

Python tools for proteogenomics analysis toolkit
Apache License 2.0

Add the decoy Sanger tool to the library #19

Closed ypriverol closed 5 years ago

ypriverol commented 5 years ago

We need to add the decoy Sanger tool to the library. The tool is the following:

https://www.sanger.ac.uk/science/tools/decoypyrat

ypriverol commented 5 years ago

@yafeng @husensofteng I have added the decoy tool. The way to test it is using the command:

proteomicsdevmbpr1:pypgatk yperez$ python3.7 pypgatk_cli.py generate-decoy -h 
Usage: pypgatk_cli.py generate-decoy [OPTIONS]

Options:
  -c, --config_file TEXT          Configuration file for the protein database
                                  decoy generation
  -o, --output TEXT               Output file for decoy database
  -i, --input TEXT                FASTA file of target proteins sequences for
                                  which to create decoys (*.fasta|*.fa)
  -s, --cleavage_sites TEXT       A list of amino acids at which to cleave
                                  during digestion. Default = KR
  -a, --anti_cleavage_sites TEXT  A list of amino acids at which not to cleave
                                  if following cleavage site ie. Proline.
                                  Default = none
  -p, --cleavage_position TEXT    Set cleavage to be c or n terminal of
                                  specified cleavage sites. Options [c, n],
                                  Default = c
  -l, --min_peptide_length INTEGER
                                  Set minimum length of peptides to compare
                                  between target and decoy. Default = 5
  -n, --max_iterations INTEGER    Set maximum number of times to shuffle a
                                  peptide to make it non-target before
                                  failing. Default=100
  -x, --do_not_shuffle TEXT       Turn OFF shuffling of decoy peptides that
                                  are in the target database. Default=false
  -w, --do_not_switch TEXT        Turn OFF switching of cleavage site with
                                  preceding amino acid. Default=false
  -d, --decoy_prefix TEXT         Set accession prefix for decoy proteins in
                                  output. Default=DECOY_
  -t, --temp_file TEXT            Set temporary file to write decoys prior to
                                  shuffling. Default=protein-decoy.fa
  -b, --no_isobaric TEXT          Do not make decoy peptides isobaric.
                                  Default=false
  -m, --memory_save TEXT          Slower but uses less memory (does not store
                                  decoy peptide list). Default=false
  -h, --help                      Show this message and exit.
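For readers unfamiliar with the idea behind the --cleavage_sites and --do_not_switch options, here is a minimal sketch of what "switching of cleavage site with preceding amino acid" could look like. It assumes the decoy starts from the reversed target sequence; the function name reverse_and_switch and its parameters are hypothetical and are not taken from pypgatk or the Sanger tool. Shuffling, anti-cleavage sites, and the isobaric option are not covered.

```python
def reverse_and_switch(protein: str, cleavage_sites: str = "KR", switch: bool = True) -> str:
    """Illustrative only: reverse a protein sequence and, if requested,
    swap every cleavage-site residue with the residue just before it."""
    residues = list(protein[::-1])  # reversed target sequence
    if switch:
        # Start at 1: the first residue has no preceding amino acid to swap with.
        for i in range(1, len(residues)):
            if residues[i] in cleavage_sites:
                residues[i - 1], residues[i] = residues[i], residues[i - 1]
    return "".join(residues)


if __name__ == "__main__":
    target = "AAKBBR"  # made-up toy sequence, not a real protein
    print(reverse_and_switch(target))                # reversed, with switching
    print(reverse_and_switch(target, switch=False))  # plain reversal
```

Passing switch=False mirrors the intent of --do_not_switch in this toy version: the sequence is only reversed.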

ypriverol commented 5 years ago

@yafeng We need to make a decision about the decoy IDs. The tool generates the IDs by creating a new protein for each decoy, with accessions DECOY_1, DECOY_2, and so on. Can you check that this works in our pipelines? The second problem I see is that we lose all the information related to the original proteins. My guess is that we will recover that information in the re-mapping after the protein identification step?

yafeng commented 5 years ago

@ypriverol I have modified the script so that the output decoy sequences contain the original protein ID. This is required to distinguish decoys from different classes.
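A rough sketch of that naming scheme, assuming the decoy accession is simply the decoy prefix followed by the original protein ID. The helper names decoy_accession and write_decoy_fasta and the example accessions are hypothetical; the actual pypgatk code may build the header differently.

```python
import io


def decoy_accession(target_accession: str, decoy_prefix: str = "DECOY_") -> str:
    """Assumed scheme: the prefix is prepended to the original ID unchanged."""
    return f"{decoy_prefix}{target_accession}"


def write_decoy_fasta(records, handle, decoy_prefix: str = "DECOY_") -> None:
    """Write (accession, decoy_sequence) pairs as FASTA entries with decoy accessions."""
    for accession, sequence in records:
        handle.write(f">{decoy_accession(accession, decoy_prefix)}\n{sequence}\n")


if __name__ == "__main__":
    # Toy records; the accessions and sequences are made up for illustration.
    records = [("PROT1", "RBKBAA"), ("PROT2", "KEDITPEP")]
    buffer = io.StringIO()
    write_decoy_fasta(records, buffer)
    print(buffer.getvalue(), end="")
```

Keeping the original ID in the decoy accession means a decoy hit can be traced back to its target protein, which is what makes it possible to tell decoys from different classes apart.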

ypriverol commented 5 years ago

I will check the code because it doesn't look like it will work now. Let me check.

ypriverol commented 5 years ago

@yafeng I just updated the code and it works fine now. I also removed the from the concatenation, which means that the decoy prefix should contain the `` itself. Can you test the current version now? If you are happy with it, I can close this issue.

yafeng commented 5 years ago

@ypriverol This is the first time I have tried it, and I got this error. Did I forget to set something up?

Traceback (most recent call last):
  File "pypgatk/pypgatk_cli.py", line 11, in <module>
    from pypgatk.commands import ensembl_downloader as ensembl_downloader_cmd
ModuleNotFoundError: No module named 'pypgatk'

husensofteng commented 5 years ago

It works here. @yafeng That is weird. Maybe you should run the setup.py script with the install command.

ypriverol commented 5 years ago

To use the library now, you need to install it first:

python3.7 setup.py install

The current tool is more a library with a command-line tool than a set of simple scripts, which is why you need to install it. @enriquea If you succeed, can you improve the README?

yafeng commented 5 years ago

Please update the README. The following Python packages need to be installed:

pip install click
pip install PyVCF
pip install gffutils
pip install pyyaml
pip install biopython

ypriverol commented 5 years ago

These packages are listed in requirements.txt, so you should be able to install them using pip.

yafeng commented 5 years ago

pip install -r requirements.txt did not give any error.

ypriverol commented 5 years ago

The protocol to build the package should be:

1 -

pip install -r requirements.txt 

2 -

python3.7 setup.py install

Then you should be able to run the script.

ypriverol commented 5 years ago

@yafeng @enriquea Did you manage to build the package?

enriquea commented 5 years ago

I'm getting the following error:

enrique$ python3.6 pypgatk_cli.py --help
Traceback (most recent call last):
  File "pypgatk_cli.py", line 14, in <module>
    from pypgatk.commands import cosmic_to_proteindb as cosmic_to_proteindb_cmd
  File "/anaconda3/lib/python3.6/site-packages/pypgatk-0.0.1-py3.6.egg/pypgatk/commands/cosmic_to_proteindb.py", line 3, in <module>
    from pypgatk.cgenomes.cgenomes_proteindb import CancerGenomesService
  File "/anaconda3/lib/python3.6/site-packages/pypgatk-0.0.1-py3.6.egg/pypgatk/cgenomes/cgenomes_proteindb.py", line 3, in <module>
    from Bio import SeqIO
ModuleNotFoundError: No module named 'Bio'

husensofteng commented 5 years ago

@enriquea You need to install the biopython package, as noted above. I will update the docs to list the requirements.
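To spot this kind of ModuleNotFoundError before running the CLI, a small check like the one below may help. It maps each pip package listed earlier in this thread to the module name it provides (for example, biopython is imported as Bio and PyVCF as vcf) and reports which ones are not importable in the current interpreter. The helper name missing_packages is hypothetical and not part of pypgatk.

```python
import importlib.util

# pip package name -> top-level module it provides
REQUIRED = {
    "click": "click",
    "PyVCF": "vcf",
    "gffutils": "gffutils",
    "pyyaml": "yaml",
    "biopython": "Bio",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose modules cannot be found by this interpreter."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```

Run it with the same interpreter used for pypgatk_cli.py (python3.6 in the traceback above), since packages installed for one Python version are not visible to another.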

ypriverol commented 5 years ago

@yafeng If the decoy tool works for you, please close this issue.