epam / NGB

New Genome Browser (NGB) - a Web-based NGS data viewer with unique Structural Variations (SVs) visualization capabilities, high performance, scalability, and cloud data support
MIT License

BLAST Wrapper service [draft] #376

Open mzueva opened 3 years ago

VolLol commented 3 years ago

Using BLAST and Docker Image

Input data

Output data

Output file with the results. In the example below, the single reported hit has a score of 14.2 bits and an E-value of 0.96.

Step-by-step guide to using BLAST with the Docker image

  1. Install Docker.
  2. Create the directories for the analysis with the following command. In the steps below, {HOME} denotes the root folder in which these directories were created.
    mkdir blastdb queries fasta results blastdb_custom
  3. Retrieve query sequences
    docker run --rm ncbi/blast efetch -db protein -format fasta -id P01349 > queries/P01349.fsa
  4. Retrieve database sequences
    docker run --rm ncbi/blast efetch -db protein -format fasta -id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 > fasta/nurse-shark-proteins.fsa
  5. Build the BLAST database with the following command, where {HOME} is the root folder from step 2
    docker run --rm -v {HOME}/blastdb_custom:/blast/blastdb_custom:rw -v {HOME}/fasta:/blast/fasta:ro -w /blast/blastdb_custom ncbi/blast makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot -parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" -taxid 7801 -blastdb_version 5
  6. Run BLAST. Remember that {HOME} is your root folder from step 2.
    docker run --rm -v {HOME}/blastdb:/blast/blastdb:ro -v {HOME}/blastdb_custom:/blast/blastdb_custom:ro -v {HOME}/queries:/blast/queries:ro -v {HOME}/results:/blast/results:rw ncbi/blast blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins -out /blast/results/blastp.out

You should now see the output file {HOME}/results/blastp.out.
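Steps 2 and 6 can be wrapped in a small shell sketch so that {HOME} is set once instead of being substituted by hand. The names `WORKDIR`, `prepare_dirs` and `blastp_cmd` are hypothetical helpers introduced here for illustration; they are not part of BLAST or of the ncbi/blast image. `blastp_cmd` only prints the command from step 6 so it can be inspected before it is run. `WORKDIR` is assumed to be an absolute path without spaces, since Docker bind mounts expect an absolute host path.

```shell
#!/bin/sh
# WORKDIR stands in for {HOME} from step 2 (assumption: absolute path, no spaces).
WORKDIR="${WORKDIR:-$PWD}"

# Create the five working directories from step 2 under WORKDIR.
prepare_dirs() {
  local d
  for d in blastdb queries fasta results blastdb_custom; do
    mkdir -p "$WORKDIR/$d"
  done
}

# Print (do not run) the blastp command from step 6.
#   $1 = query file name inside queries/, $2 = database name
blastp_cmd() {
  local query="$1" db="$2"
  echo "docker run --rm" \
    "-v $WORKDIR/blastdb:/blast/blastdb:ro" \
    "-v $WORKDIR/blastdb_custom:/blast/blastdb_custom:ro" \
    "-v $WORKDIR/queries:/blast/queries:ro" \
    "-v $WORKDIR/results:/blast/results:rw" \
    "ncbi/blast blastp" \
    "-query /blast/queries/$query -db $db" \
    "-out /blast/results/blastp.out"
}
```

For example, after setting `WORKDIR` and calling `prepare_dirs`, `blastp_cmd P01349.fsa nurse-shark-proteins` prints the same command as step 6 with {HOME} filled in; once it looks right, the printed line can be pasted into the terminal.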

File examples

P01349.fsa
```
>sp|P01349.2|RELX_CARTA RecName: Full=Relaxin; Contains: RecName: Full=Relaxin B chain; Contains: RecName: Full=Relaxin A chain
QLCGRGFIRAIIFACGGSRWATSPAMSIKCCIYGCTKKDISVLC
```
nurse-shark-proteins.fsa
```
>sp|Q90523.1|RAG1_GINCI RecName: Full=V(D)J recombination-activating protein 1; Short=RAG-1; Includes: RecName: Full=Endonuclease RAG1; Includes: RecName: Full=E3 ubiquitin-protein ligase RAG1; AltName: Full=RING-type E3 ubiquitin transferase RAG1
RSTFPTRECPELLCQYSCNSQRFAELLRTEFKHRYEGKITNYLHKTLAHVPEIIERDGSIGAWASEGNES
GNKLFRRFRKMNARQSKSYELEGILKHHWLYTSKYL
>sp|P80049.1|FABPL_GINCI RecName: Full=Fatty acid-binding protein, liver; AltName: Full=Liver-type fatty acid-binding protein; Short=L-FABP
VEAFLGSWKLQKSHNFDEYMKNLDVSLAQRKVATTVKPKTIISLDGDVITIKTESTFKSTNIQFKLAEEF
DETTADNRTTKTTVKLENGKLVQTQRWDGKETTLVRELQDGKLILTCTMGDVVCTREYVREQ
>sp|P83981.1|IGW1_GINCI RecName: Full=IgW transmembrane form Tm1T3/Tm6T3/Tm3C4
VQAVPPDVKGEEGKEEVEDMDGDDNAPTVAAFAILFILSFLYSTFVTVVKVQQ
>sp|P83977.1|IGNAR_GINCI RecName: Full=IgNAR transmembrane form NE
LTFSTRSLLNLPAVEWKSGAKYTCTASHSPSQSTVKRVIRNPKESPKGSSETRKSPLEIMESPEDYGTEE
DQLENVNEDSNSSLSIMTFVILFIL
>sp|P83984.1|IGWP_GINCI RecName: Full=Pancreatic IgW, short secretory form
PAVQGNDKKYTMSSVLKVSAADWKTNRFYQCQAGYNLNDMVESRFETPKIQEPNITALVPSVDFISSQNA
VLGCVISGFAPDNIRVSWKKDQVDQTGVVLSSKRRTDNTFETISYLIIPTVNWKIGDKYTCEVSHPPSNF
RSMISMKYQEGEKSSCPSGSPPGTCPPCSLTPELLYHTNLSVTLRNGEKH
>sp|P83985.1|IGWS_GINCI RecName: Full=Splenic IgW, short secretory form
PAVQGNDKKYTMSSVLKVSAADWKTNRFYQCQAGYNLNDMVESRFETPKIQEPNITALVPSVDFISSQNA
VLGCVISGFAPDNIRVSWKKDQVDQTGVVLSSKRRTDNTFETISYLIIPTVNWKIGDKYTCEVSHPPSNF
RSMISMKYQEGEKSSCPSGSPPGTCPPCSLTPELLYHTNLSVTLRNGEKHQYKCD
>sp|P27950.1|NDK_GINCI RecName: Full=Nucleoside diphosphate kinase; Short=NDK; Short=NDP kinase
GNKERTFIAVKPDGVQRCIVGEVIKRFEQKGFKLVAMKFLQAPKDLLEKHYCELSDKPFYPKLIKYMSSG
PVVAMVWEGLNVVKTGRVMLGETNPADSKPGTIRGDFCIQVGRNIIHGSDSVESAKKEISLWFKREELVE
YQNCAQDWIYE
```
blastp.out
```
BLASTP 2.11.0+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.

Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.

Database: Nurse shark proteins
           7 sequences; 922 total letters

Query= sp|P01349.2|RELX_CARTA RecName: Full=Relaxin; Contains: RecName:
Full=Relaxin B chain; Contains: RecName: Full=Relaxin A chain

Length=44
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName...  14.2    0.96

>P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName:
Full=Liver-type fatty acid-binding protein; Short=L-FABP
Length=132

 Score = 14.2 bits (25),  Expect = 0.96, Method: Compositional matrix adjust.
 Identities = 3/9 (33%), Positives = 6/9 (67%), Gaps = 0/9 (0%)

Query  2    LCGRGFIRA  10
            +C R ++R
Sbjct  123  VCTREYVRE  131

Lambda      K        H        a         alpha
   0.334    0.143    0.520    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6

Effective search space used: 22680

  Database: Nurse shark proteins
    Posted date:  Mar 2, 2021  12:07 PM
  Number of letters in database: 922
  Number of sequences in database:  7

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40
```
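To cross-check these files without reading them by eye, a minimal awk-based sketch could look like the following. The function names `fasta_stats` and `extract_hits` are hypothetical, and the parsing assumes the default pairwise text report shown above (subject lines starting with `>` and a `Score = ... Expect = ...` line per alignment); other output formats would need different parsing.

```shell
#!/bin/sh
# Count sequences and residues in a FASTA file.
# Prints "<number of sequences> <total letters>".
fasta_stats() {
  awk '
    /^>/ { n++; next }                                # header line: one more sequence
    { gsub(/[ \t\r]/, ""); letters += length($0) }    # sequence line: count residues
    END { print n + 0, letters + 0 }
  ' "$1"
}

# List subject ID, bit score and E-value for each alignment in a
# default-format blastp report, one tab-separated line per alignment.
extract_hits() {
  awk '
    /^>/ { subject = substr($1, 2) }       # remember the current subject ID
    /Score =/ && /Expect =/ {
      score = $3                           # "Score = 14.2 bits ..." -> 14.2
      evalue = $8                          # "Expect = 0.96,"        -> 0.96,
      sub(/,$/, "", evalue)                # drop the trailing comma
      print subject "\t" score "\t" evalue
    }
  ' "$1"
}
```

Running `fasta_stats fasta/nurse-shark-proteins.fsa` should agree with the `Database:` line of the report (sequence count and total letters), and `extract_hits results/blastp.out` lists each alignment's subject, score and E-value.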