Closed luisfdez94 closed 3 years ago
@luisfdez94 @subinamehta Luis, can you share the Database creation history? I want to figure out why there is a SPACE character after the protein ID from the CustomProDB workflow output.
generic|ENSP00000355265 |5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]
Should have been:
generic|ENSP00000355265|5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]
Thank you for your fast answer.
Here it is: https://usegalaxy.eu/u/_luisfr/h/peptidegenomiccoordinateissuedbcreation As you can see at item #30, I also use the Regex Find And Replace tool to add generic| before each sequence header in the customized DB (item #28), e.g. >ENSP00000355265 becomes >generic|ENSP00000355265. I do this so that SearchGUI works.
@luisfdez94 @subinamehta The protein IDs from customProDB have a SPACE character at the end of the ID. I'm asking @chambm whether that is intended. Most steps in these workflows do not handle an ID ending with a SPACE. We could add a step after customProDB that uses the regex tools to remove the SPACE.
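For anyone who wants to check a FASTA file outside Galaxy, the clean-up step suggested above can be sketched in Python. This is only an illustration, not part of any Galaxy-P tool: the function name `clean_header` is made up, and it assumes that whitespace directly before a `|` separator is never meaningful in these headers.

```python
import re

# Whitespace (spaces or tabs) sitting directly before a '|' field separator.
_SPACE_BEFORE_PIPE = re.compile(r"[ \t]+\|")


def clean_header(line: str) -> str:
    """Drop whitespace that precedes a '|' in a FASTA header line.

    Sequence lines (anything not starting with '>') are returned unchanged.
    Hypothetical helper: assumes such whitespace is always unwanted.
    """
    if not line.startswith(">"):
        return line
    return _SPACE_BEFORE_PIPE.sub("|", line)


header = ">generic|ENSP00000355265 |5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8"
print(clean_header(header))
```

Spaces inside the free-text description are left alone because they are not followed by `|`.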
@jj: I think the workflow already takes care of that.
Thanks to the help of @jj-umn and the Galaxy-P team, I have been able to solve this issue. I had to modify the header of every sequence in the customized DB (FASTA) obtained at the end of Galaxy-P Tutorial 1: Database creation. For that purpose I used the Regex Find And Replace v1.0.0 tool with the parameters shown at the end of this message.

The modification adds the generic| prefix at the beginning of each header coming from the Ensembl-PRO and STRG databases (their headers are not in a standard format for SearchGUI). See http://compomics.github.io/projects/searchgui/wiki/DatabaseHelp [checks 1, 3, 4 and 5].

Another important point if you follow the Galaxy-P hands-on tutorials: this modified custom DB must also be the input to the mz_to_sqlite tool in Tutorial 2: Database search. I was inputting the original custom DB.
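Regex Find And Replace v1.0.0 parameters (applied in this order):

Find: >(ENS.*_\d+:)([ACGTacgt]+)>([ACGTacgt]+)\s*
Replace: >generic|\1\2_\3

Find: ([A-Z,*][0-9]+[A-Z,*]),
Replace: \1.

Find: >ENS[A-Z]*(.*)\s\|
Replace: >generic|ENSP\1|

Find: >STRG(\S*)\|
Replace: >generic|STRG\1|

Find: >STRG(.*)\s\|
Replace: >generic|STRG\1|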
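To see what these find/replace pairs do to a header, here is a rough Python equivalent. It assumes the Galaxy tool applies each pattern to every line, in order, with the semantics of Python's `re.sub`; the names `RULES` and `rewrite_line` are mine, and the raw header is the example from this thread without the generic| prefix.

```python
import re

# (find, replace) pairs copied from the parameters listed above, in order.
RULES = [
    (r">(ENS.*_\d+:)([ACGTacgt]+)>([ACGTacgt]+)\s*", r">generic|\1\2_\3"),
    (r"([A-Z,*][0-9]+[A-Z,*]),", r"\1."),
    (r">ENS[A-Z]*(.*)\s\|", r">generic|ENSP\1|"),
    (r">STRG(\S*)\|", r">generic|STRG\1|"),
    (r">STRG(.*)\s\|", r">generic|STRG\1|"),
]


def rewrite_line(line: str) -> str:
    """Apply every rule in sequence to one line of the FASTA file."""
    for pattern, replacement in RULES:
        line = re.sub(pattern, replacement, line)
    return line


raw = (">ENSP00000355265 |5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|"
       "mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]")
print(rewrite_line(raw))
# The ID field becomes "generic|ENSP00000355265" with no trailing space.
```

The third rule is the one that removes the trailing space: `\s\|` consumes the space together with the separator, and the replacement writes back only the `|`. The second rule turns the comma between variant codes (e.g. A82P,P124A) into a dot.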
Galaxy server: usegalaxy.eu
History link: https://usegalaxy.eu/u/_luisfr/h/peptidegenomiccoordinateissue
Tool version: Galaxy Version 0.1.1
While executing Peptide Genomic Coordinate with our dataset (human), following the Galaxy-P hands-on tutorial [Tutorial 3: Novel peptides](https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/proteogenomics-novel-peptide-analysis/tutorial.html) (see also Tutorial 1: Database creation and Tutorial 2: Database search), the Peptide Genomic Coordinate tool returned an empty file.
I went through the source code. Running it locally in debug mode, I noticed that it never enters the "if" condition on line 47, i.e. the "coordinates" variable is empty at each iteration. However, if I change line 41 from "acc = each[1]" to "acc = each[1].strip()" (trimming the spaces), it works. I noticed that protein accession numbers (Ensembl ENSP) in the mz_to_sqlite input file sometimes come with a space character at the end, e.g. 'ENSP00000267884_A82P,P124A '. The query on line 44, which fills the "coordinates" variable, matches against the tool's other input, "Peptide_Genomic_Coordinate.sqlite". The match fails because in that file the protein accessions do not contain this trailing space, e.g. 'ENSP00000267884_A82P,P124A'.
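A minimal reproduction of the mismatch, in case it helps. This is not the tool's code: the table name `coords`, its columns, and the coordinate values are placeholders I made up. It only shows that an exact-match SQLite query treats the trailing space as significant, and that `.strip()` restores the match.

```python
import sqlite3

# In-memory stand-in for the genomic mapping database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coords (acc TEXT, chrom TEXT, start_pos INTEGER)")
conn.execute(
    "INSERT INTO coords VALUES (?, ?, ?)",
    ("ENSP00000267884_A82P,P124A", "chrN", 12345),  # placeholder coordinates
)


def lookup(acc: str) -> list:
    """Return all coordinate rows whose accession equals `acc` exactly."""
    query = "SELECT chrom, start_pos FROM coords WHERE acc = ?"
    return conn.execute(query, (acc,)).fetchall()


acc_from_mz = "ENSP00000267884_A82P,P124A "  # note the trailing space

print(lookup(acc_from_mz))          # no rows: the space makes the strings differ
print(lookup(acc_from_mz.strip()))  # one row once the space is trimmed
```

Trimming on the lookup side, as in the line 41 change, hides the symptom. Removing the space from the FASTA headers fixes it at the source.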
To help, I uploaded to the history the input files (data #1, #5 and #7) needed to run Peptide Genomic Coordinate, and the empty file that resulted from running the tool on Galaxy (data #8). I also uploaded the customized database (data #3) and the output file produced when I ran the tool locally with the modification described above (data #9). Thank you for your help.