Peptide Genomic Coordinate issue while applying GalaxyP proteogenomics Tutorial3

luisfdez94 commented 3 years ago

Galaxy server: usegalaxy.eu
History link: https://usegalaxy.eu/u/_luisfr/h/peptidegenomiccoordinateissue
Tool version: Galaxy Version 0.1.1

While running Peptide Genomic Coordinate on our dataset (human), following the Galaxy-P hands-on tutorial [Tutorial 3: Novel peptides](https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/proteogenomics-novel-peptide-analysis/tutorial.html) (see also Tutorial 1: Database creation and Tutorial 2: Database search), the "Peptide Genomic Coordinate" tool returned an empty file.

I went through the source code and ran it locally in debug mode. Execution never enters the `if` block at line 47, because the `coordinates` variable is empty at every iteration. However, if I change line 41 from `acc = each[1]` to `acc = each[1].strip()` (trimming the whitespace), it works. The protein accessions (Ensembl ENSP) in the mz_to_sqlite input file sometimes end with a space, e.g. 'ENSP00000267884_A82P,P124A '. The query at line 44, which fills `coordinates`, matches these accessions against the tool's other input, Peptide_Genomic_Coordinate.sqlite, where the accessions have no trailing space, e.g. 'ENSP00000267884_A82P,P124A', so the lookup finds nothing.
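The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the tool's actual code: the table name, column names and coordinates are hypothetical placeholders, and only the accession string comes from this report. It shows that an exact-match SQLite lookup treats a trailing space as significant, and that `.strip()` restores the match.

```python
import sqlite3

# Hypothetical stand-in for the genomic mapping SQLite input. The table name,
# column names and coordinate values are assumptions made for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mapping (name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
)
conn.execute(
    "INSERT INTO mapping VALUES ('ENSP00000267884_A82P,P124A', 'chr15', 100, 200)"
)


def lookup(acc):
    """Return the mapping rows whose name equals acc exactly."""
    cur = conn.execute(
        "SELECT chrom, start_pos, end_pos FROM mapping WHERE name = ?", (acc,)
    )
    return cur.fetchall()


# Accession as it appears in the mz_to_sqlite data: note the trailing space.
acc_from_mz = "ENSP00000267884_A82P,P124A "

rows_raw = lookup(acc_from_mz)               # no match: the space is significant
rows_stripped = lookup(acc_from_mz.strip())  # match once the space is trimmed

print(rows_raw)
print(rows_stripped)
```

Stripping at lookup time hides the symptom; the later comments in this thread trace the space back to the FASTA headers themselves.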

To help, I uploaded to the history the input files needed to run Peptide Genomic Coordinate (data #1, #5 and #7), along with the empty file produced by running the tool on Galaxy (data #8). I also uploaded the customized database (data #3) and the output file produced when I ran the tool locally with the modification described above (data #9). Thank you for your help.

jj-umn commented 3 years ago

@luisfdez94 @subinamehta Luis, can you share the Database creation history? I want to figure out why there is a SPACE character after the protein ID from the CustomProDB workflow output.

generic|ENSP00000355265 |5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]

Should have been:

generic|ENSP00000355265|5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]
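A quick way to spot this kind of header is to split it on `|` and flag any field that carries surrounding whitespace. This is only a sketch: the helper name `fields_with_padding` is made up for illustration, and the two headers are the ones quoted above.

```python
def fields_with_padding(header):
    """Return the pipe-delimited fields that have leading or trailing whitespace."""
    return [field for field in header.lstrip(">").split("|") if field != field.strip()]


bad = (
    "generic|ENSP00000355265 |5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|"
    "mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]"
)
good = (
    "generic|ENSP00000355265|5239.2894|ENST00000361851|ENSG00000228253|MT-ATP8|"
    "mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol;Acc:HGNC:7415]"
)

padded_bad = fields_with_padding(bad)    # the protein ID field, with its space
padded_good = fields_with_padding(good)  # nothing to report

print(padded_bad)
print(padded_good)
```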

luisfdez94 commented 3 years ago

Thank you for your fast answer.

> Luis, can you share the Database creation history? I want to figure out why there is a SPACE character after the protein ID from the CustomProDB workflow output.

Here it is: https://usegalaxy.eu/u/_luisfr/h/peptidegenomiccoordinateissuedbcreation As you can see at item #30, I also use the Regex Find And Replace tool to add generic| before each sequence header in the customized DB (item #28), e.g. >ENSP00000355265 becomes >generic|ENSP00000355265. I do this so that SearchGUI works.

jj-umn commented 3 years ago

@luisfdez94 @subinamehta The protein IDs from customProDB have a SPACE character at the end of the ID. I'm asking @chambm if that seems correct. Most steps in these workflows do not handle an ID ending with a SPACE. We could add a step after customProDB that uses the regex tools to remove the SPACE.

subinamehta commented 3 years ago

@jj: I think the workflow already takes care of that.

luisfdez94 commented 3 years ago

Thanks to the help of @jj-umn and the Galaxy-P team, I have been able to solve this issue. I had to modify the header of every sequence in the customized DB (FASTA) obtained at the end of Galaxy-P Tutorial 1: Database creation. For that purpose I used the Regex Find And Replace v1.0.0 tool with the parameters shown at the end of this message.

- Delete the space at the end of each protein accession ID (Ensembl-PRO and STRG databases) [see checks 3 and 5 at the end of this message].
- Reformat indels and SNVs (coming from CustomProDB) [checks 1 and 2]. This is also done in Galaxy-P Tutorial 1: Database creation, section "Genomic mapping database", so that the genomic mapping stays consistent with the protein database.
- Add the generic| prefix at the beginning of each header coming from the Ensembl-PRO and STRG databases, whose headers are not in a standard format for SearchGUI; see http://compomics.github.io/projects/searchgui/wiki/DatabaseHelp [checks 1, 3, 4 and 5].

Another important point if you follow the Galaxy-P hands-on tutorials: use this modified custom DB as the input to the mz_to_sqlite tool in Tutorial 2: Database search. I was inputting the original custom DB.

Regex Find And Replace v1.0.0

Check 1: from ">ENSP00000360709_1123:GAT>CAAT " to ">generic|ENSP00000360709_1123:GAT_CAAT"

"Find Regex": >(ENS.*_\d+:)([ACGTacgt]+)>([ACGTacgt]+)\s*
"Replacement": >generic|\1\2_\3

Check 2: from ">ENSP00000457107_D77,L258P,L264P,S457G " to ">ENSP00000457107_D77.L258P.L264P.S457G "

"Find Regex": ([A-Z,*][0-9]+[A-Z,*]),
"Replacement": \1.

Check 3: delete the space at the end of a protein accession ID (Ensembl DB) and add the "generic" prefix. From ">ENSP00000360709 |" to ">generic|ENSP00000360709|"

"Find Regex": >ENS[A-Z]*(.*)\s\|
"Replacement": >generic|ENSP\1|

Check 4: from ">STRG00000058243|" to ">generic|STRG00000058243|"

"Find Regex": >STRG(\S*)\|
"Replacement": >generic|STRG\1|

Check 5: from ">STRG00000058243 |" to ">generic|STRG00000058243|"

"Find Regex": >STRG(.*)\s\|
"Replacement": >generic|STRG\1|
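For anyone who wants to test the five find/replace rules above outside Galaxy, here is a rough sketch that applies them in order to one header line. It assumes the tool's patterns behave like Python's `re.sub`; the function name `clean_header` and the shortened header lines are illustrative, not part of any Galaxy tool.

```python
import re

# (pattern, replacement) pairs copied from the five checks above, in order.
RULES = [
    (r">(ENS.*_\d+:)([ACGTacgt]+)>([ACGTacgt]+)\s*", r">generic|\1\2_\3"),
    (r"([A-Z,*][0-9]+[A-Z,*]),", r"\1."),
    (r">ENS[A-Z]*(.*)\s\|", r">generic|ENSP\1|"),
    (r">STRG(\S*)\|", r">generic|STRG\1|"),
    (r">STRG(.*)\s\|", r">generic|STRG\1|"),
]


def clean_header(line):
    """Apply every rule in sequence to a single FASTA header line."""
    for pattern, replacement in RULES:
        line = re.sub(pattern, replacement, line)
    return line


# Indel header: '>' between alleles becomes '_', the prefix is added, the space goes.
indel = clean_header(">ENSP00000360709_1123:GAT>CAAT ")

# Plain Ensembl header (shortened): trailing space removed, prefix added.
ensembl = clean_header(">ENSP00000355265 |5239.2894|ENST00000361851|")

# SNV header built from the accession in the first comment plus a '|' delimiter.
snv = clean_header(">ENSP00000267884_A82P,P124A |")

# STRG headers without and with the trailing space.
strg_plain = clean_header(">STRG00000058243|")
strg_space = clean_header(">STRG00000058243 |")

for header in (indel, ensembl, snv, strg_plain, strg_space):
    print(header)
```

Once a header has the generic| prefix it no longer starts with >ENS or >STRG, so the later rules leave it alone and no header is prefixed twice.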

galaxyproteomics / tools-galaxyp

Peptide Genomic Coordinate issue while applying GalaxyP proteogenomics Tutorial3 #524