How can I get the gene columns information from the MS-GF+ results

MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

How can I get the gene columns information from the MS-GF+ results #105

Open Jokendo-collab opened 4 years ago

Jokendo-collab commented 4 years ago

Is there a way to get the gene ID from the MS-GF+ analysis results? I know this is possible with MaxQuant, but I prefer MS-GF+ because of its speed in the sequence database search. Kindly advise.

I want this gene information because I plan to use clusterProfiler for Gene Ontology analysis, to determine which biological processes are significant in our data.

FarmGeek4Life commented 4 years ago

Is the geneID part of the protein identifier/description in the .fasta file? If it is, then the different in-file references in the results can direct you back to the protein identifier and description, but I don't know of a specific tool that will provide the full description in a simpler-to-read format.
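For a UniProt FASTA file, the gene name usually sits in the header as a `GN=` field, so the protein accessions reported in the results can be mapped back to genes with a small script. The following is only a minimal sketch under that assumption; the function name `gene_map_from_fasta_lines` and the two regular expressions are made up for illustration and are not part of MS-GF+.

```python
import re

# Assumed UniProt-style header layout:
#   >sp|ACCESSION|ENTRY_NAME description OS=... OX=... GN=GENE PE=... SV=...
HEADER_RE = re.compile(r"^>(?:sp|tr)\|([^|]+)\|(\S+)")
GN_RE = re.compile(r"\bGN=(\S+)")


def gene_map_from_fasta_lines(lines):
    """Map each protein accession to its gene name (None if the header has no GN= field)."""
    mapping = {}
    for line in lines:
        header = HEADER_RE.match(line)
        if not header:
            continue  # sequence line, or a header in some other format
        gene = GN_RE.search(line)
        mapping[header.group(1)] = gene.group(1) if gene else None
    return mapping


example = [
    ">sp|P08758|ANXA5_HUMAN Annexin A5 OS=Homo sapiens OX=9606 GN=ANXA5 PE=1 SV=2",
    "ACDEFGHIK",  # placeholder sequence line, not the real protein sequence
]
print(gene_map_from_fasta_lines(example))  # {'P08758': 'ANXA5'}
```

To use it on a real database, pass an open file handle instead of the example list, for instance `gene_map_from_fasta_lines(open("human.fasta"))`, where `human.fasta` stands in for whatever file was used in the search.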

Jokendo-collab commented 4 years ago

I downloaded the human database from UniProt and am using it for the database search on my data. My software uses MS-GF+ as the search engine. As I mentioned earlier, this is possible with MaxQuant because it gives both a gene ID column and a protein ID column in the protein groups file, and I was wondering whether the same can be achieved with MS-GF+. Running two search engines is tedious, and it would be easier for me to do it just once with MS-GF+.

FarmGeek4Life commented 4 years ago

So, this isn't integrated into MS-GF+, but you can use the latest version of the MzidToTsvConverter with the command-line argument -geneid to add an additional column to the TSV file, where the gene ID is extracted from the protein description using a regular expression. The default regular expression supports the format sp|P08758|ANXA5_HUMAN and would put ANXA5 in the Gene ID column. You can also supply a different regular expression using -geneid "[regular expression]". You can look at the readme for an example.
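To illustrate what this kind of extraction does, here is a rough Python sketch that pulls `ANXA5` out of `sp|P08758|ANXA5_HUMAN`. The pattern below is a hypothetical one written for this example; it is not copied from MzidToTsvConverter and may differ from its actual default, and `extract_gene_id` is an invented helper name.

```python
import re

# Hypothetical pattern for illustration only; NOT the converter's own default.
# It captures the text between the second '|' and the following '_'.
GENE_ID_RE = re.compile(r"(?:sp|tr)\|[^|]+\|([^_|\s]+)_")


def extract_gene_id(protein):
    """Return the entry-name prefix of a UniProt-style identifier, or '' if none is found."""
    match = GENE_ID_RE.search(protein)
    return match.group(1) if match else ""


print(extract_gene_id("sp|P08758|ANXA5_HUMAN"))  # ANXA5
```

One caveat: the part before `_HUMAN` is the UniProt entry-name mnemonic, which often matches the gene symbol for reviewed (`sp`) entries but not always, and for unreviewed (`tr`) entries it is typically the accession itself. If exact gene symbols matter for the downstream GO analysis, it is worth checking a few against the FASTA descriptions.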