HUPO-PSI / mzTab

mzTab Reporting MS-based Proteomics and Metabolomics Results
https://hupo-psi.github.io/mzTab

mzTab conversion to dataframe #200

Closed eeko-kon closed 3 years ago

eeko-kon commented 3 years ago

Hello! I am generating mzTab files with Sirius. The files contain MTD, SMH and SML lines, but when I read them with UTF-8 encoding only the SML section seems to be recognized. Any suggestions? This is my script:


import pandas as pd
import numpy as np
import sys
import pyteomics
from pyteomics import mztab
filename = "./wf_testing/out_sirius_test.mzTab"
sirius = pyteomics.mztab.MzTab(filename, encoding='UTF8', table_format='df')
sirius = pd.DataFrame(sirius)
print(sirius)

output:

0  PRT              Empty DataFrame
Columns: []
Index: []
1  PEP              Empty DataFrame
Columns: []
Index: []
2  PSM              Empty DataFrame
Columns: []
Index: []
3  SML     identifier chemical_formula smiles inchi_ke...

eeko-kon commented 3 years ago

This is part of the file generated from sirius (.mzTab):

MTD mzTab-version   1.0.0
MTD mzTab-mode  null
MTD mzTab-type  null
MTD description Sirius-4.6.0
MTD smallmolecule_search_engine_score[1]    [, , SiriusScore, ]
MTD smallmolecule_search_engine_score[2]    [, , TreeScore, ]
MTD smallmolecule_search_engine_score[3]    [, , IsotopeScore, ]
MTD ms_run[1]-location  data Thermo Orbitrap ID-X/FileFiltered Std/Agnes_POS_MDNA_WGS_103_Filtered.mzML

SMH identifier  chemical_formula    smiles  inchi_key   description exp_mass_to_charge  calc_mass_to_charge charge  retention_time  taxid   species database    database_version    spectra_ref search_engine   best_search_engine_score[1] best_search_engine_score[2] best_search_engine_score[3] modifications   opt_global_adduct   opt_gobal_precursorFormula  opt_global_rank opt_global_explainedPeaks   opt_global_explainedIntensity   opt_global_median_mass_error_fragment_peaks_ppm opt_global_median_absolute_mass_error_fragment_peaks_ppm    opt_global_mass_error_precursor_ppm opt_global_compoundId   opt_global_compoundScanNumber   opt_global_featureId    opt_global_native_id
SML null    C17H25BN2O2S    null    null    null    349.20902642757801  null    null    123.805954255979998 null    null    null    null    null    null    39.4614114117431    39.4614114117431    0.0 null    [M + H3N + H]+  C17H28BN3O2S    1   18  0.904773718693427   3.793037970198739   5.609411499914324   -4.038155963543013  745 746 id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
SML null    C17H28BN3O2S    null    null    null    349.20902642757801  null    null    123.805954255979998 null    null    null    null    null    null    39.4614114117431    39.4614114117431    0.0 null    [M + H]+    C17H28BN3O2S    1   18  0.904773718693427   3.793037970198739   5.609411499914324   -4.038155963543013  745 746 id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
SML null    C17H30BN3O3S    null    null    null    349.20902642757801  null    null    123.805954255979998 null    null    null    null    null    null    39.4614114117431    39.4614114117431    0.0 null    [M - H2O + H]+  C17H28BN3O2S    1   19  0.904773718693427   3.793037970198739   5.609411499914324   -5.157935070801511e04   745 746 id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
SML null    C15H26FN3O2S    null    null    null    349.20902642757801  null    null    123.805954255979998 null    null    null    null    null    null    35.360942231040987  35.360942231040987  0.0 null    [M + H3N + H]+  C15H29FN4O2S    2   17  0.824253005426175   1.643124931790097   7.92227946447542    6.370360784334207   745 746 id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
SML null    C15H31FN4O3S    null    null    null    349.20902642757801  null    null    123.805954255979998 null    null    null    null    null    null    35.360942231040987  35.360942231040987  0.0 null    [M - H2O + H]+  C15H29FN4O2S    2   18  0.824253005426175   1.643124931790097   7.92227946447542    -5.156894219126724e04   745 746 id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746

sneumann commented 3 years ago

Hi, my quick guess is that this issue should go to the pyteomics team. From the description above I don't fully understand the problem; please be a bit more specific about what you'd expect. It looks like row 3, "SML identifier chemical_formula smiles inchi_ke...", was imported by your script? Are you missing the MTD information in your data frame? Yours, Steffen

eeko-kon commented 3 years ago

Hi Steffen,

Most importantly, I am missing the SML data. I will contact the pyteomics team, thank you!

sneumann commented 3 years ago

Ah, I hadn't realised that should have been in the print() output. Does pyteomics load the examples at https://github.com/HUPO-PSI/mzTab/tree/master/examples/1_0-Proteomics-Release fine? Particularly the lipidomics and faahKO examples? Yours, Steffen

eeko-kon commented 3 years ago

It doesn't. I am getting exactly the same output. Best, Efi

eeko-kon commented 3 years ago

I was converting the parsed file to a dataframe for no reason. The MzTab object already holds the tables, so the correct way is simply to call:

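import pyteomics
from pyteomics import mztab

filename = "./wf_testing/out_sirius_test.mzTab"
# With table_format='df', each table of the parsed file is a pandas DataFrame,
# so no extra pd.DataFrame(...) call is needed.
sirius = pyteomics.mztab.MzTab(filename, encoding='UTF8', table_format='df')
sirius.metadata              # values from the MTD lines
sirius.small_molecule_table  # SMH header plus SML rows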

All good :)
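
For readers wondering why the metadata and the small-molecule table end up as separate attributes, the sketch below shows how the line prefixes in an mzTab 1.0 file (MTD, SMH, SML) map onto separate sections. It is a minimal standard-library illustration, not how pyteomics is implemented: the function name split_mztab_sections and the shortened SAMPLE text are hypothetical, and the sample keeps only a few of the columns from the Sirius excerpt above.

```python
# Hypothetical, shortened excerpt in the same layout as the Sirius file above.
# mzTab is tab-separated; only three of the SMH columns are kept for brevity.
SAMPLE = (
    "MTD\tmzTab-version\t1.0.0\n"
    "MTD\tdescription\tSirius-4.6.0\n"
    "\n"
    "SMH\tidentifier\tchemical_formula\texp_mass_to_charge\n"
    "SML\tnull\tC17H25BN2O2S\t349.20902642757801\n"
    "SML\tnull\tC17H28BN3O2S\t349.20902642757801\n"
)


def split_mztab_sections(text):
    """Return (metadata dict, list of small-molecule row dicts).

    Illustrative only: handles just the MTD, SMH and SML prefixes.
    """
    metadata = {}
    header = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line separating the sections
        fields = line.split("\t")
        prefix, values = fields[0], fields[1:]
        if prefix == "MTD" and len(values) >= 2:
            # MTD lines are key/value pairs
            metadata[values[0]] = values[1]
        elif prefix == "SMH":
            # SMH is the column header for the SML rows that follow
            header = values
        elif prefix == "SML" and header is not None:
            rows.append(dict(zip(header, values)))
    return metadata, rows


metadata, small_molecules = split_mztab_sections(SAMPLE)
print(metadata["description"])
print(small_molecules[0]["chemical_formula"])
```

In this sketch the SMH line only supplies column names, which is why it never appears as a row of its own; the MTD lines feed a separate key/value mapping, roughly mirroring the split between sirius.metadata and sirius.small_molecule_table in the script above.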