Closed jalew188 closed 2 years ago
All readers can be accessed through `alphabase.io.psm_reader.psm_reader_provider`.
There are two methods to get a reader by its name:
- `psm_reader_provider.get_reader(reader_name:str, *, column_mapping=dict, modification_mapping=dict)`: `reader_name` can be `alphapept`, `maxquant`, `pfind`, `spectronaut`, or `diann`. For example, `psm_reader_provider.get_reader('alphapept', column_mapping=alphapept_column_dict, modification_mapping=alphapept_mod_dict)`.
- `psm_reader_provider.get_reader_by_yaml(yaml_dict:dict)`: `yaml_dict` must contain {reader_type[str], column_mapping[dict], modification_mapping[str]}. Several reader dicts are already defined in `alphabase.io.psm_reader.psm_reader.yaml`.
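To make the name-based lookup concrete, here is a minimal, self-contained sketch of how a provider of this kind can work. `ToyReader`, `ToyReaderProvider`, `register_reader`, and the case-insensitive lookup are illustrative assumptions, not alphabase's actual implementation; only the `get_reader` and `get_reader_by_yaml` call shapes follow the signatures quoted above.

```python
from typing import Dict


class ToyReader:
    """Stand-in for a PSM reader; it only stores its two mappings."""

    def __init__(self, *, column_mapping=None, modification_mapping=None):
        self.column_mapping = column_mapping
        self.modification_mapping = modification_mapping


class ToyReaderProvider:
    """Maps reader names to reader classes (illustrative only)."""

    def __init__(self):
        self._readers: Dict[str, type] = {}

    def register_reader(self, reader_name: str, reader_class: type) -> None:
        # Assumption: names are stored lower-cased so lookup ignores case.
        self._readers[reader_name.lower()] = reader_class

    def get_reader(self, reader_name: str, *, column_mapping=None,
                   modification_mapping=None):
        reader_class = self._readers[reader_name.lower()]
        return reader_class(
            column_mapping=column_mapping,
            modification_mapping=modification_mapping,
        )

    def get_reader_by_yaml(self, yaml_dict: dict):
        # Same lookup, with all three settings taken from one dict.
        return self.get_reader(
            yaml_dict["reader_type"],
            column_mapping=yaml_dict["column_mapping"],
            modification_mapping=yaml_dict["modification_mapping"],
        )


provider = ToyReaderProvider()
provider.register_reader("maxquant", ToyReader)

# Column names here are placeholders chosen for the example.
reader = provider.get_reader(
    "MaxQuant", column_mapping={"sequence": "Sequence"}
)
yaml_reader = provider.get_reader_by_yaml({
    "reader_type": "maxquant",
    "column_mapping": {"sequence": "Sequence"},
    "modification_mapping": "maxquant",  # passed through unchanged
})
```

The point of the design is that callers only need a reader name (or one dict loaded from YAML) and never import the concrete reader classes themselves.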
PSMReaderBase
`PSMReaderBase` is the abstract base class for all readers. It defines the basic procedure for importing other search engines' results into the AlphaBase format.

The main entry method is `import_file(filename)`, which generates `self._psm_df` (also available as the property `self.psm_df`).

In `import_file()`, five steps load a result file into the AlphaBase format:

1. `origin_df = self._load_file(filename)`. Load the result file into a dataframe without any conversion. Search engines use different file formats, and some of them are not tabular, so all subclasses of `PSMReaderBase` need to re-implement this method.
2. `self._translate_columns(origin_df)`. Translate the columns in `origin_df` into AlphaBase columns using `self.column_mapping`. `self.column_mapping` gives developers a flexible way to extract the columns they need.
3. `self._load_modifications(origin_df)`. Search engines represent modifications differently, so this method extracts the modifications into `self._psm_df['mods']` and `self._psm_df['mod_sites']`. Note that the modification names are still in the search engine's format at this point. All subclasses of `PSMReaderBase` need to re-implement this method.
4. `self._translate_modifications()`. Convert the modification names into AlphaBase names (`unimod_name@AA`). For most search engines, we need a dict (`self.modification_mapping`) to map the search engine's modification format into the AlphaBase format (`unimod_name@AA`, where `unimod_name` is the modification name in Unimod). Subclasses of `PSMReaderBase` need to re-implement this method.
5. `self._post_process(filename, origin_df)`. Any required post-processing steps. For example, unknown modifications are removed here.

Other results must be converted into the AlphaBase dataframe with the following required columns:
- `sequence` (str): AA sequence, for example, 'ATMYPEDR'.
- `mods` (str): modification names, separated by ';'. For example, 'Oxidation@M', 'Acetyl@Protein N-term;Oxidation@M'.
- `mod_sites` (str): modification sites, separated by ';'. For example, '3', '0;3'. The N-term site is 0, the C-term site is -1, and all other modification sites start from 1.
- `nAA` (int): number of AAs in the sequence; can be set by `df['nAA']=df.sequence.str.len()`.
- `charge` (int): precursor charge state.
- `rt` (float): retention time (RT) of the peptide, in minutes by default.
- `rt_norm` (float): RT normalized by the maximum value; can be set by `df['rt_norm'] = df.rt/df.rt.max()`.

and optional columns:
- `ccs` (float): collisional cross section (CCS) value, required for IM data.
- `mobility` (float): precursor ion mobility value, required for IM data.
- `precursor_mz` (float): precursor m/z value.
- `proteins` (str): protein names, separated by ';'.
- `genes` (str): gene names, separated by ';'.
- `protein_ids` (str): protein ids or UniProt ids, separated by ';'.
- `score` (float): PSM score. A larger score means a better PSM, so `E-value` or `P-value` scores must be converted with `-log()`.
- `fdr` (float): FDR or q-value.
- `raw_name` (str): raw file name.
- `spec_idx` (int): scan number in Thermo RAW data, or spectrum index for other RAW data. It can be used to locate the MS2 spectrum for identification.
- `query_id` (int or str): a unique id that covers not only the unique spectrum (`spec_idx`) but also the precursor or MS1 isotope index. It could be `query_idx` in alphapept.
- `decoy`: 0 if the peptide is a target match, otherwise 1.

All built-in readers:
alphabase.io.alphapept_reader.AlphaPeptReader
alphabase.io.maxquant_reader.MaxQuantReader
alphabase.io.pfind_reader.pFindReader
alphabase.io.dia_search_reader.DiannReader
alphabase.io.dia_search_reader.SpectronautReader
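The column conventions described above (`mods` and `mod_sites` pairing, site numbering, `nAA`, `rt_norm`, and `-log()` scores) can be checked with a small standard-library sketch. Rows are plain dicts here rather than a dataframe, the second peptide and the `e_value` field are made up for the example, and base-10 is an assumption because the text only says `-log()`.

```python
import math


def mod_positions(sequence: str, mods: str, mod_sites: str):
    """Pair each modification name with the position it sits on.

    Site 0 is the N-terminus, -1 is the C-terminus, and 1..nAA are residues.
    """
    if not mods:
        return []
    pairs = []
    for name, site_text in zip(mods.split(";"), mod_sites.split(";")):
        site = int(site_text)
        if site == 0:
            where = "N-term"
        elif site == -1:
            where = "C-term"
        else:
            where = sequence[site - 1]  # residue sites are 1-based
        pairs.append((name, site, where))
    return pairs


psms = [
    {"sequence": "ATMYPEDR", "mods": "Acetyl@Protein N-term;Oxidation@M",
     "mod_sites": "0;3", "charge": 2, "rt": 12.5, "e_value": 1e-5},
    {"sequence": "PEPTIDEK", "mods": "", "mod_sites": "",
     "charge": 3, "rt": 50.0, "e_value": 1e-2},
]

max_rt = max(psm["rt"] for psm in psms)
for psm in psms:
    psm["nAA"] = len(psm["sequence"])            # like df.sequence.str.len()
    psm["rt_norm"] = psm["rt"] / max_rt          # like df.rt / df.rt.max()
    psm["score"] = -math.log10(psm["e_value"])   # larger score = better PSM

first_mods = mod_positions(
    psms[0]["sequence"], psms[0]["mods"], psms[0]["mod_sites"]
)
```

In `first_mods`, the `Oxidation@M` entry at site 3 resolves to the `M` residue of 'ATMYPEDR', which is a quick sanity check that a reader produced consistent `mods` and `mod_sites` strings.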
For a new reader, for example `class NewSoftwareReader(PSMReaderBase):`, we just need to define:

- `def __init__(self, *, column_mapping:dict=None, modification_mapping:dict=None)`. Here `column_mapping` is a dict with key=AlphaBase column, value=new software column, and `modification_mapping` is a dict with key=AlphaBase modification name, value=new software modification.
- `def _load_file(self, file_name)->pd.DataFrame`: tells AlphaBase how to read the new software's results into a `pandas.DataFrame`.

Normally, the basic methods defined in `PSMReaderBase` handle all other steps.
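As a rough illustration of the five-step `import_file()` flow and of what a new reader overrides, the sketch below uses a simplified stand-in base class. `ToyPSMReaderBase` is not alphabase's `PSMReaderBase`: it keeps rows as dicts instead of a dataframe, and the input columns (`Peptide`, `Z`, `Mods`) and the `site,name` modification format are hypothetical. The subclass overrides `_load_file` and `_load_modifications`, the two steps the description says subclasses re-implement.

```python
import csv
import os
import tempfile


class ToyPSMReaderBase:
    """Simplified stand-in for PSMReaderBase; rows are dicts."""

    def __init__(self, *, column_mapping=None, modification_mapping=None):
        self.column_mapping = column_mapping or {}
        self.modification_mapping = modification_mapping or {}
        self.psm_rows = []

    def import_file(self, filename):
        origin_rows = self._load_file(filename)      # step 1
        self._translate_columns(origin_rows)         # step 2
        self._load_modifications(origin_rows)        # step 3
        self._translate_modifications()              # step 4
        self._post_process(filename, origin_rows)    # step 5
        return self.psm_rows

    def _load_file(self, filename):
        raise NotImplementedError

    def _translate_columns(self, origin_rows):
        # column_mapping: key = AlphaBase column, value = software column.
        self.psm_rows = [
            {ab: row[sw] for ab, sw in self.column_mapping.items()}
            for row in origin_rows
        ]

    def _load_modifications(self, origin_rows):
        raise NotImplementedError

    def _translate_modifications(self):
        # modification_mapping: key = AlphaBase name, value = software name.
        to_alphabase = {sw: ab for ab, sw in self.modification_mapping.items()}
        for row in self.psm_rows:
            names = row["mods"].split(";") if row["mods"] else []
            translated = [to_alphabase.get(name) for name in names]
            # None marks a PSM with a modification that has no mapping.
            row["mods"] = None if None in translated else ";".join(translated)

    def _post_process(self, filename, origin_rows):
        # Remove PSMs with unknown modifications, then add nAA.
        self.psm_rows = [r for r in self.psm_rows if r["mods"] is not None]
        for row in self.psm_rows:
            row["nAA"] = len(row["sequence"])


class NewSoftwareReader(ToyPSMReaderBase):
    def _load_file(self, filename):
        # Values stay strings, exactly as csv reads them.
        with open(filename, newline="") as handle:
            return list(csv.DictReader(handle, delimiter="\t"))

    def _load_modifications(self, origin_rows):
        # Hypothetical format: "3,Ox" is "Ox" at site 3; "|" separates entries.
        for row, origin in zip(self.psm_rows, origin_rows):
            entries = [e.split(",") for e in origin["Mods"].split("|") if e]
            row["mod_sites"] = ";".join(site for site, _ in entries)
            row["mods"] = ";".join(name for _, name in entries)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.tsv")
    with open(path, "w", newline="") as handle:
        handle.write("Peptide\tZ\tMods\n")
        handle.write("ATMYPEDR\t2\t3,Ox\n")
        handle.write("PEPTIDEK\t3\t\n")
        handle.write("ACDEFGHK\t2\t2,Mystery\n")

    reader = NewSoftwareReader(
        column_mapping={"sequence": "Peptide", "charge": "Z"},
        modification_mapping={"Oxidation@M": "Ox"},
    )
    psm_rows = reader.import_file(path)
```

The third input row carries a modification with no entry in `modification_mapping`, so the post-processing step drops it and `psm_rows` keeps only the first two peptides.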