PSMReaderBase

PSMReaderBase is the abstract base class for all readers. It defines the basic procedures for importing results from other search engines into the AlphaBase format.

The main entry method is import_file(filename). After it is called, the result is available as self._psm_df (or through the property self.psm_df).

In the import_file() method, we designed five steps to load result files into the AlphaBase format:

origin_df = self._load_file(filename). We load the result file into a dataframe without doing any conversion. Because search engines use different file formats, and some of them are not tabular, all subclasses of PSMReaderBase need to re-implement this method.
self._translate_columns(origin_df). We translate the columns of origin_df into AlphaBase columns using self.column_mapping, which gives developers a flexible way to extract the columns they need.
self._load_modifications(origin_df). Different search engines represent modifications differently, so we use this method to extract the modifications into self._psm_df['mods'] and self._psm_df['mod_sites']. Note that at this point the modification names are still in the search engine's own format. All subclasses of PSMReaderBase need to re-implement this method.
self._translate_modifications(). Convert modification names into AlphaBase names (unimod_name@AA). For most search engines, we need a dict (self.modification_mapping) to map the search engine's modification format to the AlphaBase format (unimod_name@AA, where unimod_name is the name in the unimod xml file). All subclasses of PSMReaderBase need to re-implement this method.
self._post_process(filename, origin_df). Any required post-processing steps. For example, we remove unknown modifications here.
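The five steps above can be sketched as a small template-method class. This is only an illustration under stated assumptions, not the AlphaBase implementation: the class name SketchReader, the toy tab-separated input, its column names (Peptide, Z, RT, Mods, Sites) and the label 'ox(M)' are all made up. Only the method names and their order come from the description above.

```python
import io

import pandas as pd


class SketchReader:
    """Toy reader mirroring the five import steps (hypothetical, not AlphaBase code)."""

    def __init__(self):
        # key = AlphaBase column, value = column in the toy input (assumed names)
        self.column_mapping = {"sequence": "Peptide", "charge": "Z", "rt": "RT"}
        # key = AlphaBase modification, value = toy engine label (assumed)
        self.modification_mapping = {"Oxidation@M": "ox(M)"}
        self._psm_df = pd.DataFrame()

    @property
    def psm_df(self):
        return self._psm_df

    def import_file(self, filename):
        origin_df = self._load_file(filename)
        self._translate_columns(origin_df)
        self._load_modifications(origin_df)
        self._translate_modifications()
        self._post_process(filename, origin_df)

    def _load_file(self, filename):
        # Step 1: read the raw table as-is; keep empty fields as empty strings.
        return pd.read_csv(
            filename,
            sep="\t",
            dtype={"Mods": str, "Sites": str},
            keep_default_na=False,
        )

    def _translate_columns(self, origin_df):
        # Step 2: pick the mapped columns and rename them to AlphaBase names.
        self._psm_df = pd.DataFrame(
            {ab_col: origin_df[col] for ab_col, col in self.column_mapping.items()}
        )

    def _load_modifications(self, origin_df):
        # Step 3: copy modifications, still in the engine's own format.
        self._psm_df["mods"] = origin_df["Mods"]
        self._psm_df["mod_sites"] = origin_df["Sites"]

    def _translate_modifications(self):
        # Step 4: engine label -> AlphaBase name; unknown labels become None.
        reverse = {label: name for name, label in self.modification_mapping.items()}

        def translate(mods):
            if mods == "":
                return ""
            names = [reverse.get(label) for label in mods.split(";")]
            return None if None in names else ";".join(names)

        self._psm_df["mods"] = self._psm_df["mods"].map(translate)

    def _post_process(self, filename, origin_df):
        # Step 5: drop PSMs with unknown modifications, then add nAA.
        known = self._psm_df["mods"].notna()
        self._psm_df = self._psm_df[known].reset_index(drop=True)
        self._psm_df["nAA"] = self._psm_df["sequence"].str.len()


# A file-like object stands in for a result file on disk.
TOY_RESULT = (
    "Peptide\tZ\tRT\tMods\tSites\n"
    "ATMYPEDR\t2\t10.5\tox(M)\t3\n"
    "PEPTIDEK\t3\t20.0\t\t\n"
    "ACDEFGHK\t2\t15.0\tfoo(C)\t2\n"
)
reader = SketchReader()
reader.import_file(io.StringIO(TOY_RESULT))
psm_df = reader.psm_df
```

The third toy row carries a label that is missing from modification_mapping, so it is removed in the post-processing step.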

Results from other search engines must be converted into an AlphaBase dataframe with the following required columns:

sequence (str): AA sequence, for example, 'ATMYPEDR'.
mods (str): modification names, separated by ';'. For example, 'Oxidation@M' or 'Acetyl@Protein N-term;Oxidation@M'.
mod_sites (str): modification sites, separated by ';'. For example, '3' or '0;3'. The N-term site is 0, the C-term site is -1, and all other modification sites start from 1.
nAA (int): number of amino acids (AA) in the sequence, could be set by df['nAA']=df.sequence.str.len().
charge (int): precursor charge state.
rt (float): retention time (RT) of peptides, in minutes by default.
rt_norm (float): RT normalized by the maximum value, could be set by df['rt_norm'] = df.rt/df.rt.max().
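As a minimal sketch of the required columns, the snippet below builds a small dataframe by hand and derives nAA and rt_norm with the two expressions given above. The peptide, charge and RT values are made up for illustration, and the helper pair_mods_with_sites is hypothetical; it only shows how the entries of mods and mod_sites line up one to one.

```python
import pandas as pd

# Hand-made PSMs; the values are illustrative only.
df = pd.DataFrame(
    {
        "sequence": ["ATMYPEDR", "ATMYPEDR", "PEPTIDEK"],
        "mods": ["Oxidation@M", "Acetyl@Protein N-term;Oxidation@M", ""],
        "mod_sites": ["3", "0;3", ""],
        "charge": [2, 2, 3],
        "rt": [10.0, 12.5, 25.0],
    }
)

# Derived required columns, as described in the column list.
df["nAA"] = df.sequence.str.len()
df["rt_norm"] = df.rt / df.rt.max()


def pair_mods_with_sites(mods, mod_sites):
    """Return [(mod_name, site), ...]; an unmodified peptide gives an empty list."""
    if not mods:
        return []
    return list(zip(mods.split(";"), (int(site) for site in mod_sites.split(";"))))


pairs = [pair_mods_with_sites(m, s) for m, s in zip(df.mods, df.mod_sites)]
```

An unmodified peptide is represented here by empty strings in both mods and mod_sites, which is an assumption of this sketch.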

Optional columns:

ccs (float): collisional cross section (CCS) value, required for IM data.
mobility (float): precursor ion mobility value, required for IM data.
precursor_mz (float): precursor m/z value.
proteins (str): protein names, separated by ';'.
genes (str): gene names, separated by ';'.
protein_ids (str): protein ids or uniprot ids, separated by ';'.
score (float): PSM score. Larger values indicate better PSMs, so E-value or P-value scores must be converted with -log().
fdr (float): FDR or q-value.
raw_name (str): Raw file name.
spec_idx (int): scan number in Thermo RAW data, or spectrum index for other RAW data. We can use it to locate the MS2 spectrum for identification.
query_id (int or str): a unique id that identifies not only the spectrum (spec_idx) but also the precursor or MS1 isotope index. It could be query_idx in AlphaPept.
decoy: 0 if the peptide is a target match, otherwise 1.
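Because score must follow "larger is better", an E-value or P-value has to be log-transformed and negated before it is stored. The sketch below assumes base 10 and a small floor to avoid taking the log of zero. Neither choice is specified above, and the function name evalue_to_score is hypothetical.

```python
import math


def evalue_to_score(e_value, floor=1e-300):
    """Map an E-value or P-value to a 'larger is better' score via -log10.

    The base (10) and the floor are assumptions of this sketch.
    """
    return -math.log10(max(e_value, floor))


e_values = [1e-10, 1e-3, 0.5]
scores = [evalue_to_score(e) for e in e_values]

# Smaller E-values now rank first when sorting by score in descending order.
ranked = sorted(zip(scores, e_values), reverse=True)
```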

All built-in readers:

alphabase.io.alphapept_reader.AlphaPeptReader
alphabase.io.maxquant_reader.MaxQuantReader
alphabase.io.pfind_reader.pFindReader
alphabase.io.dia_search_reader.DiannReader
alphabase.io.dia_search_reader.SpectronautReader

For a new reader, for example class NewSoftwareReader(PSMReaderBase), we just need to define:

def __init__(self, *, column_mapping:dict=None, modification_mapping:dict=None). Here, column_mapping is a dict with key=AlphaBase columns and value=new software columns; modification_mapping is a dict with key=AlphaBase modification name and value=new software modifications.
def _load_file(self, file_name)->pd.DataFrame. This tells AlphaBase how to read the new software's results into a pandas.DataFrame.

Normally, the basic methods defined in PSMReaderBase will handle all other steps.
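Both mapping dicts are keyed by the AlphaBase name, so translating values that come from the new software means looking them up in the reverse direction. The following is a minimal sketch of that lookup, not the way PSMReaderBase implements it. The software column names ('Peptide', 'Charge', 'Retention time'), the labels ('M(ox)', 'M[+16]', 'C(cam)'), the helper names, and the idea that one AlphaBase name may map to either a single label or a list of labels are all assumptions.

```python
# key = AlphaBase column, value = column in the new software's output (assumed)
column_mapping = {
    "sequence": "Peptide",
    "charge": "Charge",
    "rt": "Retention time",
}

# key = AlphaBase modification name, value = one label or a list of labels (assumed)
modification_mapping = {
    "Oxidation@M": ["M(ox)", "M[+16]"],
    "Carbamidomethyl@C": "C(cam)",
}


def reverse_modification_mapping(mapping):
    """Build {software label: AlphaBase name} from {AlphaBase name: label(s)}."""
    reverse = {}
    for alphabase_name, labels in mapping.items():
        if isinstance(labels, str):
            labels = [labels]
        for label in labels:
            reverse[label] = alphabase_name
    return reverse


def translate_row(row, mapping):
    """Keep only the mapped columns of one result row, renamed to AlphaBase names."""
    return {ab_col: row[sw_col] for ab_col, sw_col in mapping.items() if sw_col in row}


def translate_mods(mods, mapping):
    """Translate a ';'-separated label string; raises KeyError on unknown labels."""
    reverse = reverse_modification_mapping(mapping)
    return ";".join(reverse[label] for label in mods.split(";") if label)


row = {"Peptide": "ATMYPEDR", "Charge": 2, "Retention time": 10.5, "Extra": "x"}
alphabase_row = translate_row(row, column_mapping)
alphabase_mods = translate_mods("M[+16];C(cam)", modification_mapping)
```

Keying the dicts by the AlphaBase name keeps the set of target columns and modifications explicit, at the cost of building the reverse lookup once per reader.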

MannLabs / alphabase

PSM reader #11
