MannLabs / alphabase

Infrastructure of AlphaX ecosystem
https://alphabase.readthedocs.io
Apache License 2.0

PSM reader #11

Closed jalew188 closed 2 years ago

jalew188 commented 2 years ago

PSMReaderBase

PSMReaderBase is the base abstract class for all readers. It defines the basic procedures for importing other search engine results into AlphaBase format.

The main entry point is import_file(filename). After it returns, the imported PSMs are available in self._psm_df (also exposed as the property self.psm_df).

In the import_file() method, we designed five steps to load result files into AlphaBase format:

  1. origin_df = self._load_file(filename). We load the result file into a dataframe without doing any conversion. Because search engines use different file formats, and some of them are not tabular, all subclasses of PSMReaderBase need to re-implement this method.

  2. self._translate_columns(origin_df). We translate columns in origin_df into AlphaBase columns by self.column_mapping. self.column_mapping provides a flexible way for developers to extract their required columns.

  3. self._load_modifications(origin_df). Different search engines represent modifications differently, so we use this method to extract the modifications into self._psm_df['mods'] and self._psm_df['mod_sites']. Note that the modification names are still in the search engine's format at this point. All subclasses of PSMReaderBase need to re-implement this method.

  4. self._translate_modifications(). Convert modification names into AlphaBase names (unimod_name@AA). For most search engines, we need a dict (self.modification_mapping) to map the search engine's modification format to the AlphaBase format (unimod_name@AA, where unimod_name is the name in the unimod xml file). All subclasses of PSMReaderBase need to re-implement this method.

  5. self._post_process(filename, origin_df). Any required post-processing steps. For example, we remove unknown modifications here.
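
The five steps above can be sketched as a small template-method class. This is a minimal illustration, not the AlphaBase implementation: ToyReaderBase, ToyTsvReader, the "site,name" modification format, and the input column names (Peptide, Charge, RT, Mods) are all hypothetical. Only the method names and their order come from the description above.

```python
import io

import pandas as pd


class ToyReaderBase:
    """Stand-in for PSMReaderBase (hypothetical, not the AlphaBase code)."""

    def __init__(self, *, column_mapping=None, modification_mapping=None):
        # key = AlphaBase column, value = search engine column
        self.column_mapping = column_mapping or {}
        # key = AlphaBase modification name, value = search engine name
        self.modification_mapping = modification_mapping or {}
        self._psm_df = pd.DataFrame()

    @property
    def psm_df(self):
        return self._psm_df

    def import_file(self, filename):
        origin_df = self._load_file(filename)    # step 1
        self._translate_columns(origin_df)       # step 2
        self._load_modifications(origin_df)      # step 3
        self._translate_modifications()          # step 4
        self._post_process(filename, origin_df)  # step 5
        return self._psm_df

    def _load_file(self, filename):
        raise NotImplementedError

    def _translate_columns(self, origin_df):
        # Copy only the mapped columns, renamed to AlphaBase names.
        self._psm_df = pd.DataFrame({
            ab_col: origin_df[col]
            for ab_col, col in self.column_mapping.items()
            if col in origin_df.columns
        })

    def _load_modifications(self, origin_df):
        raise NotImplementedError

    def _translate_modifications(self):
        raise NotImplementedError

    def _post_process(self, filename, origin_df):
        # Remove PSMs with unknown modifications (marked None in step 4).
        keep = self._psm_df["mods"].notna()
        self._psm_df = self._psm_df[keep].reset_index(drop=True)


class ToyTsvReader(ToyReaderBase):
    """Reader for a made-up TSV format with 'site,name' modifications."""

    def _load_file(self, filename):
        # keep_default_na=False keeps empty modification cells as ''.
        return pd.read_csv(filename, sep="\t", keep_default_na=False)

    def _load_modifications(self, origin_df):
        mods, sites = [], []
        for entry in origin_df["Mods"]:
            pairs = [item.split(",", 1) for item in entry.split(";") if item]
            sites.append(";".join(site for site, _ in pairs))
            mods.append(";".join(name for _, name in pairs))
        self._psm_df["mods"] = mods
        self._psm_df["mod_sites"] = sites

    def _translate_modifications(self):
        to_alphabase = {v: k for k, v in self.modification_mapping.items()}

        def translate(mods):
            if mods == "":
                return ""
            names = [to_alphabase.get(m) for m in mods.split(";")]
            return None if None in names else ";".join(names)

        self._psm_df["mods"] = self._psm_df["mods"].apply(translate)


tsv = (
    "Peptide\tCharge\tRT\tMods\n"
    "ATMYPEDR\t2\t10.5\t3,ox(M)\n"
    "PEPTIDEK\t3\t20.0\t\n"
    "ACDEFGHK\t2\t30.0\t2,weird(C)\n"
)
reader = ToyTsvReader(
    column_mapping={"sequence": "Peptide", "charge": "Charge", "rt": "RT"},
    modification_mapping={"Oxidation@M": "ox(M)"},
)
psm_df = reader.import_file(io.StringIO(tsv))
print(psm_df)
```

In this sketch the third PSM is dropped in step 5 because its modification has no entry in modification_mapping.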

Results from other search engines must be converted into an AlphaBase dataframe with the following required columns:

  1. sequence (str): AA sequence, for example, 'ATMYPEDR'.
  2. mods (str): modification names, separated by ';'. For example, 'Oxidation@M', 'Acetyl@Protein N-term;Oxidation@M'.
  3. mod_sites (str): modification sites, separated by ';'. For example, '3', '0;3'. The N-term site is 0, the C-term site is -1, and all other modification sites start from 1.
  4. nAA (int): number of AA in the sequence, could be set by df['nAA']=df.sequence.str.len().
  5. charge (int): precursor charge states.
  6. rt (float): retention time (RT) of peptides, in minutes by default.
  7. rt_norm (float): RT normalized by the maximum value, could be set by df['rt_norm'] = df.rt/df.rt.max().

    and optional columns:

  8. ccs (float): collisional cross section (CCS) value, required for IM data.
  9. mobility (float): precursor ion mobility value, required for IM data.
  10. precursor_mz (float): precursor m/z value.
  11. proteins (str): protein names, separated by ';'.
  12. genes (str): gene names, separated by ';'.
  13. protein_ids (str): protein ids or uniprot ids, separated by ';'.
  14. score (float): PSM score. Larger values mean better PSMs, so E-value or P-value scores must be converted with -log().
  15. fdr (float): FDR or q-value.
  16. raw_name (str): Raw file name.
  17. spec_idx (int): scan number in Thermo RAW data, or spectrum index for other RAW data. We can use it to locate the MS2 spectrum for identification.
  18. query_id (int or str): a unique id that identifies not only the spectrum (spec_idx) but also the precursor or MS1 isotope index. It could be query_idx in alphapept.
  19. decoy: 0 if the peptide is a target match, otherwise 1.
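
As a rough illustration of the column conventions above, the following sketch builds a tiny dataframe, derives nAA and rt_norm as described, and decodes the mods/mod_sites pair. The e_value column, the choice of base 10 for -log(), and the helper list_mods are assumptions made for this example and are not part of AlphaBase.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sequence": ["ATMYPEDR", "ATMYPEDR"],
    "mods": ["Oxidation@M", "Acetyl@Protein N-term;Oxidation@M"],
    "mod_sites": ["3", "0;3"],
    "charge": [2, 3],
    "rt": [12.0, 48.0],        # minutes
    "e_value": [1e-3, 1e-6],   # hypothetical search engine column
})

# Derived required columns, as given in the column list.
df["nAA"] = df.sequence.str.len()
df["rt_norm"] = df.rt / df.rt.max()

# Larger score = better PSM, so an E-value is turned into -log10(E-value).
# Base 10 is an assumption; the text only says "-log()".
df["score"] = -np.log10(df["e_value"])


def list_mods(sequence, mods, mod_sites):
    """Return (mod_name, site, residue) tuples for one PSM (hypothetical helper)."""
    if not mods:
        return []
    result = []
    for name, site in zip(mods.split(";"), mod_sites.split(";")):
        site = int(site)
        if site == 0:
            residue = "N-term"
        elif site == -1:
            residue = "C-term"
        else:
            residue = sequence[site - 1]  # sites on residues are 1-based
        result.append((name, site, residue))
    return result


print(df[["sequence", "nAA", "rt_norm", "score"]])
print(list_mods("ATMYPEDR", "Acetyl@Protein N-term;Oxidation@M", "0;3"))
```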

All built-in readers:

  1. alphabase.io.alphapept_reader.AlphaPeptReader
  2. alphabase.io.maxquant_reader.MaxQuantReader
  3. alphabase.io.pfind_reader.pFindReader
  4. alphabase.io.dia_search_reader.DiannReader
  5. alphabase.io.dia_search_reader.SpectronautReader

For new readers, for example class NewSoftwareReader(PSMReaderBase):, we just need to define:

  1. def __init__(self, *, column_mapping:dict=None, modification_mapping:dict=None). Here column_mapping is a dict with key=AlphaBase column, value=new software column, and modification_mapping is a dict with key=AlphaBase modification name, value=new software modification.

  2. def _load_file(self, file_name)->pd.DataFrame: tell alphabase how to read the new software results into pandas.DataFrame.

And normally, basic methods defined in PSMReaderBase will handle all other steps.
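
To make the direction of the two mappings concrete, here is a small sketch with plain dicts. The new software's column names (PeptideSequence, Z, RetentionTime, RawFile) and modification strings (M(ox), C(cam)) are invented for illustration, and the invert helper is not an AlphaBase function.

```python
# key = AlphaBase column, value = column in the (hypothetical) new software
column_mapping = {
    "sequence": "PeptideSequence",
    "charge": "Z",
    "rt": "RetentionTime",
    "raw_name": "RawFile",
}

# key = AlphaBase modification name, value = name in the new software
modification_mapping = {
    "Oxidation@M": "M(ox)",
    "Carbamidomethyl@C": "C(cam)",
}


def invert(mapping):
    """Turn {alphabase_name: software_name} into {software_name: alphabase_name}."""
    return {software: alphabase for alphabase, software in mapping.items()}


# One result row as the new software would report it.
software_row = {
    "PeptideSequence": "ATMYPEDR",
    "Z": 2,
    "RetentionTime": 10.5,
    "RawFile": "run1",
    "Extra": "ignored",  # unmapped columns are simply not carried over
}

to_alphabase = invert(column_mapping)
alphabase_row = {
    to_alphabase[name]: value
    for name, value in software_row.items()
    if name in to_alphabase
}
print(alphabase_row)
print(invert(modification_mapping)["M(ox)"])
```

Keeping the AlphaBase name as the key means a reader can be pointed at differently named input columns by editing only the values.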

jalew188 commented 2 years ago

All readers can be accessed through alphabase.io.psm_reader.psm_reader_provider.

There are two ways to get a reader:

  1. psm_reader_provider.get_reader(reader_name:str, *, column_mapping=dict, modification_mapping=dict). reader_name can be one of: alphapept, maxquant, pfind, spectronaut, diann. For example, psm_reader_provider.get_reader('alphapept', column_mapping=alphapept_column_dict, modification_mapping=alphapept_mod_dict).
  2. psm_reader_provider.get_reader_by_yaml(yaml_dict:dict). yaml_dict must contain {reader_type[str], column_mapping[dict], modification_mapping[str]}. We have defined several reader dicts in alphabase.io.psm_reader.psm_reader.yaml.
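
The provider pattern described above can be sketched as a name-to-class registry. This is only an illustration of the idea: ToyReader, ToyReaderProvider, and register_reader are hypothetical names for this example, and the sketch forwards the yaml_dict entries unchanged without covering how a string-valued modification_mapping is resolved.

```python
class ToyReader:
    """Placeholder reader that only stores its mappings (hypothetical)."""

    def __init__(self, *, column_mapping=None, modification_mapping=None):
        self.column_mapping = column_mapping
        self.modification_mapping = modification_mapping


class ToyReaderProvider:
    """Minimal registry mapping reader names to reader classes."""

    def __init__(self):
        self._reader_classes = {}

    def register_reader(self, reader_name, reader_class):
        self._reader_classes[reader_name] = reader_class

    def get_reader(self, reader_name, *,
                   column_mapping=None, modification_mapping=None):
        reader_class = self._reader_classes[reader_name]
        return reader_class(
            column_mapping=column_mapping,
            modification_mapping=modification_mapping,
        )

    def get_reader_by_yaml(self, yaml_dict):
        # Entries are passed through to get_reader() as they are.
        return self.get_reader(
            yaml_dict["reader_type"],
            column_mapping=yaml_dict["column_mapping"],
            modification_mapping=yaml_dict["modification_mapping"],
        )


provider = ToyReaderProvider()
provider.register_reader("toy", ToyReader)

reader_a = provider.get_reader("toy", column_mapping={"sequence": "Peptide"})
reader_b = provider.get_reader_by_yaml({
    "reader_type": "toy",
    "column_mapping": {"sequence": "Peptide"},
    "modification_mapping": "toy_mods",  # placeholder string value
})
print(type(reader_a).__name__, reader_b.modification_mapping)
```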