PyProphet / pyprophet

PyProphet: Semi-supervised learning and scoring of OpenSWATH results.
http://www.openswath.org
BSD 3-Clause "New" or "Revised" License

Parquet Export #111

Open jcharkow opened 1 year ago

jcharkow commented 1 year ago

Issue

Some data scientists are not familiar or comfortable with examining files in SQLite-based formats, yet they need more information than the TSV export provides (e.g. OpenSwathScores).

Apache Parquet as an alternative export

Parquet is a columnar storage format that allows efficient storage because data is compressed column-wise. The exported file is around the same size as the .osw file. The format can be easily parsed in Python using the pandas read_parquet command or the pyarrow package, after which the data can be manipulated in a pandas dataframe.

Because of the columnar storage, the Parquet format shines when only a few of the columns need to be accessed.

Implementation

In this implementation each row corresponds to a single feature and each column is an attribute of that feature. The attributes are easily mappable back to the .osw file, as columns follow the general naming convention of <OSW_Table_Name>.<OSW_COLUMN_NAME>. There are a few exceptions to this naming convention for ID columns (e.g. PRECURSOR_ID). For precursors without a corresponding feature, the FEATURE columns are left as NaN.
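The naming convention can be used to trace a column back to its .osw table. The helper below is a hypothetical sketch, not part of PyProphet: it splits a name at the first dot and treats names without a dot (such as PRECURSOR_ID) as the ID-column exceptions. The example column names are assumptions for illustration.

```python
def split_osw_column(name):
    """Split 'TABLE.COLUMN' into (table, column).

    Names without a dot (e.g. ID columns) return (None, name).
    """
    table, sep, column = name.partition(".")
    if not sep:
        return None, name
    return table, column


# Illustrative column names; the real export may contain different ones.
columns = ["PRECURSOR_ID", "FEATURE.EXP_RT", "SCORE_MS2.RANK"]
parsed = {c: split_osw_column(c) for c in columns}

for original, (table, column) in parsed.items():
    print(original, "->", table, column)
```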

Alternatively, if the --transition_level flag is specified, transition-level data is also exported (this takes much longer and is more memory intensive in the current implementation).

There are also some helper columns which can be used for filtering:

PRECURSOR_MASK - filters data such that there is one row per precursor (feature information corresponds with the top rank feature if possible)

FEATURE_MASK - filters data such that there is one row per feature (filters out precursors with no matching feature)

TOP_FEATURE_MASK - filters data such that each row is a feature with a SCORE_MS2.RANK of 1
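A minimal sketch of how the helper columns might be used, assuming they are stored as boolean columns where True means "keep this row" (the issue does not state the dtype). The toy table is invented for illustration: precursor 1 has two features, precursor 2 has one, and precursor 3 has none.

```python
import numpy as np
import pandas as pd

# Invented toy data; FEATURE_ID and all values are assumptions.
df = pd.DataFrame({
    "PRECURSOR_ID":     [1,    1,     2,    3],
    "FEATURE_ID":       [10,   11,    20,   np.nan],
    "SCORE_MS2.RANK":   [1,    2,     1,    np.nan],
    "PRECURSOR_MASK":   [True, False, True, True],
    "FEATURE_MASK":     [True, True,  True, False],
    "TOP_FEATURE_MASK": [True, False, True, False],
})

# One row per precursor (top-ranked feature where one exists).
per_precursor = df[df["PRECURSOR_MASK"]]

# One row per feature (precursors without a feature are dropped).
per_feature = df[df["FEATURE_MASK"]]

# Only features with a SCORE_MS2.RANK of 1.
top_features = df[df["TOP_FEATURE_MASK"]]

print(len(per_precursor), len(per_feature), len(top_features))
```

Boolean indexing keeps the original columns intact, so the same dataframe can be viewed at precursor or feature level without re-reading the file.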

Dependencies

This requires the pyarrow package as an additional dependency.

Current limitations

Currently IPF exporting is not supported. Exporting has only been tested with ion mobility data.


grosenberger commented 1 year ago

This will be a super useful addition, thanks for working on this. We should try to replace all data-intensive output tables with Parquet files and implement plug-and-play support in downstream applications, such as DIAlignR.