Open jcharkow opened 1 year ago
This will be a super useful addition, thanks for working on this. We should try to replace all data intensive output tables with parquet files and implement plug-and-play support in downstream applications, such as DIAlignR.
Issue
Some data scientists are not as familiar or comfortable with examining files in SQLite-based formats, but require more information than the TSV export provides (e.g. OpenSwathScores).
Apache Parquet as an alternative export
Parquet is a column-oriented storage format that stores data efficiently because values are compressed column-wise. The exported file is around the same size as the .osw file. The format is easily parsed in Python using the pandas `read_parquet` function or the pyarrow package, after which the data can be manipulated in a pandas DataFrame. Because of the columnar storage, Parquet shines when only a few of the columns need to be accessed.
Implementation
In this implementation each row corresponds to a single feature and each column is an attribute of that feature. The attributes are easily mappable back to the .osw file, as columns follow the general naming convention `<OSW_Table_Name>.<OSW_COLUMN_NAME>`. There are a few exceptions to this naming convention for ID columns (e.g. `PRECURSOR_ID`). Precursors without a corresponding feature are kept, with the FEATURE columns left as NaN.
Alternatively, if the `--transition_level` flag is specified, then transition-level data is also exported (this takes much longer and is more memory-intensive in the current implementation).
There are also some helper columns which can be used for filtering:
- `PRECURSOR_MASK` - filters data such that there is one row per precursor (feature information corresponds to the top-ranked feature if possible)
- `FEATURE_MASK` - filters data such that there is one row per feature (filters out precursors with no matching feature)
- `TOP_FEATURE_MASK` - filters data such that each row is a feature with a `SCORE_MS2.RANK` of 1.
Dependencies
This requires the additional dependency of the `pyarrow` package.
Current limitations
Currently IPF exporting is not supported. Exporting has only been tested with ion mobility data.
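As an illustration of the helper columns described under Implementation, the sketch below applies each mask to a small in-memory table. The data, the `FEATURE_ID` column name, and the assumption that the masks are stored as booleans (or 0/1 values castable to booleans) are all hypothetical; the real export may differ.

```python
import numpy as np
import pandas as pd

# Toy table mimicking the described layout; values are made up.
df = pd.DataFrame({
    "PRECURSOR_ID": [1, 1, 2, 3],
    "FEATURE_ID": [10, 11, 20, np.nan],       # assumed name; precursor 3 has no feature
    "SCORE_MS2.RANK": [1, 2, 1, np.nan],
    "PRECURSOR_MASK": [True, False, True, True],
    "FEATURE_MASK": [True, True, True, False],
    "TOP_FEATURE_MASK": [True, False, True, False],
})

# astype(bool) keeps the filtering valid if the masks are stored as 0/1.
per_precursor = df[df["PRECURSOR_MASK"].astype(bool)]   # one row per precursor
per_feature = df[df["FEATURE_MASK"].astype(bool)]       # one row per feature
top_features = df[df["TOP_FEATURE_MASK"].astype(bool)]  # rank-1 features only

# The <table>.<column> naming convention makes it easy to pick out
# every column that originates from one .osw table.
score_ms2_cols = [c for c in df.columns if c.startswith("SCORE_MS2.")]

print(per_precursor)
print(per_feature)
print(top_features)
print(score_ms2_cols)
```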