Parquet needs some standardization

ypriverol commented 6 years ago

We need to standardize the Parquet output format so that other people can understand the files.

bgruening commented 3 years ago

Yeah, that would be nice, and we should give it a proper name :)

sorenwacker commented 3 years ago

I like 'parquet', as it is pretty clear what library to use to open it.

Regarding column names, I had a few thoughts:

The column name Mass or Masses is technically wrong, since the values are m/z. Or do you convert the m/z values into masses internally?
Intensities could be Intensity, even though it holds an array.
RetentionTime was used in mzXML files; in mzML files I have seen it as ScanTime, which is a bit more general and may be more accurate, since it does not imply that a chromatographic step was used.
Things like TIC are convenient but somewhat redundant: it could be calculated in one line of code if the data were in long format.

df_long.groupby('scan_time_min')['intensity'].sum().plot()
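To make the long-format idea concrete, here is a minimal sketch. It assumes a hypothetical wide layout with one row per scan and array-valued columns. The column names `mz` and `intensity` and the toy values are made up for illustration and are not the parser's actual schema. Exploding several columns at once needs pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical wide layout: one row per scan, peak arrays stored per row.
# Column names and values are illustrative, not the parser's real output.
df_wide = pd.DataFrame({
    "scan_time_min": [0.1, 0.2, 0.3],
    "mz": [[100.0, 200.0], [100.0, 150.0, 200.0], [120.0]],
    "intensity": [[10.0, 5.0], [1.0, 2.0, 3.0], [7.0]],
})

# Long layout: one row per (scan, peak) pair.
df_long = df_wide.explode(["mz", "intensity"], ignore_index=True)

# explode() leaves object dtype behind, so cast back to floats.
df_long = df_long.astype({"mz": "float64", "intensity": "float64"})

# Total ion current per scan: sum of the intensities within each scan.
tic = df_long.groupby("scan_time_min")["intensity"].sum()
print(tic)
```

With the data in this shape, a stored TIC column would only duplicate what the groupby already gives.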

I am quite new to metabolomics/proteomics, though. I am looking at the problem more from a data-science, Python-biased perspective.

compomics / ThermoRawFileParser