Add protein, peptide, psm io format

daichengxin commented 11 months ago

Hi all @daichengxin @jpfeuffer @timosachsenberg @lazear:

Here are some conclusions that are now reasonably clear:

1- For a better parquet definition, we can't base the design on _{}, the wide design. While the wide design is more compact, it adds complexity because readers need to know how each _{} combination maps to a specific property. For example, at the peptide level, if there is one peptide per row and each column holds the expression/abundance in a specific sample, and in a later iteration we want to track the number of PSMs per sample, then we need to create another _{} relation. The current (non-wide) format duplicates some fields, which we think can be avoided or mitigated by compression.

2- The present figure defines the relation between all the files in quantms.

psm: all the information about the peptide-spectrum match.
feature: the relation between a peptide and its intensity in a given file.
peptide: peptide + sample relation file with the corresponding abundance. We may not have this file for the differential expression datasets because MSstats does not provide peptide quant information as output.
protein: protein + sample relation file with the corresponding abundance.
ae: absolute expression file, with the expression of a protein in a given sample.
de: differential expression file for all proteins with two contrast variables.
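
As a rough illustration of point 1, the sketch below stores values in a long (non-wide) layout and pivots one property at a time into a peptide-by-sample matrix. The row layout, the property names (`abundance`, `psm_count`) and the helper `to_wide` are assumptions made for this example; they are not part of the specification.

```python
from collections import defaultdict

# Hypothetical long-format rows: (peptide, sample, property, value).
# Adding a new property (e.g. psm_count) only adds rows, not new columns.
long_rows = [
    ("PEPTIDEK", "sample_1", "abundance", 1200.0),
    ("PEPTIDEK", "sample_2", "abundance", 950.0),
    ("PEPTIDEK", "sample_1", "psm_count", 4),
    ("ELVISK", "sample_1", "abundance", 300.0),
]


def to_wide(rows, prop):
    """Pivot one property of the long table into a peptide -> sample -> value mapping."""
    wide = defaultdict(dict)
    for peptide, sample, name, value in rows:
        if name == prop:
            wide[peptide][sample] = value
    return {peptide: dict(samples) for peptide, samples in wide.items()}


abundance_matrix = to_wide(long_rows, "abundance")
psm_matrix = to_wide(long_rows, "psm_count")
```

The peptide and sample identifiers repeat on every row, which is the duplication mentioned above; the wide view is derived on demand rather than stored.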

jpfeuffer commented 11 months ago

I don't think you should specify the separator for lists everywhere. This should be set once globally and be the same for every list type

jpfeuffer commented 11 months ago

General comment: Should we remove "peptide" and "protein" from the column names? I think it might be a bit redundant and might lead to additional if-cases in the code? Should we remove the "opt_global" prefix from the old mzTab columns? Yes, I know, people would immediately see the correspondence to mzTab but I think the prefix does not help much. I guess in the end they should be CVs anyway.

timosachsenberg commented 11 months ago

one thing to consider: how can we handle/report global protein groups? I think this is what others might do to come up with nice plots. Is there an easy way to get from this data to a "protein(group) data matrix"?

timosachsenberg commented 11 months ago

to make it more future-proof one could consider adding an ion_mobility column

For protein section we could also consider:

gene: gene of the selected protein
genes: additional genes the identified peptide may originate from

A lot of downstream tasks will expect gene names, so they would add a lot of value for consumers.

ypriverol commented 11 months ago

to make it more future-proof one could consider adding an ion_mobility column

The current schema can be evolved to new versions with other fields.

For protein section we could also consider:

gene of selected protein

genes additional genes the identified peptide may originate from as a lot of downstream tasks will expect gene names and add a lot of value for consumers

@timosachsenberg Do you think that apart from protein_accessions we should add gene_accessions and gene_names at PSM level?

timosachsenberg commented 11 months ago

Do you think that apart from protein_accessions we should add gene_accessions and gene_names?

If there is a sensible way to do so, I would say yes! It will definitely make a difference in how many non-proteomics labs use this data.

ypriverol commented 11 months ago

Do you think that apart from protein_accessions we should add gene_accessions and gene_names?

If there is a sensible way to do, so I would say yes! Definitely will make a difference how many non-proteomic labs will use this data.

I will make them optional. In the last specification, I defined some optional columns, like the spectrum information. Good suggestion @timosachsenberg .

timosachsenberg commented 11 months ago

maybe also check if we are missing something essential from e.g., http://www.coxdocs.org/doku.php?id=maxquant:table:proteingrouptable and https://fragpipe.nesvilab.org/docs/tutorial_fragpipe_outputs.html

timosachsenberg commented 11 months ago

I think it is really essential that other labs can easily get a "gene expression matrix" and we might want to spend some time discussing this.

ypriverol commented 11 months ago

I think it is really essential that other labs can easily get a "gene expression matrix" and we might want to spend some time discussing this.

For now the format is for our quantms use cases

jpfeuffer commented 11 months ago

Also, I think this is a proteomics format. How would you even start summarizing the expression of different isoforms of a gene?

timosachsenberg commented 11 months ago

I guess there are multiple options to do so. https://www.mcponline.org/article/S1535-9476(22)00245-6/fulltext "we recommend doing differential abundance analysis on gene level and using isoform-level quantification only in cases where enough information is available."

Just saying that if we don't intend this format to be used by others, then we can of course do whatever we want. If this should also be used by people outside proteomics, then it is absolutely mandatory that we come up with a way to represent/derive gene expression matrices.

jpfeuffer commented 11 months ago

I think everyone can derive gene expression matrices in the way this paper describes by using the peptide-level results.
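
A minimal sketch of what such a derivation could look like, assuming peptide-level rows already carry a gene annotation. The row layout, the gene names and the choice to sum only peptides that map to a single gene are assumptions for illustration; this is a naive roll-up, not the procedure from the linked paper.

```python
from collections import defaultdict

# Hypothetical peptide-level rows: (peptide, genes, sample, abundance).
peptide_rows = [
    ("PEPTIDEK", ("GENE_A",), "sample_1", 100.0),
    ("PEPTIDEK", ("GENE_A",), "sample_2", 80.0),
    ("ELVISK", ("GENE_A",), "sample_1", 50.0),
    ("SHAREDR", ("GENE_A", "GENE_B"), "sample_1", 999.0),  # maps to two genes
    ("LIVESK", ("GENE_B",), "sample_2", 20.0),
]


def gene_matrix(rows):
    """Sum abundances per gene and sample, skipping peptides shared between genes."""
    matrix = defaultdict(lambda: defaultdict(float))
    for _peptide, genes, sample, abundance in rows:
        if len(genes) != 1:
            continue  # ambiguous peptide: ignored in this naive roll-up
        matrix[genes[0]][sample] += abundance
    return {gene: dict(samples) for gene, samples in matrix.items()}


gene_expression = gene_matrix(peptide_rows)
```

How shared peptides and isoforms are handled is exactly the open question in this thread, so the skip rule above is only one of several possible choices.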

jpfeuffer commented 11 months ago

I think we do not have the manpower for new algorithms, adaption of the workflow and re-analysis with different databases, etc. to support that.

fabianegli commented 10 months ago

RE @lazear's suggestion for using UUIDs in filenames, it would also be possible to include the "cause" in the filename, e.g.

{
    "peptide_table": "peptide_table-14e1299f-233a-40a0-9a75-ff1393151652.parquet",
    "protein_table": "protein_table-e00f08f5-b3ab-463c-9771-0acc7144485e.parquet"
}

fabianegli commented 10 months ago

The intensity in this spec is now most often a float and once a double (from what I saw in a quick look at the PR). Sometimes intensities can be quite large numbers and might be above what the parquet float type (32 bit) can hold. If an intensity is roughly 4.3*10^9 (4'294'967'296 to be exact) it will not fit. If I am not mistaken, intensities can be in that range and it would probably be better to use the double data type for intensities.

lazear commented 10 months ago

The intensity in this spec is now most often a float and once a double (from what I saw in a quick look at the PR). Sometimes intensities can be quite large numbers and might be above what the parquet float type (32 bit) can hold. If an intensity is roughly 4.3*10^9 (4'294'967'296 to be exact) it will not fit. If I am not mistaken, intensities can be in that range and it would probably be better to use the double data type for intensities.

IEEE-754 floats have a max value of 3.40282347E+38. So I think hopefully we are safe :)
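
A quick way to check the range claim with the Python standard library is to round-trip values through a 32-bit float using `struct`. The helper name `to_float32` is made up for this sketch.

```python
import struct

# Largest finite IEEE-754 single-precision value: (2 - 2**-23) * 2**127.
FLOAT32_MAX = (2 - 2**-23) * 2.0**127


def to_float32(value):
    """Round-trip a Python float (64-bit) through a 32-bit IEEE-754 float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


stored = to_float32(4294967296.0)  # 2**32, the intensity mentioned above
largest = to_float32(FLOAT32_MAX)

# Values beyond the float32 range cannot be packed at all.
try:
    struct.pack("<f", 1.0e39)
    overflowed = False
except OverflowError:
    overflowed = True
```

Here `stored` equals 4294967296.0, so an intensity of that size fits; the limit that 2**32 hits is the unsigned 32-bit integer range, not the float range.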

ypriverol commented 10 months ago

{
    "peptide_table": "peptide_table-14e1299f-233a-40a0-9a75-ff1393151652.parquet",
    "protein_table": "protein_table-e00f08f5-b3ab-463c-9771-0acc7144485e.parquet"
}

@fabianegli @lazear How do we recommend generating the UUIDs? Based on checksums?

fabianegli commented 10 months ago

UUIDs don't relate to the content of the files, they should just be random and contain enough information to make them globally unique. See this SO post for a short guide to uuid generation in Python.

ypriverol commented 10 months ago

UUIDs don't relate to the content of the files, they should just be random and contain enough information to make them globally unique. See this SO post for a short guide to uuid generation in Python.

OK. However, I think we should give some guidelines on what the UUIDs should look like.

fabianegli commented 10 months ago

They are defined in RFC 4122. Or do you mean the filename composition here in this PR? In that case I agree it would be good to have a defined recipe to generate those filenames.
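
One possible recipe, sketched with Python's standard `uuid` module: a random (version 4) UUID appended to the table type. The function name `table_filename` and the `<table_type>-<uuid>.parquet` pattern are assumptions based on the example filenames earlier in this thread, not an agreed convention.

```python
import uuid


def table_filename(table_type, extension="parquet"):
    """Build '<table_type>-<random uuid4>.<extension>' (naming scheme is an assumption)."""
    return f"{table_type}-{uuid.uuid4()}.{extension}"


# A metadata mapping like the one shown earlier in the thread.
manifest = {name: table_filename(name) for name in ("peptide_table", "protein_table")}
```

Because `uuid.uuid4()` is random, the names do not depend on file content, so regenerating a table yields a new filename even if the data is unchanged.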

fabianegli commented 10 months ago

IEEE-754 floats have a max value of 3.40282347E+38. So I think hopefully we are safe :)

@lazear Do you also know if the 6 digits of precision that they come with will be sufficient?

ypriverol commented 10 months ago

I will merge this first PR, including the first proposal, into main. Future changes should be submitted as PRs.

lazear commented 10 months ago

IEEE-754 floats have a max value of 3.40282347E+38. So I think hopefully we are safe :)

@lazear Do you also know if the 6 digits of precision that they come with will be sufficient?

I'm not sure what you mean by 6 digits of precision. The precision of IEEE floats and doubles changes as a function of the integer component of the number. https://en.m.wikipedia.org/wiki/IEEE_754

The authors of the mzMLb format have a nice overview of how the mantissa length affects errors in proteomics data: see table 1 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7871438/

They conclude that you can use less precision than provided by an IEEE-754 float for storing mz and intensity values in mzMLb. I don't think the minimal loss of precision from double->float will have a significant impact in the use case of storing search engine scores.
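
To make the "precision depends on magnitude" point concrete, the sketch below measures the gap between a value and the next representable 32-bit float at a few magnitudes. The helper names are made up for this example, and it only handles positive finite values.

```python
import struct


def to_float32(value):
    """Round a Python float to the nearest 32-bit IEEE-754 float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def next_float32(value):
    """Return the next representable float32 above a positive, finite float32."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


def float32_gap(value):
    """Distance from value (rounded to float32) to the next float32 above it."""
    base = to_float32(value)
    return next_float32(base) - base


gaps = {m: float32_gap(m) for m in (1.0, 1.0e6, 4294967296.0)}
```

The gap grows from 2**-23 near 1.0 to 0.0625 near one million and 512 near 2**32, so the relative precision stays roughly constant (about 7 decimal digits) while the absolute precision does not.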

jpfeuffer commented 10 months ago

In my experience it can make a difference when calculating TD FDRs, where the exact ranking matters and an additional significant digit can push an identification several ranks up or down. Whether anyone would make a decision based on the FDR when the underlying scores are so close together is another question, though. It is also hard to say how often this happens.
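
A small sketch of how this can happen, using invented scores: two PSMs that are distinct as doubles collapse to the same value as 32-bit floats, so their relative rank then depends only on input order. The PSM names and score values are hypothetical.

```python
import struct


def to_float32(value):
    """Round a Python float to the nearest 32-bit IEEE-754 float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Hypothetical (name, score) pairs; higher score is better.
psms = [
    ("target_psm", 25.0000001),
    ("decoy_psm", 25.0000002),
    ("target_psm_2", 24.5),
]

# Ranking on the original double-precision scores.
ranked_double = [name for name, _ in sorted(psms, key=lambda p: p[1], reverse=True)]

# Ranking after the scores have been stored as float32.
ranked_float32 = [
    name for name, _ in sorted(psms, key=lambda p: to_float32(p[1]), reverse=True)
]

# Both top scores round to exactly 25.0 in float32, so they tie.
tied = to_float32(psms[0][1]) == to_float32(psms[1][1])
```

With doubles the decoy ranks first; with float32 the two scores tie and the stable sort keeps the target first, which changes the target/decoy order at that rank.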

ypriverol commented 10 months ago

Scores are double.

lazear commented 10 months ago

UUIDs don't relate to the content of the files, they should just be random and contain enough information to make them globally unique. See this SO post for a short guide to uuid generation in Python.

OK. However, I think we should give some guidelines on what the UUIDs should look like.

UUIDs are just a suggestion - they are pretty well supported by most programming languages, so generating them is convenient.

I like @fabianegli's suggestion of prepending the table type as well. In the end, if you're pairing the filenames with some kind of JSON metadata with links to the files, the filenames themselves can be treated as opaque (e.g. if it doesn't make sense to read the parquet files without the metadata, then the filenames don't matter) - if that's not the case, then more care needs to be given to naming them (prepending table type)

fabianegli commented 10 months ago

@lazear Thank you for the link to the mzMLb paper, that was indeed a good read.

bigbio / quantms.io

Add protein, peptide, psm io format #3