Closed daichengxin closed 10 months ago
I don't think you should specify the separator for lists everywhere. This should be set once globally and be the same for every list type
General comment: Should we remove "peptide" and "protein" from the column names? I think it might be a bit redundant and might lead to additional if-cases in the code? Should we remove the "opt_global" prefix from the old mzTab columns? Yes, I know, people would immediately see the correspondence to mzTab but I think the prefix does not help much. I guess in the end they should be CVs anyway.
One thing to consider: how can we handle/report global protein groups? I think this is what others might do to come up with nice plots. Is there an easy way to get from this data to a "protein(group) data matrix"?
For protein section we could also consider:
- to make it more future-proof one could consider adding an ion_mobility column
The current schema can be evolved to new versions with other fields.
For protein section we could also consider:
- gene of selected protein
- genes: additional genes the identified peptide may originate from, as a lot of downstream tasks will expect gene names and this would add a lot of value for consumers
@timosachsenberg Do you think that apart from protein_accessions we should add gene_accessions and gene_names at PSM level?
Do you think that apart from protein_accessions we should add gene_accessions and gene_names?
If there is a sensible way to do so, I would say yes! It will definitely make a difference in how many non-proteomics labs use this data.
I will make them optional. In the last specification, I defined some optional columns, like the spectrum information. Good suggestion @timosachsenberg .
maybe also check if we are missing something essential from e.g., http://www.coxdocs.org/doku.php?id=maxquant:table:proteingrouptable and https://fragpipe.nesvilab.org/docs/tutorial_fragpipe_outputs.html
I think it is really essential that other labs can easily get a "gene expression matrix" and we might want to spend some time discussing this.
For now the format is for our quantms use cases
Also, I think this is a proteomics format. How would you even start summarizing the expression of different isoforms of a gene?
I guess there are multiple options to do so. https://www.mcponline.org/article/S1535-9476(22)00245-6/fulltext "we recommend doing differential abundance analysis on gene level and using isoform-level quantification only in cases where enough information is available."
Just saying that if we don't intend this format to be used by others, then we can of course do whatever we want. If this should also be used by people outside proteomics, then it is absolutely mandatory that we come up with a way to represent/derive gene expression matrices.
I think everyone can derive gene expression matrices in the way this paper describes by using the peptide-level results.
I think we do not have the manpower for new algorithms, adaption of the workflow and re-analysis with different databases, etc. to support that.
RE @lazear's suggestion for using UUIDs in filenames, it would also be possible to include the "cause" in the filename, e.g.
{
"peptide_table": "peptide_table-14e1299f-233a-40a0-9a75-ff1393151652.parquet",
"protein_table": "protein_table-e00f08f5-b3ab-463c-9771-0acc7144485e.parquet"
}
The intensity in this spec is now most often a float and once a double (from what I saw in a quick look at the PR). Sometimes intensities can be quite large numbers and might be above what the parquet float type (32 bit) can hold. If an intensity is roughly 4.3*10^9 (4'294'967'296 to be exact) it will not fit. If I am not mistaken, intensities can be in that range and it would probably be better to use the double data type for intensities.
IEEE-754 floats have a max value of 3.40282347E+38. So I think hopefully we are safe :)
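The range argument can be checked with a minimal sketch using only the Python standard library. The helper names `to_float32` and `fits_float32` are made up for this illustration; they are not part of the spec or of any parquet library. Note that this only addresses range, not precision.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fits_float32(x: float) -> bool:
    """Return True if x is within the finite range of a 32-bit float."""
    try:
        struct.pack("<f", x)
        return True
    except OverflowError:
        return False


# Largest finite IEEE-754 single-precision value, about 3.4028235e38.
FLOAT32_MAX = (2 - 2**-23) * 2.0**127

# The intensity mentioned above: 2**32 is far below the float32 maximum
# and, being a power of two, is even represented exactly.
intensity = 4_294_967_296.0
print(fits_float32(intensity), to_float32(intensity) == intensity)

# Only values beyond roughly 3.4e38 fail to fit.
print(fits_float32(FLOAT32_MAX), fits_float32(1e39))
```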
@fabianegli @lazear How do we recommend generating the UUIDs? Based on checksums?
UUIDs don't relate to the content of the files, they should just be random and contain enough information to make them globally unique. See this SO post for a short guide to uuid generation in Python.
OK. However, I think we should give some guidelines on what the UUIDs should look like.
They are defined in the RFC 4122, or do you mean the filename composition here in this PR? In which case I agree it would be good to have a defined recipe to generate those filenames.
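One possible recipe, as a sketch only: prepend the table type to a random (version 4) UUID as defined in RFC 4122, using Python's standard `uuid` module. The function name `make_table_filename` and the exact `<table_type>-<uuid>.parquet` layout are assumptions taken from the example filenames in this thread, not something the spec has settled on.

```python
import uuid


def make_table_filename(table_type: str) -> str:
    """Build '<table_type>-<random UUID>.parquet' (hypothetical naming recipe)."""
    # uuid4() is random; it does not depend on the content of the file.
    return f"{table_type}-{uuid.uuid4()}.parquet"


print(make_table_filename("peptide_table"))
print(make_table_filename("protein_table"))
```

Because the UUID is random, two runs produce different filenames for the same data; a checksum-based name would be a different design with different trade-offs.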
@lazear Do you also know if the 6 digits of precision that they come with will be sufficient?
I will merge this first PR including the first proposal into main. Future changes should be provided over PRs.
I'm not sure what you mean by 6 digits of precision. The precision of IEEE floats and doubles changes as a function of the integer component of the number. https://en.m.wikipedia.org/wiki/IEEE_754
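To make the magnitude dependence concrete, here is a small sketch that computes the gap between neighbouring 32-bit float values near a given number. It assumes normal (not subnormal), finite, non-zero inputs, and `float32_spacing` is a name invented for this example.

```python
import math


def float32_spacing(x: float) -> float:
    """Gap between adjacent float32 values near x (normal, finite, non-zero x)."""
    # abs(x) == m * 2**exponent with 0.5 <= m < 1
    _, exponent = math.frexp(abs(x))
    # float32 has a 24-bit significand (23 stored bits plus the implicit one)
    return math.ldexp(1.0, exponent - 24)


for value in (1.0, 1e6, 4_294_967_296.0):
    print(value, float32_spacing(value))
```

The relative spacing stays at roughly 1 part in 10^7, while the absolute spacing grows with the value: near 2^32 neighbouring float32 values are 512 apart.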
The authors of the mzMLb format have a nice overview of how the mantissa length affects errors in proteomics data: see table 1 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7871438/
They conclude that you can use less precision than provided by an IEEE-754 float for storing mz and intensity values in mzMLb. I don't think the minimal loss of precision from double->float will have a significant impact in the use case of storing search engine scores
In my experience it can make a difference when calculating target-decoy (TD) FDRs, where the exact ranking matters and an additional significant digit can push an identification several ranks up or down. Whether anyone would make a decision based on the FDR when the underlying scores are that close together is another question, though. It is also hard to say how often this happens.
Scores are double.
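The ranking concern can be illustrated with a tiny sketch: two scores that are distinct as doubles can collapse to the same value once rounded to 32 bits, so their relative rank is no longer determined by the score. The score values are made up for illustration and `to_float32` is a helper defined here, not a library function.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Two hypothetical search engine scores that differ in the 9th significant digit.
score_a = 10.0000001
score_b = 10.0000002

print(score_a < score_b)                          # distinct as doubles
print(to_float32(score_a) == to_float32(score_b))  # tied after rounding to float32
```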
UUIDs are just a suggestion - they are pretty well supported by most programming languages, so generating them is convenient.
I like @fabianegli's suggestion of prepending the table type as well. In the end, if you're pairing the filenames with some kind of JSON metadata with links to the files, the filenames themselves can be treated as opaque (e.g. if it doesn't make sense to read the parquet files without the metadata, then the filenames don't matter) - if that's not the case, then more care needs to be given to naming them (prepending table type)
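A minimal sketch of the "opaque filenames" idea: a reader looks tables up by their key in the JSON metadata and never parses the filename itself. The metadata layout is taken from the example earlier in this thread; the function name `resolve_tables` and the base directory are hypothetical.

```python
import json
from pathlib import Path

# Metadata as in the example above (keys name the table type, values the file).
metadata_text = """{
  "peptide_table": "peptide_table-14e1299f-233a-40a0-9a75-ff1393151652.parquet",
  "protein_table": "protein_table-e00f08f5-b3ab-463c-9771-0acc7144485e.parquet"
}"""


def resolve_tables(metadata: str, base_dir: str) -> dict:
    """Map each table name to its file path, treating filenames as opaque."""
    tables = json.loads(metadata)
    return {name: Path(base_dir) / filename for name, filename in tables.items()}


paths = resolve_tables(metadata_text, "results")  # "results" is a placeholder directory
print(paths["peptide_table"])
```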
@lazear Thank you for the link to the mzMLb paper, that was indeed a good read.
Hi all @daichengxin @jpfeuffer @timosachsenberg @lazear:
Here are some conclusions that are now reasonably clear:
1- For a better parquet definition, we can't have a design based on _{} (the wide design). While the wide design is more compact, it also creates complications, because you need to know the combination of _{} with the specific property. For example, at the peptide level, if you have one peptide per row and each column is the expression/abundance in a specific sample, and in a later iteration we want to track the number of PSMs per sample, then we need to create another relation _{}. The current (non-wide) format will have the problem of duplication for some fields, which we think can be avoided or mitigated by compression.
2- The present figure defines the relation between all the files in quantms.
- psm: all the information about the peptide-spectrum match.
- feature: contains the relation between a peptide and its intensity in a given file.
- peptide: peptide + sample relation file with the corresponding abundance. We may not have this file for the differential expression datasets because MSstats does not provide peptide quant information as output.
- protein: protein + sample relation file with the corresponding abundance.
- ae: absolute expression file, giving the expression of a protein in a given sample.
- de: differential expression file for all proteins with two contrast variables.
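As a rough sketch of how a consumer could still get a wide "protein x sample" matrix from the non-wide (long) layout, the example below pivots long rows using only the standard library. The column names `protein`, `sample`, `abundance` and `psm_count` are illustrative assumptions, not the names fixed by the spec.

```python
# Long (non-wide) rows: one row per protein + sample combination.
long_rows = [
    {"protein": "P1", "sample": "S1", "abundance": 10.0, "psm_count": 3},
    {"protein": "P1", "sample": "S2", "abundance": 12.5, "psm_count": 4},
    {"protein": "P2", "sample": "S1", "abundance": 7.0, "psm_count": 1},
]


def pivot_to_matrix(rows, value_key):
    """Pivot long rows into (sample names, {protein: [value per sample]})."""
    samples = sorted({row["sample"] for row in rows})
    by_protein = {}
    for row in rows:
        by_protein.setdefault(row["protein"], {})[row["sample"]] = row[value_key]
    # Missing protein/sample combinations become None.
    matrix = {p: [vals.get(s) for s in samples] for p, vals in by_protein.items()}
    return samples, matrix


samples, abundance = pivot_to_matrix(long_rows, "abundance")
_, psm_counts = pivot_to_matrix(long_rows, "psm_count")
print(samples, abundance, psm_counts)
```

In the long layout, adding a per-sample property such as the PSM count is just one more column, and the same pivot works for it; in a wide layout it would require a new set of per-sample columns.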