The proteomics quantification format, extending mzTab for large scale datasets.

MaxQuant scores #82

Open ypriverol opened 2 weeks ago

ypriverol commented 2 weeks ago

@zprobot :

I have been looking at some MaxQuant examples for ms/ms. MaxQuant has the following scores:

Additionally, the delta score needs to be added to PSI-MS so that we can add it to the ms/ms:

ypriverol commented 2 weeks ago

@zprobot I already added the score to PSI-MS: id:

name: Andromeda:delta score

zprobot commented 2 weeks ago

I have collected them in additional_scores.

- Score -> andromeda_score
- Delta score -> delta_score

ypriverol commented 2 weeks ago

@zprobot we should discuss the naming of the scores. One idea I have is an additional parquet/csv file called metadata.csv or metadata.parquet, where we map all the keywords you are using to ontology terms. For example, andromeda_score: Andromeda:score, together with the accession in PSI-MS.

What do you think?

zprobot commented 2 weeks ago

Agreed. We can have a mapping table to display these.

ypriverol commented 2 weeks ago

Can you model it? The use case is scores, column names, etc.: wherever an acronym is used, for example posterior_error_probability, we can find the correct cvterm for it in that table. @jpfeuffer what do you think?

It could be called: psi-ms-terms.parquet

jpfeuffer commented 2 weeks ago

Why do we use acronyms instead of the full name?

jpfeuffer commented 2 weeks ago

Ah you mean the ontology mapping. But the mapping is defined in the ontology, why would we want to replicate it? Just use the full/display name of the ontology entry.

ypriverol commented 2 weeks ago

Yes, for example, we use the following score acronyms right now:

posterior_error_probability andromeda_score msgf_rawscore

It would be nice to have a mapping table somewhere in which the actual PSI term corresponding to each acronym is annotated, like:

| term | ontology_name | ontology_accession |
|---|---|---|
| posterior_error_probability | posterior error probability from identification based on multiple spectra | MS:1003336 |
| andromeda_score | Andromeda:score | MS:1002338 |
| msgf_rawscore | MS-GF:RawScore | MS:1002049 |

This could help to understand each column, etc. The idea is that we have to use acronyms because in some cases it is difficult to store the original term from PSI or other ontologies: they contain spaces and special characters, so an acronym is better.
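As a minimal sketch of what such a mapping could look like in code, the three rows listed above can be held in a dictionary keyed by acronym. The names `CvTerm`, `TERM_TABLE`, and `lookup_term` are hypothetical and not part of the format or any existing library.

```python
from typing import NamedTuple, Optional


class CvTerm(NamedTuple):
    """One row of the proposed mapping table (hypothetical structure)."""
    ontology_name: str
    ontology_accession: str


# Rows copied from the example above; nothing else is assumed.
TERM_TABLE = {
    "posterior_error_probability": CvTerm(
        "posterior error probability from identification based on multiple spectra",
        "MS:1003336",
    ),
    "andromeda_score": CvTerm("Andromeda:score", "MS:1002338"),
    "msgf_rawscore": CvTerm("MS-GF:RawScore", "MS:1002049"),
}


def lookup_term(acronym: str) -> Optional[CvTerm]:
    """Return the CV term for an acronym, or None if it is not mapped."""
    return TERM_TABLE.get(acronym)
```

For example, `lookup_term("andromeda_score").ontology_accession` gives `"MS:1002338"`, while an unmapped acronym gives `None`.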

jpfeuffer commented 2 weeks ago

But can't the ontology have synonyms? I feel like this kind of mapping should not be our task.

jpfeuffer commented 2 weeks ago

Or we say that the name needs to match the ontology_name in snake_case. Only the unnecessarily long name of PEP would be a problem here.
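To illustrate the snake_case rule, here is one possible conversion, assuming that every run of non-alphanumeric characters becomes a single underscore and the result is lowercased. The function name `to_snake_case` and this exact rule are assumptions; the thread does not define them.

```python
import re


def to_snake_case(ontology_name: str) -> str:
    """Derive a column-friendly name from an ontology display name."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    collapsed = re.sub(r"[^0-9A-Za-z]+", "_", ontology_name)
    # Drop leading/trailing underscores and lowercase the result.
    return collapsed.strip("_").lower()


print(to_snake_case("Andromeda:score"))        # andromeda_score
print(to_snake_case("Andromeda:delta score"))  # andromeda_delta_score
print(to_snake_case("MS-GF:RawScore"))         # ms_gf_rawscore
```

Under this particular rule MS-GF:RawScore becomes ms_gf_rawscore rather than the msgf_rawscore acronym used earlier in the thread, so a purely mechanical rule would either change that name or need exceptions.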

ypriverol commented 2 weeks ago

Agreed. But if the terms do not exist now, then I suggest having this table as optional to enable easy search, at least in our toolbox.

jpfeuffer commented 2 weeks ago

If this is an interim solution, I feel like we can just do without it. It is pretty clear what the score names mean. I really want to avoid having yet another table.

ypriverol commented 2 weeks ago

This is why I think it should be optional. These acronyms could be a bigger list, BTW. We use acronyms in scores, table column names, and additional information from the original search engines.

jpfeuffer commented 2 weeks ago

I still don't like it. Everything that we make optional is an additional if-case for everyone using that format: an additional check to see whether that file is just missing or was forgotten. It also allows people to circumvent ontologies and start their own naming schemes, etc.

ypriverol commented 2 weeks ago

This is exactly my point:

> It also allows people to circumvent ontologies and start their own naming schemes etc

A lot of terms are not ready for data handling. For example, percolator:PEP is awkward if you want to avoid special characters like :, and it can be worse in other cases. This is why I have started to use acronyms, which make it easy to query and sort by scores in DuckDB, etc.

jpfeuffer commented 2 weeks ago

But then, why do you need this table now? Document it for now and hard-code the CV term in a potential validator. Once the synonym is available in the ontology, you can switch the validator from a hard-coded dict to an actual ontology lookup
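A rough sketch of that suggestion, assuming a validator that checks score names through a pluggable resolver: today the resolver reads a hard-coded dict, and later it could be replaced by a real ontology lookup without touching the check itself. All names here (`HARD_CODED_TERMS`, `resolve_hard_coded`, `find_unknown_scores`) are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Interim hard-coded mapping (score name -> PSI-MS accession),
# using only the accessions quoted in this thread.
HARD_CODED_TERMS: Dict[str, str] = {
    "posterior_error_probability": "MS:1003336",
    "andromeda_score": "MS:1002338",
    "msgf_rawscore": "MS:1002049",
}

# A resolver maps a score name to an accession, or None if unknown.
Resolver = Callable[[str], Optional[str]]


def resolve_hard_coded(name: str) -> Optional[str]:
    """Interim resolver backed by the hard-coded dict."""
    return HARD_CODED_TERMS.get(name)


def find_unknown_scores(
    names: Iterable[str], resolve: Resolver = resolve_hard_coded
) -> List[str]:
    """Return the score names that the resolver cannot map to a CV term."""
    return [name for name in names if resolve(name) is None]
```

Calling `find_unknown_scores(["andromeda_score", "my_score"])` returns `["my_score"]`. Once synonyms exist in the ontology, only the `resolve` argument would need to change.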

ypriverol commented 2 weeks ago

Because I don't want to hard-code everything in the validator.

jpfeuffer commented 2 weeks ago

Then you could have the mapping file in your validator, but I would like to avoid a wild west format where people can map score names arbitrarily to some ontology. In the end you will have one dataset where PEP means posterior error probability, and in the other percent endogenous peptide or whatever.

ypriverol commented 2 weeks ago

OK, so your idea is that the format itself, meaning the validator, ships an internal file for the mapping?

jpfeuffer commented 2 weeks ago

Yes because the mapping should be the same for every dataset out there.

ypriverol commented 2 weeks ago

Then, this file psi-ms-terms.parquet could be an internal file maintained by us?

jpfeuffer commented 2 weeks ago

Yes, fine with me. But we really should put the used synonyms back into the actual ontology if possible.

ypriverol commented 2 weeks ago

> Yes, fine with me. But we really should put the used synonyms back into the actual ontology if possible.

I will try to trigger the conversation, but it may take a while 😉. @zprobot the idea is to keep the mapping table within the format library.

zprobot commented 2 weeks ago

We can use a unified format to represent the scores given by search engines, like {software}_score. I think the mapping table is just for display purposes, used to view the available optional fields.

ypriverol commented 2 weeks ago

Two things:

mobiusklein commented 2 weeks ago

Is the issue with : and space-containing column names that they are impossible or not ergonomic? Most SQL engines support column names that aren't "proper identifiers" when they are enclosed in double quotes. This applies to DuckDB, and I also tested it with pyarrow/pyarrow.parquet and datafusion.

e.g. using duckdb from Python with a test table mocked up for convenience:

>>> conn.sql("SELECT * FROM test;").show()
│ Andromeda:score │ scan number │
│      float      │    int32    │
│            10.0 │           1 │
│            24.0 │           2 │
│            -2.0 │           3 │

>>> conn.sql("""SELECT "Andromeda:score" FROM test;""").show()
│ Andromeda:score │
│      float      │
│            10.0 │
│            24.0 │
│            -2.0 │
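The same double-quoting behaviour can be reproduced without installing anything. The sketch below swaps in Python's built-in `sqlite3` module in place of DuckDB, so it only illustrates standard SQL identifier quoting, not DuckDB itself; the table contents mirror the mock-up above.

```python
import sqlite3

# In-memory database with column names that are not plain identifiers.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE test ("Andromeda:score" REAL, "scan number" INTEGER)')
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(10.0, 1), (24.0, 2), (-2.0, 3)],
)

# Double quotes let the colon and the space appear in column references.
query = 'SELECT "Andromeda:score" FROM test ORDER BY "scan number"'
scores = [row[0] for row in conn.execute(query)]
print(scores)  # [10.0, 24.0, -2.0]
```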

I agree with the argument that adding an alias table that introduces a combinatorial expansion of possible names for common columns is a big footgun.

The use of CURIEs is maximally stable, but minimally readable. The use of CURIE-backed names is a good compromise between readability and stability. If the CURIE-backed name isn't convenient, synonyms in the controlled vocabulary centralize the aliases, although if every term is heavily aliased we've not done anyone any favors.

ypriverol commented 2 weeks ago

I was thinking of something more ergonomic, meaning that users don't need to deal with so many escaped characters.

mobiusklein commented 2 weeks ago

Backing up a step, aren't scores encoded as pairs?

{
  "type": "array",
  "items": {
    "type": "struct",
    "fields": [
        {"name": "score_name", "type": "string"},
        {"name": "score_value", "type": "float32"}
    ]
  }
}

or did this change while I wasn't paying attention?

ypriverol commented 2 weeks ago

This is the way it is implemented:

{"name": "additional_scores",
   "type": {"type": "array",
            "items": { "type":
                "struct", "field": {
                      "name": "string",
                      "value": "float32"
                }
            }
   }
}

The point is that the name could be Andromeda:score or Andromeda:delta score, which is not nice to filter, group, etc.

jpfeuffer commented 2 weeks ago

I guess for additional_scores you can just add another field "CV term" and then use any "name" you like.

But I thought you are also worried about other columns?
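As a sketch of the extra-field idea for additional_scores, each entry could carry a free-form name next to the CV accession. The field name `cv_term`, the helper `get_score`, and the score values are all made up for illustration; only the accessions come from this thread.

```python
from typing import Dict, List, Optional, Union

ScoreEntry = Dict[str, Union[str, float]]

# Hypothetical additional_scores content for one PSM (values are invented).
additional_scores: List[ScoreEntry] = [
    {"name": "andromeda_score", "cv_term": "MS:1002338", "value": 10.0},
    {"name": "msgf_rawscore", "cv_term": "MS:1002049", "value": 24.0},
]


def get_score(scores: List[ScoreEntry], name: str) -> Optional[float]:
    """Return the value of the first entry with the given name, if any."""
    for entry in scores:
        if entry["name"] == name:
            return float(entry["value"])
    return None


print(get_score(additional_scores, "andromeda_score"))  # 10.0
```

With this layout the short name stays convenient for filtering, while the accession travels with every value.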

mobiusklein commented 2 weeks ago

But in that case the score names are strings, all requiring quoting, and where "special characters" do not matter unless you are typing them out by hand repeatedly for an ad hoc query.

Assuming a QWERTY layout, for Andromeda:score vs andromeda_score, you press only one extra key to write the CV name instead of the snake-cased name: the shift key for the uppercase "A", since ":" and "_" both cost a shift. For Andromeda:delta score vs andromeda_delta_score you actually break even, because converting the space into a "_" costs an extra shift, balancing the cost of the extra capitalization.
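The keystroke arithmetic can be checked with a few lines, assuming a US QWERTY layout and counting one shift press per shifted character (holding shift across adjacent characters is not modelled). The helper name `key_presses` is hypothetical.

```python
# Symbols that need the shift key on a US QWERTY layout
# (uppercase letters are handled separately below).
SHIFTED_SYMBOLS = set('~!@#$%^&*()_+{}|:"<>?')


def key_presses(text: str) -> int:
    """One press per character plus one shift press per shifted character."""
    shifts = sum(1 for ch in text if ch.isupper() or ch in SHIFTED_SYMBOLS)
    return len(text) + shifts


for cv_name, snake_name in [
    ("Andromeda:score", "andromeda_score"),
    ("Andromeda:delta score", "andromeda_delta_score"),
]:
    print(cv_name, key_presses(cv_name), "vs", snake_name, key_presses(snake_name))
```

This gives 17 vs 16 presses for the first pair and 23 vs 23 for the second, matching the one-key difference and the break-even case described above.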

I suppose the goal here is to produce a file format that is suited directly to the quantms pipeline's output though, in which case adding a new search engine is a breaking change in any case, so updating an alias table is par for the course.

zprobot commented 2 weeks ago

We can provide a file like this. It is used to describe the information of all fields currently in use. Fields