ypriverol opened 2 weeks ago
@zprobot I already added the score to PSI-MS:
- id: MS:1003433
- name: Andromeda:delta score

I have collected them in `additional_scores`:
- Score -> andromeda_score
- Delta score -> delta_score
@zprobot we should discuss the naming of the scores. One idea I have is to add an additional parquet/csv called metadata.csv or metadata.parquet where we map all the keywords you are using to ontology terms, for example andromeda_score -> Andromeda:score, and also provide the accession in PSI-MS.
What do you think?
Agreed. We can have a mapping table to display these.
Can you model it? The use case would be that, for scores, column names, etc., where an acronym is used, for example `posterior_error_probability`, we can find the correct CV term for each in that table. @jpfeuffer what do you think?
It could be called `psi-ms-terms.parquet`.
Why do we use acronyms instead of the full name?
Ah you mean the ontology mapping. But the mapping is defined in the ontology, why would we want to replicate it? Just use the full/display name of the ontology entry.
Yes. For example, we use the following score acronyms right now:
posterior_error_probability
andromeda_score
msgf_rawscore
etc.
It would be nice if we had a mapping table somewhere where the actual PSI term corresponding to each acronym is annotated, like:
term | ontology_name | ontology_accession |
---|---|---|
posterior_error_probability | posterior error probability from identification based on multiple spectra | MS:1003336 |
andromeda_score | Andromeda:score | MS:1002338 |
msgf_rawscore | MS-GF:RawScore | MS:1002049 |
This could help readers understand each column, etc. The idea is that we have to use acronyms because in some cases it is difficult to store the original term from PSI or other ontologies, since they contain spaces and special characters; it is better to have an acronym.
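A minimal sketch of how such a mapping table could be materialized as a parquet file with pyarrow (rows copied from the table above; the file name follows the `psi-ms-terms.parquet` suggestion):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Acronym -> PSI-MS term mapping, using the rows from the table above.
mapping = pa.table({
    "term": [
        "posterior_error_probability",
        "andromeda_score",
        "msgf_rawscore",
    ],
    "ontology_name": [
        "posterior error probability from identification based on multiple spectra",
        "Andromeda:score",
        "MS-GF:RawScore",
    ],
    "ontology_accession": ["MS:1003336", "MS:1002338", "MS:1002049"],
})

# Write it so tools can join acronyms against the controlled vocabulary.
pq.write_table(mapping, "psi-ms-terms.parquet")
```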
But can't the ontology have synonyms? I feel like this kind of mapping should not be our task.
Or we say that the name needs to match the ontology_name in snake_case. Only the unnecessarily long name of PEP would be a problem here.
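A rough sketch of what that convention could look like (a hypothetical helper, not part of any existing library); note that it yields `ms_gf_rawscore` rather than the `msgf_rawscore` acronym used above, which is exactly the naming question being discussed:

```python
import re

def to_snake_case(ontology_name: str) -> str:
    """Derive a column name from a PSI-MS term name."""
    # Replace every run of non-alphanumeric characters (":", "-", spaces)
    # with a single underscore, then lowercase.
    return re.sub(r"[^0-9A-Za-z]+", "_", ontology_name).strip("_").lower()

print(to_snake_case("Andromeda:score"))        # andromeda_score
print(to_snake_case("Andromeda:delta score"))  # andromeda_delta_score
print(to_snake_case("MS-GF:RawScore"))         # ms_gf_rawscore
```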
Agreed. But if the terms do not exist now, then I suggest having this table as optional to enable easy search, at least in our toolbox.
If this is an interim solution, I feel like we can just do without it. It is pretty clear what the score names mean. I really want to avoid having yet another table.
This is why I think it should be optional. These acronyms could be a bigger list, BTW. We use acronyms in scores, table column names, and additional information from the original search engines.
I still don't like it. Everything that we make optional is an additional if-case for everyone using that format, an additional check to see whether that file is just missing or was forgotten. It also allows people to circumvent ontologies and start their own naming schemes, etc.
This is exactly my point.

> It also allows people to circumvent ontologies and start their own naming schemes etc.
A lot of terms are not ready for data handling. For example, `percolator:PEP` is difficult if you want to avoid special characters like `:`, and it can be worse in other cases. This is why I have started to use acronyms, which enable querying, sorting by scores in DuckDB, etc.
But then, why do you need this table now? Document it for now and hard-code the CV term in a potential validator. Once the synonym is available in the ontology, you can switch the validator from a hard-coded dict to an actual ontology lookup
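A minimal sketch of that interim approach (hypothetical validator code; the entries come from the mapping table earlier in the thread):

```python
# Hard-coded acronym -> PSI-MS accession mapping for an interim validator.
# Once the synonyms exist in the ontology, this dict can be replaced by an
# actual ontology lookup without changing the callers.
SCORE_TERMS = {
    "posterior_error_probability": "MS:1003336",
    "andromeda_score": "MS:1002338",
    "msgf_rawscore": "MS:1002049",
}

def validate_score_name(name: str) -> str:
    """Return the PSI-MS accession for a known score acronym, or raise."""
    try:
        return SCORE_TERMS[name]
    except KeyError:
        raise ValueError(f"unknown score name: {name!r}") from None
```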
Because I don't want to hard-code everything in the validator.
Then you could have the mapping file in your validator, but I would like to avoid a wild-west format where people can map score names arbitrarily to some ontology. In the end you will have one dataset where PEP means posterior error probability and another where it means percent endogenous peptide or whatever.
OK, so your idea is that the format itself, meaning the validator, releases an internal file for the mapping?
Yes because the mapping should be the same for every dataset out there.
Then this file, `psi-ms-terms.parquet`, could be an internal file maintained by us?
Yes, fine with me. But we really should put the used synonyms back into the actual ontology if possible.
> Yes, fine with me. But we really should put the used synonyms back into the actual ontology if possible.
I will try to trigger the conversation, but it may take a while 😉. @zprobot the idea is to keep the mapping table within the format library.
We can use a unified format to represent the scores given by search engines, like `{software}_score`. I think the mapping table is just for display purposes, used to view the available optional fields.
Two things:

Is the issue with `:`- and space-containing column names that they are impossible, or just not ergonomic? Most SQL engines support column names that aren't "proper identifiers" as long as they are enclosed in double quotes. This applies to DuckDB, and I also tested it with `pyarrow`/`pyarrow.parquet` and `datafusion`.
e.g. using `duckdb` from Python with a test table mocked up for convenience:
```
>>> conn.sql("SELECT * FROM test;").show()
┌─────────────────┬─────────────┐
│ Andromeda:score │ scan number │
│      float      │    int32    │
├─────────────────┼─────────────┤
│            10.0 │           1 │
│            24.0 │           2 │
│            -2.0 │           3 │
└─────────────────┴─────────────┘
>>> conn.sql("""SELECT "Andromeda:score" FROM test;""").show()
┌─────────────────┐
│ Andromeda:score │
│      float      │
├─────────────────┤
│            10.0 │
│            24.0 │
│            -2.0 │
└─────────────────┘
```
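For comparison, the same kind of column name is unproblematic in pyarrow as well (a sketch; the table contents and file name are mocked up the same way):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Mock table with a ':'-containing and a space-containing column name.
table = pa.table({
    "Andromeda:score": pa.array([10.0, 24.0, -2.0], type=pa.float32()),
    "scan number": pa.array([1, 2, 3], type=pa.int32()),
})
pq.write_table(table, "test.parquet")

# Reading a single column back by its exact name works as expected.
print(pq.read_table("test.parquet", columns=["Andromeda:score"]))
```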
I agree with the argument that adding an alias table that introduces a combinatorial expansion of possible names for common columns is a big footgun.
The use of CURIEs is maximally stable but minimally readable. The use of CURIE-backed names is a good compromise between readability and stability. If the CURIE-backed name isn't convenient, synonyms in the controlled vocabulary centralize the aliases, although if every term is heavily aliased we've not done anyone any favors.
I was thinking of something more ergonomic, meaning users don't need to deal with so many characters that require escaping.
Backing up a step, aren't scores encoded as pairs?
```json
{
  "type": "array",
  "items": {
    "type": "struct",
    "fields": [
      {"name": "score_name", "type": "string"},
      {"name": "score_value", "type": "float32"}
    ]
  }
}
```
or did this change while I wasn't paying attention?
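In pyarrow terms, that pair encoding would be roughly (a sketch of the type only):

```python
import pyarrow as pa

# A list of (score_name, score_value) structs, matching the schema above.
score_pairs = pa.list_(
    pa.struct([
        ("score_name", pa.string()),
        ("score_value", pa.float32()),
    ])
)
print(score_pairs)  # prints the nested list<struct<...>> type
```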
This is the way it is implemented:
{"name": "additional_scores",
"type": {"type": "array",
"items": { "type":
"struct", "field": {
"name": "string",
"value": "float32"
}
}
}
The point is that the name could be `Andromeda:score` or `Andromeda:delta score`, which is not nice to filter or group by, etc.
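To make that concrete, pulling one named score out of the `additional_scores` list of structs in DuckDB looks roughly like this (a sketch; the table, the `psm_id` column, and the mocked values are assumptions for illustration):

```python
import duckdb

conn = duckdb.connect()
# Mock a tiny PSM table with an additional_scores list-of-structs column.
conn.sql("""
    CREATE TABLE psms AS
    SELECT 1 AS psm_id,
           [{'name': 'Andromeda:score', 'value': 10.0},
            {'name': 'Andromeda:delta score', 'value': 2.5}] AS additional_scores
""")

# Filtering on a named score requires unnesting the list and matching the
# stored name string, which is the ergonomic cost being discussed.
conn.sql("""
    SELECT psm_id, s.value AS andromeda_score
    FROM (SELECT psm_id, UNNEST(additional_scores) AS s FROM psms)
    WHERE s.name = 'Andromeda:score'
""").show()
```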
I guess for `additional_scores` you can just add another field, "CV term", and then use any "name" you like. But I thought you were also worried about other columns?
But in that case the score names are strings, all requiring quoting, where "special characters" do not matter unless you are typing them out by hand repeatedly for an ad hoc query.
Assuming a QWERTY layout, for `Andromeda:score` vs `andromeda_score` you press only one extra key to write the CV name instead of the snake_case'd name, namely the shift key for the uppercase "A"; the ":" and the "_" both cost a shift. For `Andromeda:delta score` vs `andromeda_delta_score` you actually break even, because you convert the space into a "_" which costs an extra shift, balancing the cost of the extra capitalization.
I suppose the goal here is to produce a file format that is suited directly to the `quantms` pipeline's output, though, in which case adding a new search engine is a breaking change in any case, so updating an alias table is par for the course.
@zprobot:
I have been looking at some MaxQuant examples for ms/ms. MaxQuant has the following scores: Score and Delta score.
Additionally, the delta score needs to be added to PSI-MS so that it can be included in the ms/ms table: https://github.com/HUPO-PSI/psi-ms-CV/issues/356