The mandatory fields for "Human: All tissue-based experiments"

lisavetasol commented 4 years ago

I have a question about mandatory fields, especially for human tissues. I have looked at a few annotated projects, and it seems that in some (maybe even lots) of them the fields sex, age, development stage and/or ethnic group are 'not available', while fields such as modifications, enzyme, instrument, tolerance and technical replicate are almost always filled. As a person who sometimes deals with experiments, I can easily imagine that you don't really know the donor's age or ethnic group (or cannot tell anybody), but I doubt you don't know the enzyme or the instrument you used. Moreover, for most data analysis you need the enzyme and modifications rather than the donor's age. So I think it would be very useful to make more fields mandatory. I understand that it could be more difficult for people to fill in more mandatory fields, but an annotation without such information cannot really be used.

javizca commented 4 years ago

It is very difficult to find the right balance here. There is an increasing number of biological groups that use proteomics as a tool (e.g. making use of proteomics core facilities). In these cases, it is not uncommon that the biologists (usually the people that do the submissions) don't have much knowledge about the mass spec related information.

ypriverol commented 4 years ago

> I have a question about mandatory fields, especially for human tissues. I have looked at a few annotated projects, and it seems that in some (maybe even lots) of them the fields sex, age, development stage and/or ethnic group are 'not available', while fields such as modifications, enzyme, instrument, tolerance and technical replicate are almost always filled. As a person who sometimes deals with experiments, I can easily imagine that you don't really know the donor's age or ethnic group (or cannot tell anybody), but I doubt you don't know the enzyme or the instrument you used. Moreover, for most data analysis you need the enzyme and modifications rather than the donor's age. So I think it would be very useful to make more fields mandatory. I understand that it could be more difficult for people to fill in more mandatory fields, but an annotation without such information cannot really be used.

@lisavetasol

1- Most of these annotations have been done post-submission, so it is difficult for us to find the tissue or ethnic group when annotating. But in the future I think this information will be more available during submission, and the biologist will be able to annotate it.

2- I think making fields mandatory will depend on how available this data turns out to be during submission. In addition, we can make our validation tools strongly recommend this information. As I mentioned before, this is still under development, meaning we need to wait and see how biologists and mass spectrometrists adopt the file format; that will make it easier to decide what will be mandatory and what should only be recommended.

lisavetasol commented 4 years ago

> There is an increasing number of biological groups that use proteomics as a tool (e.g. making use of proteomics core facilities). In these cases, it is not uncommon that the biologists (usually the people that do the submissions) don't have much knowledge about the mass spec related information.

Absolutely true, people may not know it, but it's much easier for them to check how it was done than for people who want to analyse this data and have to open the raw files, check what the instrument was and so on, and then do an "open search" to find out what modifications should be used.

> 2- I think making fields mandatory will be a matter of how during the submission of the data we see how this data is available.

Not sure, but it seems that if there is no information about how the mass spec was done, such an annotation is of no use for analysing the data.

I understand that it's still under development; it's just that the set of mandatory fields is a bit strange. You assumed that 'biologists' know comment[fraction identifier] and comment[label], but don't know what enzyme or instrument was used?

levitsky commented 4 years ago

My two cents as a personal perspective (TLDR: I agree with Lisa here):

I think the value and the purpose of PRIDE itself, or any other repository for proteomics data, is to enable re-processing of data. In shotgun proteomics, for almost any such processing it is essential to know the basic experimental details (instrument type, enzyme etc., as well as annotation of technical replicates and such). Also, at least in principle, this information is always available to the submitter.

Since the annotation project (in my understanding) aims at facilitating the reprocessing of PRIDE data, I think it makes sense that the mandatory fields are those most necessary for reprocessing in the most general context. Like Lisa said, without instrument type or enzyme almost no reprocessing is possible, so I think these columns beat ethnic group or sex in importance, and should be strongly considered for making them mandatory.

As the focus shifts to annotation during submission, perhaps the process can be optimized, maybe even automated (e.g. for instrument type). This information can in theory be extracted from the files directly by the submission tool or on the server.

ypriverol commented 4 years ago

Your proposal is to make the following fields:

- enzyme
- instrument

required?

I agree in principle, and we already request the instrument during the submission process. If everyone agrees, we can make them required.

levitsky commented 4 years ago

To try and sum up the proposal:

- enzyme - make required
- instrument - make required (may be automated at submission?)
- MS2 mass analyzer - make required (may be automated at submission?)
- technical replicate - make required OR something like "required if applicable", should be the same logic as for fraction identifier (which is now required)
- alkylation agent - add field (probably needs a separate issue here) and make something like "required/strongly advised if not standard (IAA, ClAA)"
- fragmentation method - make required OR something like "strongly advised if not standard (CID/HCD)"
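
As a rough illustration only, the proposal above could be written down as data so that tools can query it. The enum, the dictionary and the helper below are hypothetical names, the keys are plain field names rather than actual column headers, and where the proposal offers two alternatives ("make required OR ...") the softer one is assumed.

```python
from enum import Enum


class Requirement(Enum):
    REQUIRED = "required"
    REQUIRED_IF_APPLICABLE = "required if applicable"
    ADVISED_IF_NOT_STANDARD = "strongly advised if not standard"


# Hypothetical encoding of the proposal; nothing here is part of the specification.
PROPOSAL = {
    "enzyme": Requirement.REQUIRED,
    "instrument": Requirement.REQUIRED,
    "MS2 mass analyzer": Requirement.REQUIRED,
    "technical replicate": Requirement.REQUIRED_IF_APPLICABLE,
    "alkylation agent": Requirement.ADVISED_IF_NOT_STANDARD,
    "fragmentation method": Requirement.ADVISED_IF_NOT_STANDARD,
}


def fields_with(level):
    """Return the field names assigned to the given requirement level, sorted."""
    return sorted(name for name, req in PROPOSAL.items() if req is level)
```

Keeping the levels in one table makes it cheap to move a field between levels later, which matters while the set of mandatory fields is still being discussed.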

levitsky commented 4 years ago

Side note on the current mandatory columns: of 48 annotations currently in master, 44 pertain to human samples, and:

- only 11 contain ethnic group
- only 5 contain cell type

(To compare, the non-mandatory "instrument" is present in all annotations.)

This raises two questions:

- Does it make sense to keep them formally mandatory if they are barely used?
- Should the presence of mandatory columns be checked on validation?
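
For the second question, a presence check could look roughly like the sketch below. It assumes the annotation is a tab-separated file with a single header row; the header spellings in `MANDATORY` and the function name are assumptions made for illustration, not taken from the specification or from any existing validator.

```python
import csv
import io

# Assumed header spellings, for illustration only.
MANDATORY = ["characteristics[ethnic group]", "characteristics[cell type]"]


def missing_mandatory(annotation_text, mandatory=MANDATORY):
    """Return the mandatory column headers that are absent from the header row."""
    reader = csv.reader(io.StringIO(annotation_text), delimiter="\t")
    # Compare case-insensitively and ignore surrounding whitespace.
    header = {name.strip().lower() for name in next(reader, [])}
    return [col for col in mandatory if col.lower() not in header]
```

Running this over every annotation in the repository would give counts like the ones quoted above, and the same function could be called from a validation step.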

ypriverol commented 4 years ago

@levitsky remember that all the examples we have here are post-submission annotations, but the current specification is also trying to define how future submissions to PX archives should be performed. The ethnic group is actually quite a common request in genomics and transcriptomics experiments for population studies. We hope this data will be more available during the submission process.

As I mentioned before, we should make instrument and enzyme mandatory as you suggested. Please feel free to open a PR with the changes.

levitsky commented 4 years ago

> @levitsky remember that all the examples we have here are post-submission annotations, but the current specification is also trying to define how future submissions should be performed to PX archives. the ethnic group is actually quite a common request in genomics and transcriptomics experiments for population studies. We hope during the submission process this data will be more available.

Thank you for this comment, and it does answer my first question. But there is also the second one: as a consumer of these annotations, should I not rely on the "mandatory" columns being always present in the file? Or is this (going to be) a violation of the standard to not include them? Do we need to add them to the annotations that omit them?

I started #222 with my suggestions on this.

> As I mentioned before, we should make mandatory instrument and enzyme as you suggested. Please, feel free to do a PR with the changes.

OK. Should I do the same with MS2 mass analyzer and add comments for other columns as I listed?

ypriverol commented 4 years ago

> > @levitsky remember that all the examples we have here are post-submission annotations, but the current specification is also trying to define how future submissions should be performed to PX archives. the ethnic group is actually quite a common request in genomics and transcriptomics experiments for population studies. We hope during the submission process this data will be more available.

> Thank you for this comment, and it does answer my first question. But there is also the second one: as a consumer of these annotations, should I not rely on the "mandatory" columns being always present in the file? Or is this (going to be) a violation of the standard to not include them? Do we need to add them to the annotations that omit them?

Mandatory fields should always be present, as you point out. If we find datasets without these columns, it should be reported. One thing that is not clear in the specification or in the present discussion is whether we will accept the value "not available" for those columns. This is what is happening now in most cases: we have the column but no value, because our curators can't find it in the papers or submissions. I hope this information will be available when users (biologists) submit the data to PRIDE.

> I started #222 with my suggestions on this.

> > As I mentioned before, we should make mandatory instrument and enzyme as you suggested. Please, feel free to do a PR with the changes.

> OK. Should I do the same with MS2 mass analyzer and add comments for other columns as I listed?

I suggest making instrument mandatory first, and then in the future we can try MS2 mass analyzer. It would also be great to have a list of fields somewhere with their type (mandatory, optional, recommended); once we expose more users or the community to the specification, we can conclude what they will finally be. What do you think?

levitsky commented 4 years ago

My thoughts:

- we should allow "not available" in mandatory columns for post-submission annotations (don't have another choice, really);
- some of the required fields may still be missing on submission. These will be sample characteristics, but mass spec details can be safely required at all times on submission. Based on this reasoning, some mandatory fields can be allowed to be "not available" at submission, and some not;
- generally I think it's better to decide on mandatory columns as early as possible. Even if we fill it with "not available" in all annotated data sets, it's still better to do now than later, because the bulk of annotated data sets will grow, and so will the pain of updating the annotations to the new standard. For this reason I think if there is a consensus now, it's better to implement it now, or wait a little more for input from the community;
- having a list of mandatory and recommended fields would be good. Optional would probably be the rest of the ontology.
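
The rule in the first two points can be sketched as a small check. This is only one possible reading: the set of mass spec fields, the function name and the `at_submission` flag are all hypothetical, and the real grouping of columns would have to come from the specification.

```python
NOT_AVAILABLE = "not available"

# Hypothetical set of mass spec fields that must always carry a real value at submission.
MASS_SPEC_FIELDS = {"enzyme", "instrument"}


def value_is_acceptable(field, value, at_submission):
    """Decide whether a value in a mandatory column passes under the rule above."""
    value = value.strip()
    if not value:
        return False  # an empty cell is never acceptable in a mandatory column
    if value.lower() != NOT_AVAILABLE:
        return True  # any real value passes this check
    if not at_submission:
        return True  # post-submission annotation: "not available" is tolerated
    # At submission, only non mass spec fields (sample characteristics) may be unknown.
    return field not in MASS_SPEC_FIELDS
```

For example, `value_is_acceptable("enzyme", "not available", at_submission=True)` is rejected, while the same value for `"age"` is accepted.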

ypriverol commented 4 years ago

> My thoughts:
>
> we should allow "not available" in mandatory columns for post-submission annotations (don't have another choice, really);

👍

> some of the required fields may still be missing on submission. These will be sample characteristics, but mass spec details can be safely required at all times on submission. Based on this reasoning, some mandatory fields can be allowed to be "not available" at submission, and some not;

> generally I think it's better to decide on mandatory columns as early as possible. Even if we fill it with "not available" in all annotated data sets, it's still better to do now than later, because the bulk of annotated data sets will grow, and so will the pain of updating the annotations to the new standard. For this reason I think if there is a consensus now, it's better to implement it now, or wait a little more for input from the community;

Agree. The only column we still have doubts about is MS2 mass analyzer.

> having a list of mandatory and recommended fields would be good. Optional would probably be the rest of the ontology.

👍

javizca commented 4 years ago

I think MS2 mass analyser is "too technical" information to become mandatory, for the reasons mentioned before. The instrument name/model (which is mandatory) should be enough in most cases to understand the mass analyser/s that it contains. Second, there is usually more than one mass analyser, and their annotation will be inconsistent. This was the case in the old PRIDE XML format. Rather than being annotated manually, this information should be extracted from the raw files if possible.

lisavetasol commented 4 years ago

@javizca

> I think MS2 mass analyser is "too technical" information to become mandatory for the reasons mentioned before. The instrument name/model (which is mandatory) should be enough in most cases to understand the mass analyser/s that it contains.

The reason for MS2 mass analyzer is exactly that the instrument name/model often isn't enough to understand where fragment ions were measured. For example, in lots of Orbitrap instruments, there is a possibility to measure fragments in the Orbitrap analyzer as well as in the Ion trap analyzer. And that's quite important to know for a proper mass tolerance in the search parameters (roughly 0.01 Da vs 0.1 Da). Some software tools work specifically with high-resolution MS/MS, and MS2 mass analyzer makes all the difference.
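
To make the point concrete, here is a minimal sketch of how a reprocessing script might pick the fragment tolerance from the MS2 analyzer. The two tolerances are the rough figures quoted above, not tuned recommendations, and the analyzer labels and function name are assumptions.

```python
# Rough values from the comment above; real searches would tune these.
HIGH_RES_TOLERANCE_DA = 0.01
LOW_RES_TOLERANCE_DA = 0.1

# Assumed labels; an annotation would use the ontology term names instead.
HIGH_RES_ANALYZERS = {"orbitrap"}
LOW_RES_ANALYZERS = {"ion trap"}


def fragment_tolerance_da(ms2_analyzer):
    """Return a rough fragment mass tolerance in Da for the given MS2 analyzer."""
    key = ms2_analyzer.strip().lower()
    if key in HIGH_RES_ANALYZERS:
        return HIGH_RES_TOLERANCE_DA
    if key in LOW_RES_ANALYZERS:
        return LOW_RES_TOLERANCE_DA
    # Without the analyzer there is no safe default, which is the point being made.
    raise ValueError(f"unknown MS2 mass analyzer: {ms2_analyzer!r}")
```

Given only an instrument model that supports both analyzers, this function has nothing to go on, so the script would have to open the raw files instead.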

> Second, usually there are more than one Mass Analyser and their annotation will be inconsistent. This was the case in the old PRIDE XML format. Rather than annotated manually, this information should be extracted if possible from the raw files.

I agree, this information for sure could be extracted from raw files (as well as instrument name btw, which is okay to be mandatory). In my opinion it's actually an argument in favor of making it mandatory, because it means that it will not be so burdensome for the submitter. Even if it's not automatic, there can be some rules in the annotation tool that set the MS2 analyzer automatically or limit the selection to a couple of options when instrument is set. However, I don't quite understand why it would be inconsistent. There is a nice ontology for mass analyzers, and it's even already added as CV Term in the PRIDE Ontology (issue #168)
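
A sketch of the kind of rule mentioned above, where choosing the instrument either presets the MS2 analyzer or narrows the choice. The mapping is a tiny, incomplete illustration with assumed labels, and `suggest_ms2_analyzer` is a hypothetical helper, not part of any existing annotation tool.

```python
# Illustrative and incomplete; labels are assumptions, not ontology terms.
MS2_OPTIONS = {
    "Q Exactive": ["orbitrap"],
    "LTQ Orbitrap Velos": ["orbitrap", "ion trap"],
    "Orbitrap Fusion": ["orbitrap", "ion trap"],
}


def suggest_ms2_analyzer(instrument):
    """Return (preset, options): a preset if only one analyzer is possible, plus the allowed options."""
    options = MS2_OPTIONS.get(instrument, [])
    preset = options[0] if len(options) == 1 else None
    return preset, options
```

With a rule like this the submitter answers at most one extra question, and only for instruments where the answer is genuinely ambiguous.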

ypriverol commented 4 years ago

@levitsky @lisavetasol While I fully understand your point, I really believe that making this field mandatory for submissions will be a lot of work for submitters. I vote to make this field optional for now; of course, all your annotated projects will probably have it, and I will make sure that our annotators also add it. But we should wait before making this mandatory for all submissions. I actually removed other required columns in a recent commit (https://github.com/bigbio/proteomics-metadata-standard/pull/214/commits/e66a7b983b1e383140de33c806183f34aa277d97).

For the first release of the file format, I vote to require only the bare minimum: the information reported in the Methods section of papers and the experimental design (sample -> raw). What do you think, @levitsky @lisavetasol?

levitsky commented 4 years ago

I do of course accept this decision, although my thinking is a bit different.

I agree that it's important to keep in mind the load on the submitter, but the standard should reflect what is the most important for reanalysis. There are ways to make it easier for the submitter other than just not requiring it.

Anyway, I hope that those ways will be implemented and most of the data end up annotated with MS2 analyzer. Meanwhile, we can move on :) Thank you.

ypriverol commented 4 years ago

Thanks. @levitsky, can you please add a paragraph about the MS2 analyzer to the specification, explaining why it is important? It can be a section after the instrument information.

bigbio / proteomics-sample-metadata

The mandatory fields for "Human: All tissue-based experiments" #218