MoseleyBioinformaticsLab / MESSES

MESSES (Metadata from Experimental SpreadSheets Extraction System) is a Python package that facilitates the conversion of tabular data into other formats.
https://moseleybioinformaticslab.github.io/MESSES/

ISA Assay Creation Issues #24

Open ptth222 opened 1 year ago

ptth222 commented 1 year ago

You may want to view this on GitHub since I have embedded tables and such. https://github.com/MoseleyBioinformaticsLab/MESSES/issues

I am going to start with a short description of how ISA assays work. I am going to work from the tab format because it is easier to understand, but the issues exist in the JSON version as well.

Here is an example ISA assay tab file:

| Sample Name | Protocol REF | Extract Name | Protocol REF | Labeled Extract Name | Label | MS Assay Name | Comment[PRIDE Accession] | Comment[PRIDE Processed Data Accession] | Raw Spectral Data File | Normalization Name | Protein Assignment File | Peptide Assignment File | Post Translational Modification Assignment File | Data Transformation Name | Derived Spectral Data File |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S-0.1-aliquot11 | protein extraction | S-0.1 | ITRAQ labeling | JC_S-0.1 | iTRAQ reagent 117 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml |
| C-0.1-aliquot11 | protein extraction | C-0.1 | ITRAQ labeling | JC_C-0.1 | iTRAQ reagent 116 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml |
| N-0.1-aliquot11 | protein extraction | N-0.1 | ITRAQ labeling | JC_N-0.1 | iTRAQ reagent 115 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml |
| S-0.1-aliquot11 | protein extraction | S-0.1 | ITRAQ labeling | Pool1 | iTRAQ reagent 114 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml |
| C-0.1-aliquot11 | protein extraction | C-0.1 | ITRAQ labeling | Pool1 | iTRAQ reagent 114 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml |
| N-0.1-aliquot11 | protein extraction | N-0.1 | ITRAQ labeling | Pool1 | iTRAQ reagent 114 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml |
| C-0.2-aliquot11 | protein extraction | C-0.2 | ITRAQ labeling | JC_C-0.2 | iTRAQ reagent 117 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml |
| N-0.2-aliquot11 | protein extraction | N-0.2 | ITRAQ labeling | JC_N-0.2 | iTRAQ reagent 116 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml |
| P-0.1-aliquot11 | protein extraction | P-0.1 | ITRAQ labeling | JC_P-0.1 | iTRAQ reagent 115 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml |
| C-0.2-aliquot11 | protein extraction | C-0.2 | ITRAQ labeling | Pool2 | iTRAQ reagent 114 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml |
| N-0.2-aliquot11 | protein extraction | N-0.2 | ITRAQ labeling | Pool2 | iTRAQ reagent 114 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml |
| P-0.1-aliquot11 | protein extraction | P-0.1 | ITRAQ labeling | Pool2 | iTRAQ reagent 114 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml |
| P-0.2-aliquot11 | protein extraction | P-0.2 | ITRAQ labeling | JC_P-0.2 | iTRAQ reagent 116 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml |
| S-0.2-aliquot11 | protein extraction | S-0.2 | ITRAQ labeling | JC_S-0.2 | iTRAQ reagent 115 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml |
| P-0.2-aliquot11 | protein extraction | P-0.2 | ITRAQ labeling | Pool3 | iTRAQ reagent 117 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml |
| S-0.2-aliquot11 | protein extraction | S-0.2 | ITRAQ labeling | Pool3 | iTRAQ reagent 117 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml |
| P-0.2-aliquot11 | protein extraction | P-0.2 | ITRAQ labeling | Pool3 | iTRAQ reagent 114 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml |
| S-0.2-aliquot11 | protein extraction | S-0.2 | ITRAQ labeling | Pool3 | iTRAQ reagent 114 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml |

The order of the columns matters. The very first column must be a Sample Name column, and there can be no other Sample Name columns. Entities that derive from one another are called "extracts" in an assay. In this example a sample becomes an extract after the "protein extraction" protocol, and the extract becomes a labeled extract after the "ITRAQ labeling" protocol. ISA calls the "Protocol REF" columns "process nodes", but there are other process nodes as well: the "MS Assay Name", "Normalization Name", and "Data Transformation Name" columns. The difference is that Protocol REF process nodes reference a protocol, while the values under the other columns only name the process, not a protocol. It seems to me that ISA essentially distinguishes between actions done on physical entities and actions done on data. Actions on physical entities have an associated protocol, but actions on data don't. They don't expressly say that, but that's what it looks like based on the example. I also think it would be valid to change "MS Assay Name" to a "Protocol REF" and create an "MS Assay" protocol if someone wanted. That would be the only way to give a description of the "MS Assay" process.
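The column rules above can be sketched roughly as follows. This is only an illustration based on the one example table: `classify_column` and `check_sample_name` are hypothetical helpers (not part of MESSES or any ISA library), and treating every other "... Name" column as a named process is an assumption, not something the ISA documentation states.

```python
# Hypothetical helpers for illustration; not part of MESSES or an ISA library.
MATERIAL_NODES = {"Sample Name", "Extract Name", "Labeled Extract Name"}


def classify_column(header):
    """Guess the node type of an assay column from its header text."""
    if header in MATERIAL_NODES:
        return "material"
    if header == "Protocol REF":
        return "protocol process"
    if header.endswith(" Name"):
        # Assumption: any other "... Name" column names a process.
        return "named process"
    if header.endswith(" File"):
        return "data file"
    return "attribute"  # e.g. Label or Comment[...]


def check_sample_name(headers):
    """Return a list of problems with the Sample Name column placement."""
    problems = []
    if not headers or headers[0] != "Sample Name":
        problems.append("first column must be Sample Name")
    if headers.count("Sample Name") > 1:
        problems.append("only one Sample Name column is allowed")
    return problems


# Header row of the example assay above.
HEADERS = [
    "Sample Name", "Protocol REF", "Extract Name", "Protocol REF",
    "Labeled Extract Name", "Label", "MS Assay Name",
    "Comment[PRIDE Accession]", "Comment[PRIDE Processed Data Accession]",
    "Raw Spectral Data File", "Normalization Name",
    "Protein Assignment File", "Peptide Assignment File",
    "Post Translational Modification Assignment File",
    "Data Transformation Name", "Derived Spectral Data File",
]

kinds = [classify_column(h) for h in HEADERS]
for header, kind in zip(HEADERS, kinds):
    print(f"{header}: {kind}")
print(check_sample_name(HEADERS))
```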

An important thing to note about all of this is that each sample/extract can have only 1 process/protocol applied to it at a time. This is an issue because we allow "protocol.id" to be a list field for entities. I think we might have to enforce 1 protocol for entities for ISA conversions. This restriction isn't just for assays, it applies to study processes as well.
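If we did enforce 1 protocol per entity, a check along these lines could flag the offending records before an ISA conversion. This is a minimal sketch: `entities_with_multiple_protocols` is a hypothetical helper, and the entity IDs in the example are made up.

```python
def entities_with_multiple_protocols(entities):
    """Return {entity_id: protocols} for entities listing more than one protocol."""
    flagged = {}
    for entity_id, record in entities.items():
        protocols = record.get("protocol.id", [])
        if isinstance(protocols, str):
            # Treat a bare string as a single-protocol list.
            protocols = [protocols]
        if len(protocols) > 1:
            flagged[entity_id] = list(protocols)
    return flagged


# Made-up entity records for illustration.
example_entities = {
    "subject1": {"protocol.id": ["allogenic"]},
    "sample1": {"protocol.id": ["tissue_quench", "frozen_tissue_grind"]},
    "sample2": {"protocol.id": "polar_extraction"},
    "sample3": {},
}

flagged = entities_with_multiple_protocols(example_entities)
print(flagged)
```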

Also note that ISA actually shows analysis-type process steps, whereas we typically don't. It isn't required, but we may want to think about adding some "analytical" type protocols if people want to specify analysis steps the way ISA shows here.

One issue is deciding where in the sample/extract chain to create the assay. This example starts just before protein extraction, but we could make one as simple as 3 columns. For example:

| Sample Name | Protocol REF | Raw Spectral Data File |
|---|---|---|
| Sample1 | ICMS1 | File1 |
| Sample2 | ICMS1 | File2 |
| Sample3 | ICMS1 | File3 |

This is simple to do and would just require looking at the measurement protocol, but if we wanted to start sooner we would have to go to the measurement entity and then just go back up the lineage to some point. Deciding where to stop could be difficult. Maybe just 1 hop up the lineage.
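Both options can be sketched in a few lines. This assumes entities point to their parent through a `parent_id` field and that the measured sample, measurement protocol, and raw file are already known as tuples. `simple_assay_rows` and `lineage` are hypothetical helpers, and `max_hops=1` is just the "1 hop up the lineage" idea, not a decision.

```python
def simple_assay_rows(measured):
    """Build tab-separated rows for the 3-column assay.

    measured: iterable of (sample, protocol, raw_file) string tuples.
    """
    rows = [("Sample Name", "Protocol REF", "Raw Spectral Data File")]
    rows.extend(measured)
    return ["\t".join(row) for row in rows]


def lineage(entities, entity_id, max_hops=1):
    """Return [entity, parent, grandparent, ...], following at most max_hops parents."""
    chain = [entity_id]
    current = entity_id
    for _ in range(max_hops):
        parent = entities.get(current, {}).get("parent_id")
        if parent is None or parent not in entities:
            break
        chain.append(parent)
        current = parent
    return chain


# Made-up entities for illustration.
example_entities = {
    "subject1": {},
    "tissue1": {"parent_id": "subject1"},
    "Sample1": {"parent_id": "tissue1"},
}

rows = simple_assay_rows([("Sample1", "ICMS1", "File1")])
print("\n".join(rows))
print(lineage(example_entities, "Sample1"))
print(lineage(example_entities, "Sample1", max_hops=5))
```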

To summarize:

  1. Do we limit protocols to 1 per entity for ISA submissions, or come up with something else?
  2. Do we add analytical type protocols or keep that as part of the measurement?
  3. Is the simple 3 column assay good enough?

hunter-moseley commented 1 year ago

We should talk this out, but here are my suggestions after first reading these issues.

  1. Do we limit protocols to 1 per entity for ISA submissions, or come up with something else? Suggestion: We could create a composite protocol so that each entity lists only one protocol, with that protocol referring to a list of protocols. Not exactly sure how this would work. It depends on how protocols are described.

  2. Do we add analytical type protocols or keep that as part of the measurement? Suggestion: We can create the analytical protocols from measurement protocols. I think this is really about how to convert the measurement protocol into ISAtab format.

  3. Is the simple 3 column assay good enough? Suggestion: Probably not. I think we would include sample prep protocols that create extracts (derivative samples) and then the analytical/measurement protocol that creates/measures the data. At least that more closely matches the original example.

ptth222 commented 1 year ago

We had some issues talking about the last part due to time and not having examples on hand so I am going to try and put those here. Link to the issue: https://github.com/MoseleyBioinformaticsLab/MESSES/issues/24

Example of sample lineage for ICMS measurement:

"15_C1-20_allogenic_7days_UKy_GCH_rep3": {
  "id": "15_C1-20_allogenic_7days_UKy_GCH_rep3",
  "protocol.id": [
    "allogenic"
  ],
  "replicate": "3",
  "species": "Mus musculus",
  "species_type": "Mouse",
  "taxonomy_id": "10090",
  "time_point": "7",
  "type": "subject"
},
"15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3": {
  "id": "15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3",
  "parent_id": "15_C1-20_allogenic_7days_UKy_GCH_rep3",
  "protocol.id": [
    "mouse_tissue_collection",
    "tissue_quench",
    "frozen_tissue_grind"
  ],
  "type": "sample"
},
"15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICMS_A": {
  "id": "15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICMS_A",
  "injection_volume": "10",
  "injection_volume%units": "uL",
  "parent_id": "15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3",
  "polar_split_ratio": "0.143267710878",
  "protocol.id": [
    "polar_extraction",
    "IC-FTMS_preparation"
  ],
  "reconstitution_volume": "20",
  "reconstitution_volume%units": "uL",
  "replicate": "1",
  "replicate%type": "analytical",
  "type": "sample",
  "weight": "0.1994",
  "weight%units": "g"
}

Short form: 15_C1-20_allogenic_7days_UKy_GCH_rep3 -> 15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3 -> 15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICMS_A

Protocols: allogenic -> mouse_tissue_collection -> tissue_quench -> frozen_tissue_grind -> polar_extraction -> IC-FTMS_preparation -> ICMS1

Combined: 15_C1-20_allogenic_7days_UKy_GCH_rep3 -> allogenic -> 15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3 -> mouse_tissue_collection -> tissue_quench -> frozen_tissue_grind -> 15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICMS_A -> polar_extraction -> IC-FTMS_preparation -> ICMS1
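For reference, the combined chain can be derived mechanically from the entity records by walking `parent_id` up to the root and then listing each entity followed by its own protocols. The sketch below trims the records to the fields it needs. `combined_chain` is a hypothetical helper, and the measurement protocol is passed in separately because it lives on the measurement records rather than on the entities.

```python
def combined_chain(entities, leaf_id, measurement_protocols=()):
    """Interleave entities (root first) with the protocols listed on each one."""
    path = []
    seen = set()
    current = leaf_id
    # Walk up the lineage; stop at the root, a missing parent, or a cycle.
    while current in entities and current not in seen:
        seen.add(current)
        path.append(current)
        current = entities[current].get("parent_id")
    path.reverse()

    chain = []
    for entity_id in path:
        chain.append(entity_id)
        chain.extend(entities[entity_id].get("protocol.id", []))
    chain.extend(measurement_protocols)
    return chain


SUBJECT = "15_C1-20_allogenic_7days_UKy_GCH_rep3"
TISSUE = "15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3"
EXTRACT = "15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICMS_A"

# The three records from the example, trimmed to lineage and protocol fields.
entities = {
    SUBJECT: {"protocol.id": ["allogenic"]},
    TISSUE: {
        "parent_id": SUBJECT,
        "protocol.id": ["mouse_tissue_collection", "tissue_quench", "frozen_tissue_grind"],
    },
    EXTRACT: {
        "parent_id": TISSUE,
        "protocol.id": ["polar_extraction", "IC-FTMS_preparation"],
    },
}

chain = combined_chain(entities, EXTRACT, measurement_protocols=["ICMS1"])
print(" -> ".join(chain))
```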

Sample inheritance example from ISA:

culture12 -> S-0.2-aliquot11 -> S-0.2 -> JC_S-0.2
culture12 -> S-0.2-aliquot11 -> S-0.2 -> Pool3

Protocols/Process: growth protocol -> protein extraction -> iTRAQ labeling -> norm3 -> datatransformation3

Combined: culture12 -> growth protocol -> S-0.2-aliquot11 -> protein extraction -> S-0.2 -> iTRAQ labeling -> JC_S-0.2/Pool3 -> norm3 -> datatransformation3

How ISA breaks them up:
Study: culture12 -> growth protocol -> S-0.2-aliquot11
Assay: S-0.2-aliquot11 -> protein extraction -> S-0.2 -> iTRAQ labeling -> JC_S-0.2/Pool3 -> norm3 -> datatransformation3

Note that ISA has a processSequence, and samples/files are attributes of the process, whereas our system is more entity-focused: protocols are attributes of the entities.

Collection protocols are a little strange. Most protocols describe what happened to the entity they are listed on, but a collection protocol is listed on the entity that resulted from the collection. For ISA the process/protocol has inputs and outputs, so this ambiguity doesn't exist. We might have to move collection protocols to the input entity, or handle them specially for ISA and combine them with the protocols on the preceding entity, because their inputs and outputs don't align with those of the other protocols. mouse_tissue_collection has the mouse as input and the organ as output, but tissue_quench and frozen_tissue_grind have the organ as input and the ground-up organ as output.

I just realized another issue now. We put the measurement protocol on the measurement records and not the entity directly, but for ISA there are no measurements. That is to say they don't have any specific file or format for measurements. You basically just describe the protocols and list the files as outputs and then those files serve as the measurements. You don't have to pick one measurement like the Workbench makes you do and then put it in a certain format. I think the easiest thing to do is to just put the measurement protocol on the entity for ISA submissions.

It still doesn't seem obvious to me where you break the entity/protocol chain into study and assay, but we can use this as context for our next meeting.

hunter-moseley commented 1 year ago

Looks like we can use or create parent-child relationships in combination with a protocol to create the equivalent ISA input -> protocol -> output logic. In certain circumstances, we may need to create dummy output entities to create a linear chain. The collection protocols can use parent_id as input and the actual entity ID as the output.
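One possible reading of this, sketched under assumptions: every protocol listed on an entity becomes a step in a chain from its parent to the entity itself, with dummy entities filling the gaps between steps. `to_processes` and the `__after_` naming of dummy entities are hypothetical, the entity IDs are made up, and a root entity simply gets `None` as its first input.

```python
def to_processes(entities, entity_id):
    """Turn one entity's protocol list into input -> protocol -> output steps."""
    record = entities[entity_id]
    protocols = record.get("protocol.id", [])
    processes = []
    current_input = record.get("parent_id")  # None for a root entity
    for index, protocol in enumerate(protocols):
        is_last = index == len(protocols) - 1
        # Intermediate outputs are dummy entities; the last output is the real one.
        output = entity_id if is_last else f"{entity_id}__after_{protocol}"
        processes.append(
            {"input": current_input, "protocol": protocol, "output": output}
        )
        current_input = output
    return processes


# Made-up IDs; the protocol names come from the ICMS example.
example_entities = {
    "mouse1": {"protocol.id": ["allogenic"]},
    "mouse1_colon": {
        "parent_id": "mouse1",
        "protocol.id": ["mouse_tissue_collection", "tissue_quench", "frozen_tissue_grind"],
    },
}

steps = to_processes(example_entities, "mouse1_colon")
for step in steps:
    print(f"{step['input']} -> {step['protocol']} -> {step['output']}")
```

With this reading, the collection protocol takes the parent as input, as suggested, but its output is a dummy entity rather than the real one whenever more protocols follow it on the same entity.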

Please correct me if I am missing something.

Also, it looks like ISA study ends with a collected sample (aliquot in this example) and the ISA assay begins with the same collected sample (aliquot in this example).
