ISA Assay Creation Issues

ptth222 commented 1 year ago

You may want to view this on GitHub since I have embedded tables and such. https://github.com/MoseleyBioinformaticsLab/MESSES/issues

I am going to start with a short description of how ISA assays work. I am going to work from the tab format because it is easier to understand, but the issues exist in the JSON version as well.

Here is an example ISA assay tab file:

Sample Name	Protocol REF	Extract Name	Protocol REF	Labeled Extract Name	Label	MS Assay Name	Comment[PRIDE Accession]	Comment[PRIDE Processed Data Accession]	Raw Spectral Data File	Normalization Name	Protein Assignment File	Peptide Assignment File	Post Translational Modification Assignment File	Data Transformation Name	Derived Spectral Data File
S-0.1-aliquot11	protein extraction	S-0.1	ITRAQ labeling	JC_S-0.1	iTRAQ reagent 117	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml
C-0.1-aliquot11	protein extraction	C-0.1	ITRAQ labeling	JC_C-0.1	iTRAQ reagent 116	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml
N-0.1-aliquot11	protein extraction	N-0.1	ITRAQ labeling	JC_N-0.1	iTRAQ reagent 115	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml
S-0.1-aliquot11	protein extraction	S-0.1	ITRAQ labeling	Pool1	iTRAQ reagent 114	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml
C-0.1-aliquot11	protein extraction	C-0.1	ITRAQ labeling	Pool1	iTRAQ reagent 114	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml
N-0.1-aliquot11	protein extraction	N-0.1	ITRAQ labeling	Pool1	iTRAQ reagent 114	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml
C-0.2-aliquot11	protein extraction	C-0.2	ITRAQ labeling	JC_C-0.2	iTRAQ reagent 117	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml
N-0.2-aliquot11	protein extraction	N-0.2	ITRAQ labeling	JC_N-0.2	iTRAQ reagent 116	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml
P-0.1-aliquot11	protein extraction	P-0.1	ITRAQ labeling	JC_P-0.1	iTRAQ reagent 115	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml
C-0.2-aliquot11	protein extraction	C-0.2	ITRAQ labeling	Pool2	iTRAQ reagent 114	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml
N-0.2-aliquot11	protein extraction	N-0.2	ITRAQ labeling	Pool2	iTRAQ reagent 114	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml
P-0.1-aliquot11	protein extraction	P-0.1	ITRAQ labeling	Pool2	iTRAQ reagent 114	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml
P-0.2-aliquot11	protein extraction	P-0.2	ITRAQ labeling	JC_P-0.2	iTRAQ reagent 116	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml
S-0.2-aliquot11	protein extraction	S-0.2	ITRAQ labeling	JC_S-0.2	iTRAQ reagent 115	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml
P-0.2-aliquot11	protein extraction	P-0.2	ITRAQ labeling	Pool3	iTRAQ reagent 117	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml
S-0.2-aliquot11	protein extraction	S-0.2	ITRAQ labeling	Pool3	iTRAQ reagent 117	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml
P-0.2-aliquot11	protein extraction	P-0.2	ITRAQ labeling	Pool3	iTRAQ reagent 114	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml
S-0.2-aliquot11	protein extraction	S-0.2	ITRAQ labeling	Pool3	iTRAQ reagent 114	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml

The order of the columns matters. The very first column must be a Sample Name column, and there can be no other Sample Name columns. When entities derive from one another, the derived ones are called "extracts" in an assay. In this example a sample becomes an extract after the "protein extraction" protocol, and then a labeled extract after the "ITRAQ labeling" protocol. ISA calls the "Protocol REF" columns "process nodes", but they are not the only process nodes. The "MS Assay Name", "Normalization Name", and "Data Transformation Name" columns are process nodes too, with one difference: a Protocol REF node references a protocol, while the values under the other columns only name the process, not a protocol. It seems to me that ISA essentially distinguishes between actions done on physical entities, which have an associated protocol, and actions done on data, which don't. They don't expressly say that, but that's what it looks like based on the example. I also think it would be valid to change "MS Assay Name" to a "Protocol REF" and create an "MS Assay" protocol if someone wanted. That would be the only way to give a description of the "MS Assay" process.
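A minimal Python sketch of the column roles described above. The groupings (which headers count as materials, which as named processes) come only from this example table and my reading of it, not from the ISA specification, and the function names are made up for illustration.

```python
import csv
import io

# Headers this discussion treats as process nodes that carry only a name (no protocol).
NAMED_PROCESS_HEADERS = {"MS Assay Name", "Normalization Name", "Data Transformation Name"}
# Headers that name a material (the sample or one of its extracts).
MATERIAL_HEADERS = {"Sample Name", "Extract Name", "Labeled Extract Name"}


def classify_column(header: str) -> str:
    """Return a rough role for one assay column header (assumed grouping)."""
    if header == "Protocol REF":
        return "protocol process"
    if header in NAMED_PROCESS_HEADERS:
        return "named process"
    if header in MATERIAL_HEADERS:
        return "material"
    if header.endswith(" File"):
        return "data file"
    return "attribute"  # e.g. Label, Comment[...]


def check_sample_name_column(headers: list[str]) -> list[str]:
    """Return the problems found with the Sample Name column placement."""
    problems = []
    if not headers or headers[0] != "Sample Name":
        problems.append("first column must be 'Sample Name'")
    if headers.count("Sample Name") > 1:
        problems.append("only one 'Sample Name' column is allowed")
    return problems


def read_headers(tab_text: str) -> list[str]:
    """Read the header row of a tab-delimited assay table."""
    return next(csv.reader(io.StringIO(tab_text), delimiter="\t"))


# A shortened version of the header row from the example above.
example = (
    "Sample Name\tProtocol REF\tExtract Name\tProtocol REF\t"
    "Labeled Extract Name\tLabel\tMS Assay Name\tRaw Spectral Data File\n"
)
headers = read_headers(example)
roles = [classify_column(h) for h in headers]
```

Reading the roles left to right gives the alternation of materials and processes that the column order encodes.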

An important thing to note about all of this is that each sample/extract can have only 1 process/protocol applied to it at a time. This is an issue because we allow "protocol.id" to be a list field for entities. I think we might have to enforce 1 protocol for entities for ISA conversions. This restriction isn't just for assays, it applies to study processes as well.
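As a rough illustration, a check like the following could flag the entities that would violate a one-protocol rule before an ISA conversion. It assumes entities are stored as a dict of records keyed by id with a "protocol.id" field that may be a string or a list; the function name and the sample data are hypothetical.

```python
def entities_with_multiple_protocols(entities: dict) -> dict:
    """Map entity id -> protocol list for every entity listing more than one protocol."""
    flagged = {}
    for entity_id, record in entities.items():
        protocols = record.get("protocol.id", [])
        # "protocol.id" may be a single string or a list; normalize to a list.
        if isinstance(protocols, str):
            protocols = [protocols]
        if len(protocols) > 1:
            flagged[entity_id] = list(protocols)
    return flagged


# Hypothetical records, shaped like our entity table.
example_entities = {
    "subject1": {"protocol.id": ["allogenic"]},
    "sample1": {"protocol.id": ["mouse_tissue_collection", "tissue_quench"]},
    "sample2": {"protocol.id": "polar_extraction"},
}
flagged = entities_with_multiple_protocols(example_entities)
```

Whether flagged entities should be rejected or rewritten into several steps is the open question; this only finds them.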

Also note that they actually show analysis type process steps, whereas we typically don't. It isn't required, but we may want to think about adding some "analytical" type protocols if people want to specify them the way ISA shows here.

One issue is deciding where in the sample/extract chain to create the assay. This example starts just before protein extraction, but we could make one as simple as 3 columns. For example:

Sample Name	Protocol REF	Raw Spectral Data File
Sample1	ICMS1	File1
Sample2	ICMS1	File2
Sample3	ICMS1	File3

This is simple to do and would just require looking at the measurement protocol, but if we wanted to start earlier in the chain we would have to go to the measurement entity and then walk back up its lineage to some point. Deciding where to stop could be difficult. Maybe just 1 hop up the lineage.
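A sketch of how the 3 column version could be produced from measurement records. It assumes each measurement record has "entity.id" and "protocol.id" string fields, and it uses a hypothetical "raw_file" field for the data file name, since we have not settled where the file name would come from.

```python
import csv
import io


def simple_assay_table(measurements: dict, file_field: str = "raw_file") -> str:
    """Build a tab-delimited 3 column assay from measurement records.

    file_field is a hypothetical field name holding the raw data file.
    """
    header = ["Sample Name", "Protocol REF", "Raw Spectral Data File"]
    seen = set()
    rows = []
    for record in measurements.values():
        row = (record["entity.id"], record["protocol.id"], record[file_field])
        # Many measurements share one sample/protocol/file triple; keep it once.
        if row not in seen:
            seen.add(row)
            rows.append(row)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical measurement records.
example_measurements = {
    "m1": {"entity.id": "Sample1", "protocol.id": "ICMS1", "raw_file": "File1"},
    "m2": {"entity.id": "Sample1", "protocol.id": "ICMS1", "raw_file": "File1"},
    "m3": {"entity.id": "Sample2", "protocol.id": "ICMS1", "raw_file": "File2"},
}
table_text = simple_assay_table(example_measurements)
```

Starting earlier in the chain would mean adding material and Protocol REF columns to the left of these, which is where the lineage question comes in.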

To summarize:

Do we limit protocols to 1 per entity for ISA submissions, or come up with something else?
Do we add analytical type protocols or keep that as part of the measurement?
Is the simple 3 column assay good enough?

hunter-moseley commented 1 year ago

We should talk this out, but here are my suggestions after first reading these issues.

Do we limit protocols to 1 per entity for ISA submissions, or come up with something else? Suggestion: We could create a single composed protocol per entity that refers to a list of protocols. Not exactly sure how this would work. It depends on how protocols are described.
Do we add analytical type protocols or keep that as part of the measurement? Suggestion: We can create the analytical protocols from measurement protocols. I think this is really about how to convert the measurement protocol into ISAtab format.
Is the simple 3 column assay good enough? Suggestion: Probably not. I think we would include sample prep protocols that create extracts (derivative samples) and then the analytical/measurement protocol that creates/measures the data. At least that more closely matches the original example.

ptth222 commented 1 year ago

We had some issues talking about the last part due to time and not having examples on hand so I am going to try and put those here. Link to the issue: https://github.com/MoseleyBioinformaticsLab/MESSES/issues/24

Example of sample lineage for ICMS measurement:

"15_C1-20_allogenic_7days_UKy_GCH_rep3": {
      "id": "15_C1-20_allogenic_7days_UKy_GCH_rep3",
      "protocol.id": [
        "allogenic"
      ],
      "replicate": "3",
      "species": "Mus musculus",
      "species_type": "Mouse",
      "taxonomy_id": "10090",
      "time_point": "7",
      "type": "subject"
    },
"15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3": {
      "id": "15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3",
      "parent_id": "15_C1-20_allogenic_7days_UKy_GCH_rep3",
      "protocol.id": [
        "mouse_tissue_collection",
        "tissue_quench",
        "frozen_tissue_grind"
      ],
      "type": "sample"
    },
"15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICMS_A": {
      "id": "15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICMS_A",
      "injection_volume": "10",
      "injection_volume%units": "uL",
      "parent_id": "15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3",
      "polar_split_ratio": "0.143267710878",
      "protocol.id": [
        "polar_extraction",
        "IC-FTMS_preparation"
      ],
      "reconstitution_volume": "20",
      "reconstitution_volume%units": "uL",
      "replicate": "1",
      "replicate%type": "analytical",
      "type": "sample",
      "weight": "0.1994",
      "weight%units": "g"
    }

Short form: 15_C1-20_allogenic_7days_UKy_GCH_rep3 -> 15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3 -> 15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICMS_A

Protocols: allogenic -> mouse_tissue_collection -> tissue_quench -> frozen_tissue_grind -> polar_extraction -> IC-FTMS_preparation -> ICMS1

Combined: 15_C1-20_allogenic_7days_UKy_GCH_rep3 -> allogenic -> 15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3 -> mouse_tissue_collection -> tissue_quench -> frozen_tissue_grind -> 15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICMS_A -> polar_extraction -> IC-FTMS_preparation -> ICMS1

Sample inheritance example from ISA:

culture12 -> S-0.2-aliquot11 -> S-0.2 -> JC_S-0.2
culture12 -> S-0.2-aliquot11 -> S-0.2 -> Pool3

Protocols/Process: growth protocol -> protein extraction -> iTRAQ labeling -> norm3 -> datatransformation3

Combined: culture12 -> growth protocol -> S-0.2-aliquot11 -> protein extraction -> S-0.2 -> iTRAQ labeling -> JC_S-0.2/Pool3 -> norm3 -> datatransformation3

How ISA breaks them up:
Study: culture12 -> growth protocol -> S-0.2-aliquot11
Assay: S-0.2-aliquot11 -> protein extraction -> S-0.2 -> iTRAQ labeling -> JC_S-0.2/Pool3 -> norm3 -> datatransformation3

Note that ISA has a processSequence, and samples/files are attributes of the process, whereas our system is more entity focused and protocols are attributes on the entities.

Collection protocols are a little strange. Most protocols describe what happened to the entity they are attached to, but a collection protocol is attached to the entity that resulted from the collection. In ISA the process/protocol has inputs and outputs, so this ambiguity doesn't exist. We might have to move collection protocols to the input entity, or handle them specially for ISA and combine them with the protocols on the preceding entity, because their inputs and outputs don't align with the other protocols. mouse_tissue_collection has the mouse as input and the organ as output, but tissue_quench and frozen_tissue_grind have the organ as input and the ground up organ as output.

I just realized another issue now. We put the measurement protocol on the measurement records and not the entity directly, but for ISA there are no measurements. That is to say they don't have any specific file or format for measurements. You basically just describe the protocols and list the files as outputs and then those files serve as the measurements. You don't have to pick one measurement like the Workbench makes you do and then put it in a certain format. I think the easiest thing to do is to just put the measurement protocol on the entity for ISA submissions.

It still doesn't seem obvious to me where you break the entity/protocol chain into study and assay, but we can use this as context for our next meeting.

hunter-moseley commented 1 year ago

Looks like we can use or create parent-child relationships in combination with a protocol to create the equivalent ISA input -> protocol -> output logic. In certain circumstances, we may need to create dummy output entities to create a linear chain. The collection protocols can use parentID as input and the actual entity ID as the output.
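One possible reading of this, sketched in Python: each protocol on an entity becomes its own input -> protocol -> output step, the first step takes the parent as input, and dummy intermediates are created so that only the last step outputs the actual entity. The dummy naming scheme, the Process tuple, and the choice to put the real entity at the end of the chain are all assumptions for illustration; other conventions are possible.

```python
from typing import NamedTuple


class Process(NamedTuple):
    input_id: str
    protocol: str
    output_id: str


def entity_to_processes(entity_id: str, record: dict) -> list[Process]:
    """Expand one entity's protocol list into a linear chain of single-protocol steps.

    Entities without a parent_id are skipped here because they have no input.
    """
    parent = record.get("parent_id")
    protocols = record.get("protocol.id", [])
    if parent is None or not protocols:
        return []
    processes = []
    current_input = parent
    for index, protocol in enumerate(protocols):
        is_last = index == len(protocols) - 1
        # Hypothetical naming scheme for the dummy intermediate entities.
        output = entity_id if is_last else f"{entity_id}__after_{protocol}"
        processes.append(Process(current_input, protocol, output))
        current_input = output
    return processes


# Shortened stand-in for the colon sample record from the earlier example.
colon_record = {
    "parent_id": "mouse",
    "protocol.id": ["mouse_tissue_collection", "tissue_quench", "frozen_tissue_grind"],
}
steps = entity_to_processes("colon", colon_record)
```

With this convention every step has exactly one protocol, and a child entity's parent_id still points at the final output of the chain.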

Please correct me if I am missing something.

Also, it looks like ISA study ends with a collected sample (aliquot in this example) and the ISA assay begins with the same collected sample (aliquot in this example).
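If that holds, splitting a combined chain is just a matter of picking the boundary sample and including it on both sides. A minimal sketch, with a hypothetical function name and the chain written as a flat list of alternating entities and protocols/processes:

```python
def split_study_assay(chain: list[str], boundary: str) -> tuple[list[str], list[str]]:
    """Split a chain so the study ends with the boundary sample and the assay starts with it."""
    index = chain.index(boundary)  # raises ValueError if the boundary is not in the chain
    study = chain[: index + 1]
    assay = chain[index:]
    return study, assay


# The combined ISA chain from the example in this thread.
isa_chain = [
    "culture12", "growth protocol", "S-0.2-aliquot11", "protein extraction",
    "S-0.2", "iTRAQ labeling", "JC_S-0.2/Pool3", "norm3", "datatransformation3",
]
study, assay = split_study_assay(isa_chain, "S-0.2-aliquot11")
```

This does not answer how to choose the boundary in our data; it only shows that the two halves share it.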


-- Hunter Moseley, Ph.D. -- Univ. of Kentucky Professor, Dept. of Molec. & Cell. Biochemistry / Markey Cancer Center / Institute for Biomedical Informatics / UK Superfund Research Center Not just a scientist, but a fencer as well. My foil is sharp, but my mind sharper still.

Phone: 859-218-2964 (office) 859-218-2965 (lab) 859-257-7715 (fax) Web: http://bioinformatics.cesb.uky.edu/ Address: CC434 Roach Building, 800 Rose Street, Lexington, KY 40536-0093
