Separate Quality Flag columns for each quality flag or each type of quality flag

Big-Life-Lab / PHES-ODM

The Public Health Environmental Surveillance Open Data Model (PHES-ODM, or ODM). A data model, dictionary and support tools for environmental surveillance.

Creative Commons Attribution Share Alike 4.0 International


Separate Quality Flag columns for each quality flag or each type of quality flag #231

Closed heather-i closed 1 year ago

heather-i commented 2 years ago

In order to further automate our data management and reporting systems to be compatible with the Ontario Data Template based on the Open Data Model, it would be helpful to split out the Quality Flag column in the WWMeasure tab to more easily support cases where there is more than one quality flag for a data point. According to the Protocol Evaluations for qPCR Performance used by the MECP and OCWA (attached), there are the following qualifiers:

- B - background contamination observed (greater than 5 Cq away from samples, indicating that it would not affect quantification but it is present)
- FI - failed inhibition
- AI - addressed inhibition
- ND - non-detect
- J - concentration estimate extrapolated based on extending experiment-specific standard curve to the y-intercept
- UJ - "Trace" amplification of target; concentration estimate extrapolated based on extending experiment-specific standard curve beyond the y-intercept

It would be easiest if all of these had their own column that could be true/false, or if they were at least grouped by type of quality flag. For example, one column for flags that correspond to concentration (ND, J, and UJ), since each sample can only be one of these, and another column for flags that correspond to inhibition (FI and AI), since each sample can only have one of these.

I know this is very Ontario-specific so I understand if it is not possible to make these changes but if it is possible to include them and have them be optional for users of the Open Data Model who do not need these separated, then that would be wonderful.

Thanks!

20220128_ProtocolEvaluationsqPCRPerformance_January2022.pdf

jeandavidt commented 2 years ago

There was a conversation on this topic in today's ODM Implementation meeting. Multiple people present agreed that multiple quality flags could be relevant for a single measure (e.g., control curve quality concerns + inhibition quality concerns). Though qPCR was at the forefront of this discussion, it is easy to imagine that this could be the case for other types of measures as well (sequencing, sampling, etc.)

There are multiple ways to deal with a measure having many quality concerns. Some of them would require modifying the structure of the ODM slightly.

(A) Option that doesn't require changes to the ODM structure.

Only recording the "most important" quality flag for a given measure.

Advantages: Straightforward to implement (only documentation is needed).

Drawbacks: It is unclear, and possibly impossible, for users to determine whether a given flag is more or less important than another.

(B) Option that requires minimal changes

Having multiple qualityFlag fields (qualityFlag1, qualityFlag2, ...) in the measures and samples tables.

Advantages: This would be very straightforward to implement, and it wouldn't require the creation of new tables.

Drawbacks: 1) It widens the report tables by several fields. 2) It might seem like they should be interpreted as having a hierarchy (is qualityFlag1 more important than qualityFlag3 because it is ranked first?). 3) It opens the door to adding lists in many other places in the ODM, potentially mucking up the overall structure.

Options that require adding new tables

The linkages between quality flags and measures / samples could be done in (at least) 2 ways:

Option C: "Loose" linkage

This option:

- Removes the qualityFlag field from the measures and samples tables
- Creates a table with the following fields:
  - unique id (primary key)
  - measureID
  - sampleID
  - qualityFlag

Users would thus create as many rows as they need (each linked with the specific measure or sample they want to qualify). They would need to add the quality information into this new table themselves.
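As a rough illustration of Option C, here is a minimal sketch using Python's built-in sqlite3 module. The table name `qualityReports`, the column types, and the CHECK constraint (exactly one of measureID or sampleID per row) are assumptions made for the example; only the four fields come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical quality table: one row per quality flag.
# The CHECK constraint (an assumption) forces each row to point at
# either a measure or a sample, never both and never neither.
conn.execute(
    """
    CREATE TABLE qualityReports (
        qualityID   INTEGER PRIMARY KEY,
        measureID   TEXT,
        sampleID    TEXT,
        qualityFlag TEXT NOT NULL,
        CHECK ((measureID IS NULL) != (sampleID IS NULL))
    )
    """
)

# A measure with two flags and a sample with one flag.
conn.executemany(
    "INSERT INTO qualityReports (measureID, sampleID, qualityFlag) VALUES (?, ?, ?)",
    [("m1", None, "J"), ("m1", None, "AI"), (None, "s1", "B")],
)

# All flags attached to measure m1, in insertion order.
flags_m1 = [
    row[0]
    for row in conn.execute(
        "SELECT qualityFlag FROM qualityReports WHERE measureID = ? ORDER BY qualityID",
        ("m1",),
    )
]
print(flags_m1)
```

A measure with no quality concerns simply has no rows in this table, so nothing needs to be widened in the measures or samples tables.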

Option D: "Hard" linkage

This option takes advantage of the fact that quality flags are pre-determined by the dictionary. Therefore, all the possible combinations of quality flags that could be reported can be inferred from the contents of the dictionary itself. Say, in measure x's qualitySet, that there are three possible flags: A, B, and C. We thus immediately know that the quality concerns for a measurement of x can only be one of {[], [A], [B], [C], [A, B], [A, C], [B, C], [A, B, C]}. A new table (say, qfCombinations) could be automatically generated from the contents of each qualitySet, with each combination having its unique id.

Then, the measure and sample tables only need to replace their qualityFlag field with a qfCombinationID field to link the measure / sample to the correct combination of flags.

Advantages: It maintains an explicit link between the measures and samples tables and the quality measures, and it allows users to keep filling all their values only in the samples and measures tables.

Disadvantages: The number of combinations doubles with each new flag in a set (2^n combinations for n flags), which could become unwieldy over time, and it adds another step to the dictionary generation (i.e., every time a quality flag is added to a qualitySet, new combinations must also be added to the qfCombinations table).
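To make the growth concrete, here is a small sketch of how the combinations for Option D could be enumerated with itertools. The function name `flag_combinations` and the integer ids are illustrative assumptions, not part of the ODM.

```python
from itertools import combinations


def flag_combinations(flags):
    """Return every subset of the given flags, from the empty set to the full set."""
    ordered = sorted(flags)
    return [
        combo
        for size in range(len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


# Hypothetical qfCombinations table: id -> combination of flags.
qf_combinations = dict(enumerate(flag_combinations(["A", "B", "C"])))

for combo_id, combo in qf_combinations.items():
    print(combo_id, list(combo))

# The table size doubles with every flag added to the set.
for n in range(1, 7):
    print(n, "flags ->", len(flag_combinations(range(n))), "combinations")
```

With the six qPCR qualifiers listed at the top of this issue, such a table would already hold 64 rows, although some combinations (for example ND together with J) could never occur in practice.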

These options aren't exhaustive, but hopefully they get the conversation rolling on the best way forward :)

jeandavidt commented 2 years ago

Another aspect of quality that was mentioned in the meeting was how to report LOD / LOQ for measures.

The dictionary is flexible in this regard, so it would probably be good to agree on a common way of doing things.

Here are all the options I can think of:

- Add loq and lod values as rows in the measures table.
- Add loq and lod as methodSteps in the MethodSet table. Then, link the measures that use that assay with the correct methodSetID.
- Add loq and lod values as rows in the measures table AND link them to the relevant lab measurement with a measureSetID. Thus, by looking through all the measures of a given measureSetID, we would find for each qPCR measurement:
  1. the qPCR measurement
  2. its associated lod
  3. its associated loq
- Add a field to the measures table for lod and another for loq.
- If Option C is selected for the quality flags, lod and loq could be added as fields there.

The issue I see with lod and loq being in the quality table is that it then becomes hard to link the value to the right unit. If lod and loq are properties of the measure, then the unit can be assumed to be the same as for the reported value. If lod and loq are their own rows, either in MethodSteps or measures, they can have their own unit without having to worry about the unit used by specific measurements.
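As a sketch of the measureSetID approach, the snippet below stores lod and loq as ordinary rows that carry their own unit and groups them by set. The field names (`measureID`, `measureSetID`, `measure`, `value`, `unit`) and the sample values are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical long-format measures rows; lod and loq are rows like any other.
measures = [
    {"measureID": "m1", "measureSetID": "set1", "measure": "covN1", "value": 120.0, "unit": "gc/L"},
    {"measureID": "m2", "measureSetID": "set1", "measure": "lod", "value": 50.0, "unit": "gc/L"},
    {"measureID": "m3", "measureSetID": "set1", "measure": "loq", "value": 150.0, "unit": "gc/L"},
]


def limits_by_set(rows):
    """Collect the lod/loq rows of each measure set, keeping each value with its unit."""
    limits = defaultdict(dict)
    for row in rows:
        if row["measure"] in ("lod", "loq"):
            limits[row["measureSetID"]][row["measure"]] = (row["value"], row["unit"])
    return dict(limits)


print(limits_by_set(measures))
```

Because each limit row has its own unit field, a reader never has to assume that the limit shares the unit of the reported measurement.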

heather-i commented 2 years ago

Thank you for summarizing the discussion from the ODM implementation meeting and clearly describing the advantages and drawbacks of each option!

I will break up my thoughts into Ontario Data Template/MECP-specific notes and general ODM notes:

Ontario Data Template/MECP-specific notes

Option B would be my preference.

Pros:

allows for the immediate inclusion of multiple quality flags associated with RT-qPCR data
no new table to try to format and maintain
easiest to fit into our lab's (University of Waterloo) current data analysis and reporting workflow

Cons:

leaving it open-ended as to which quality flag goes into column1, column2, etc. would be problematic for filtering data and using it further. This could be remedied by allowing users (MECP, managers of the Ontario Data Template) to define which quality flag corresponds to which column (ex. qualityflag1 = quantity flags: ND, <LOD, <LOQ; qualityflag2 = inhibition flags: FI, AI; and qualityflag3 = contamination flags: B)
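A minimal sketch of such a user-defined column assignment follows. The mapping `FLAG_COLUMN` and the function `assign_flag_columns` are hypothetical names; the grouping itself mirrors the example in the previous paragraph.

```python
# Hypothetical flag -> column mapping, defined by the template managers.
FLAG_COLUMN = {
    "ND": "qualityflag1", "<LOD": "qualityflag1", "<LOQ": "qualityflag1",  # quantity flags
    "FI": "qualityflag2", "AI": "qualityflag2",                            # inhibition flags
    "B": "qualityflag3",                                                   # contamination flags
}


def assign_flag_columns(flags):
    """Place each flag in its designated column; reject two flags of the same type."""
    row = {"qualityflag1": None, "qualityflag2": None, "qualityflag3": None}
    for flag in flags:
        column = FLAG_COLUMN[flag]  # raises KeyError for a flag outside the mapping
        if row[column] is not None:
            raise ValueError(f"{flag} conflicts with {row[column]} in {column}")
        row[column] = flag
    return row


print(assign_flag_columns(["ND", "AI"]))
```

Fixing the meaning of each column this way keeps filtering simple, at the cost of a mapping that has to be maintained for every assay type.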

General ODM notes

If I understand correctly, the ODM is set up to be formatted for a number of different users, and so the WWMeasure table can be used for data produced by labs measuring any biologic, toxin, or other health risk, using any number of techniques or assays. Therefore, it needs to be both very flexible, to accommodate all possible uses/users, and customizable, so that it is able to capture highly precise data for each use/user. This may be a very naïve understanding of the ODM, so please take the following comments lightly.

I propose to give users (MECP in my case) the ability to select a customized WWMeasure table based on the assayMethod. For example, if the assay method is RT-qPCR, the WWMeasure table will have quality flag columns for this technique but if the assay method is sequencing, the WWMeasure table would be altered to accommodate that data type.

Advantages: ensures data from any assay is being recorded with all caveats/flags so that only the highest quality data is being used for interpretation; increases user friendliness when reporting as the columns are understood by the labs/persons producing the data from each type of assay.

Drawbacks: I anticipate that this would be a lot of work to coordinate, and I do not want to take that lightly.

LOD/LOQ thoughts

I believe this should be part of the Quality flag column (as it currently is in the ODT) as these values can change over time as improvements/changes to assays occur, so it is easier to note if each of the values reported in the WWMeasure table are below the LOD or LOQ at the time of reporting.

Note: I am also in the process of communicating this to Vince Pileggi and Sherif Hegazy (MECP; points of contact for the Ontario Data Template) so again please ignore if these changes/thoughts are not relevant to the ODM itself.

DougManuel commented 2 years ago

Just a clarification that this issue discussion is referencing versions 1.1 and 2.0. @heather-i references are mostly about v1.1. @jeandavidt references are mostly about version 2.0.

Version 2.0 expands the dictionary and the model quite a bit. The name change from WWMeasure (wastewater measure) to measures reflects that measures can be for water, air or surface and more robustly include population measures (testing, hospitalization, etc.).

In version 2.0: 1) LOD and LOQ change from headers within the measures tables to what is described by @jeandavidt in this issue (a row in the measure table linked to measures using measureSetID or within the methods table). There are a few reasons for this change. Most notably, LOD and LOQ are relevant to specific measures, such as PCR measures. There is a considerable increase in measures, such as chemical and physical properties, where LOD and LOQ don't apply.

2) Quality sets (qualitySet) are introduced. Currently, four quality sets are described, but more can be added at any time: Generic Quality Flag Set; PCR Quality Set; Sample Quality Set; Sequencing Quality Set. Each measure can have a quality set.

As an aside, in Version 2.0, there are also aggregation sets (aggSet) and unit sets (unitSet). So, each measure has an aggregation set, unit set and quality set. A unit set for temperature (degrees Celsius) is different from a unit set for SARS-CoV-2 N1 gene region detection by PCR (gene copies per L, gene copies per copies of PMMoV, etc.)

3) Measure sets (measureSetID) are introduced. Measure sets allow groups of measures to be associated with each other. There are several use cases for measure sets, but they are generic and flexible. Associating LOQ and LOD with a group of measures was one identified use case. Other use cases include:

variants - when performing variant testing, there may be multiple variants identified, etc.
controls - generating and reporting Ct curves or performing dilutions, spike samples, etc.

Measures and samples can also be grouped, but there are slightly different considerations. Samples have the provision for having parent, child, combined samples, etc. Methods have methodSteps that can be grouped, and then groups can be combined. For example, there could be several RNA extraction steps that can be grouped together and then added with other groups of steps for, say, concentration, PCR, etc. to form an overall assay method.

DougManuel commented 2 years ago

Remember that we’ll want our quality measures to work in both ‘long’ and ‘wide’ data formats. I don’t foresee major issues with any proposed solutions, but there are a few considerations and implementation issues. We’ll likely want the core ODM development team to review how to generate long names before we sign off on a reporting approach.

Long data is the main ODM data format, but version 2 provides better support for wide tables with an explicit formula for generating wide names. Variable names for wide tables can get very long because the names are a concatenation of attributes. See below. This means that we’ll want short part names for quality measures.

The figure below is preliminary and not quite up-to-date. Regardless, the figure informs the general approach.

DougManuel commented 2 years ago

For Option B, what is the implementation? Do we need key:value pairs? Maybe even key:value:unit (for numerical quality measures)? @mathew-thomson @heather-i

1) Key:value pairs. Headers: qf1_partID, qf1_value, qf2_partID, qf2_value. Example: qf1_partID = J, qf1_value = TRUE.

2) Use the partID as the name, and then the entry is the value. Header: qf1_J. Example: qf1_J = TRUE.

3) Have the quality measure as the value and assume TRUE. Header: qf1. Example: qf1 = J.

The above approaches also need to work for quality measures that are not Boolean but real numbers: measures such as LOQ, concentration estimate, etc. Key:value pairs work for these measures, as does implementation 2. The value measures need an accompanying unit and maybe also an aggregation.

A challenge for implementation 2 is a proliferation of qf1_ variables as headers. Remember that there are measures other than PCR. Currently, in version 2, there are 22 quality measures, which would mean adding 22 variable headers to the measures table -- most of which are not relevant for any one measure. Now you've got a wide-table format instead of a long-table format.
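The three implementations can be compared side by side with a short sketch. The function names are hypothetical, and the header patterns follow the examples given above (qf1_partID / qf1_value, qf1_J, qf1).

```python
def to_key_value(flags):
    """Implementation 1: a partID column and a value column per flag slot."""
    row = {}
    for slot, (part_id, value) in enumerate(flags, start=1):
        row[f"qf{slot}_partID"] = part_id
        row[f"qf{slot}_value"] = value
    return row


def to_named_columns(flags):
    """Implementation 2: the partID becomes part of the header; the entry is the value."""
    return {f"qf{slot}_{part_id}": value for slot, (part_id, value) in enumerate(flags, start=1)}


def to_assumed_true(flags):
    """Implementation 3: the partID is the entry; TRUE is implied, so only set flags are kept."""
    set_flags = [part_id for part_id, value in flags if value]
    return {f"qf{slot}": part_id for slot, part_id in enumerate(set_flags, start=1)}


flags = [("J", True), ("AI", True)]
print(to_key_value(flags))
print(to_named_columns(flags))
print(to_assumed_true(flags))
```

Only the first two shapes can carry a number such as an LOQ value in place of TRUE; the third has nowhere to put it.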

jeandavidt commented 2 years ago

So for version 2.0, we propose going for option C:

A new quality table is created
Each row of the quality table creates a new quality flag
Quality flags can be linked to either of the following (1 per row): measure report, measure set report, sample report
Any sample, measure set or measure can have as many flags as required

jeandavidt commented 2 years ago

Addressing this problem raised the issue of measure sets vs sets of quality flags. The question was: since we are now linking several quality flags to a measure, isn't this the same as creating a measure set, but for quality? I looked into this, and they turn out to be different. The difference is in the number of links the different entities can have together:

A measure, measure set or sample can be linked to many quality flags
A measure set can be linked to many measures
One quality flag can only be linked to a single measure or measure set at a time.
At the moment, a measure can only be linked to a single measure set at a time.

So:

the linkage between measures and quality flags is 1:n and
the linkage between measures and measure sets is n:1, i.e., the directionality of the one-to-many relationship is inverted. So we can't use the quality table to replace the measure set table.

jeandavidt commented 2 years ago

But a thing we might want to do is to allow one measure to belong in many sets (say, a set of replicates and a set of all the measures that were done with the same calibration curve). For that, we need to turn the relationship between measures and measureSets from n:1 to n:n

n:n linkages require a lookup table. Thus, the setup would be the following:

The measureReport Table loses its measureSetReportID field
The measureSetReport table is still there to store the unique ID of each new set. We can also choose to give names to the sets and / or a type (e.g., qualitycontrolSet or ReplicateSet or whatever)
A new lookup table (maybe, MeasureSetHasMeasures) that has no primary key field and 2 foreign keys as fields (MeasureSetID, measureReportID)
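Here is a hedged sketch of that setup using Python's sqlite3 module. The table and key names follow the list above; the column types, the `setType` column and the sample ids are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE measureReport (
        measureReportID TEXT PRIMARY KEY
    );
    CREATE TABLE measureSetReport (
        measureSetID TEXT PRIMARY KEY,
        setType      TEXT  -- hypothetical descriptor, e.g. ReplicateSet
    );
    -- Lookup table: no primary key field of its own, only two foreign keys.
    CREATE TABLE MeasureSetHasMeasures (
        measureSetID    TEXT NOT NULL REFERENCES measureSetReport (measureSetID),
        measureReportID TEXT NOT NULL REFERENCES measureReport (measureReportID)
    );
    """
)

conn.executemany("INSERT INTO measureReport VALUES (?)", [("m1",), ("m2",)])
conn.executemany(
    "INSERT INTO measureSetReport VALUES (?, ?)",
    [("replicates1", "ReplicateSet"), ("calibration1", "qualitycontrolSet")],
)
# m1 belongs to two sets; the replicate set holds two measures.
conn.executemany(
    "INSERT INTO MeasureSetHasMeasures VALUES (?, ?)",
    [("replicates1", "m1"), ("replicates1", "m2"), ("calibration1", "m1")],
)

sets_for_m1 = [
    row[0]
    for row in conn.execute(
        "SELECT measureSetID FROM MeasureSetHasMeasures "
        "WHERE measureReportID = ? ORDER BY measureSetID",
        ("m1",),
    )
]
print(sets_for_m1)
```

The same lookup table answers the reverse question (which measures belong to a given set) by filtering on measureSetID instead.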

I put these changes into the ERD for review and discussion

DougManuel commented 2 years ago

All points by @jeandavidt look good.

The first task we need to complete is how to store data that address required use cases for quality measures. What @jeandavidt suggested does address use cases that have been discussed - in particular, the main issue thread and the ability to store multiple quality measures.

The second task is to recommend easy data collection for common uses. @heather-i has a good point that Option B is easy to implement and understand for many users and use cases.

As an interim step, we could implement Option B for version 1.2 and then have a more robust qualityFlag table for version 2.
In practice, people could generate input templates or data collection that has option B as wide variable names.
- mixing wide and long variables works if people use a data collection table for selected measures such as PCR alleles, but the solution is not robust for multiple measures with different qualityFlagSets.
- we probably want default ODM templates that are robust and so they wouldn't include this practice.

DougManuel commented 2 years ago

We will need to decide whether we want the 'reportable' attribute and how that would be used. From the discussions:

reportable is an important attribute that we want to keep. This attribute is widely requested and it is also a helpful flag that there is or should be a corresponding entry in the qualitySet table.
There is the greatest support for having it as a boolean. True = 'report'; False = 'don't report'. However, some people would prefer more categories.
It is possible to have a more nuanced summary reportable attribute in the qualitySet table and to support a nice, clean boolean 'reportable' attribute in the measure and sample tables.

DougManuel commented 2 years ago

I tend to support updating the measureSetReport table to allow n:n, but we haven't received many requests that require this more robust structure. However, the more robust structure:

is more consistent with the approach of the ODM model.
similar to qualityFlag, we could have a flag in the measures table that says the measure is part of a measureSet. This helps users know to add entries to measureSet (if data generators) or look at the measureSet table (if data users).
for data storage, we would remove measureSetReportID from the measures table. But similar to quality flags, data input tables could have this measure for users who have only one measureSet for any specific measure.

Regardless, the idea to add additional descriptors makes sense (e.g., qualitycontrolSet or ReplicateSet or whatever) and those descriptors could be added to the existing measureSet table.

I am not sold on MeasureSetHasMeasures. I find the 'has' tables are conceptually great, but many people are not familiar with them or their application.

mathew-thomson commented 2 years ago

Thank you @jeandavidt for laying things out so eloquently, and to build on @DougManuel's point re: reportable - one thing that was brought up in our discussion was to continue to use reportable in the measures table and add it to the samples table. To add potential levels of nuance, while still maintaining a final ease of interpretation, we could add a severity column to the proposed qualityReports table with a traffic-light-style tier system. This wouldn't replace reportable, and would be optional, but it would provide some additional detail on how important a given flag is outside the final yes-no reportable decision.
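As a sketch of how an optional severity tier could sit alongside reportable, the snippet below summarizes the most severe tier among a measure's quality rows. The tier names (green, yellow, red), their ordering and the row layout are assumptions; the thread only specifies a traffic-light-style system.

```python
# Hypothetical traffic-light tiers, ordered from least to most severe.
SEVERITY_ORDER = {"green": 0, "yellow": 1, "red": 2}


def worst_severity(quality_rows):
    """Return the most severe tier among the rows, or None if no row carries one."""
    severities = [row["severity"] for row in quality_rows if row.get("severity")]
    if not severities:
        return None
    return max(severities, key=SEVERITY_ORDER.__getitem__)


# Quality rows for one measure; severity is optional on each row.
quality_rows = [
    {"measureID": "m1", "qualityFlag": "B", "severity": "green"},
    {"measureID": "m1", "qualityFlag": "FI", "severity": "red"},
    {"measureID": "m1", "qualityFlag": "J"},
]

print(worst_severity(quality_rows))
```

The boolean reportable decision stays in the measures and samples tables and is deliberately not derived from the tier here, since the proposal keeps the two separate.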

I also am somewhat conservative about blowing up the measureSets structure to allow for n:n relationships with measures, but if the labs are supportive of this kind of infrastructure then I think it would be great to build it in before we launch v2.0.

DougManuel commented 1 year ago

Version 2 will have a specific quality table that can record multiple quality measures for any sample or measure.