episphere / connect

Connect API for DCEG's Cohort Study

RCA Variables and Concept IDs to submitParticipantData API #675

Closed cunnaneaq closed 9 months ago

cunnaneaq commented 1 year ago

The below variables have been added to the data dictionary. The sites will send as many data points as possible for a participant diagnosed with cancer through the submitParticipantData API. Happy to provide more info as needed. Thanks!


Concept ID | Question Text | Variable Name
-- | -- | --
131461026 | How many occurrences of cancer has the participant had? | RCAOper_NumCancOcc_v1r0
525972260 | Was the Connect participant diagnosed with cancer? | RCAOper_CancerFlag_v1r0
345545422 | What date was the participant identified as having cancer (the date of case ID)? | RCAOper_CancerDt_v1r0
939782495 | Anal | RCAOper_AnalCancer_v1r0
135725957 | Bladder | RCAOper_BladderCancer_v1r0
518416174 | Brain | RCAOper_BrainCancer_v1r0
847945207 | Breast | RCAOper_BreastCancer_v1r0
283025574 | Cervical | RCAOper_CervicalCancer_v1r0
942970912 | Colon/rectal | RCAOper_ColonCancer_v1r0
596122041 | Esophageal | RCAOper_EsophCancer_v1r0
489400183 | Head and neck (Including cancers of the mouth, sinuses, nose, or throat. Not including brain or skin cancers.) | RCAOper_HeadNeckCancer_v1r0
863246236 | Kidney | RCAOper_KidneyCancer_v1r0
607793249 | Leukemia (blood and bone marrow) | RCAOper_Leukemia_v1r0
532172400 | Liver | RCAOper_LiverCancer_v1r0
754745617 | Lung or bronchial | RCAOper_LungCancer_v1r0
665036297 | Non-Hodgkin's lymphoma | RCAOper_NHLymphoma_v1r0
200837530 | Lymphoma | RCAOper_Lymphoma_v1r0
990319383 | Melanoma (skin) | RCAOper_Melanoma_v1r0
487917585 | Non-melanoma skin (basal or squamous) | RCAOper_NonMelanoma_v1r0
603181162 | Ovarian | RCAOper_OvarianCancer_v1r0
482225200 | Pancreatic | RCAOper_PancreaCancer_v1r0
295976386 | Prostate | RCAOper_ProstateCancer_v1r0
764891959 | Stomach | RCAOper_StomachCancer_v1r0
248374037 | Testicular | RCAOper_TesticCancer_v1r0
139822395 | Thyroid | RCAOper_ThyroidCancer_v1r0
723614811 | Uterine (endometrial) | RCAOper_UterineCancer_v1r0
807835037 | Other \|__\| | RCAOper_OtherCancer_v1r0
868006655 | Primary Cancer Site- Another type of cancer: Please describe [text box] | RCAOper_OthCancerDesc_v1r0
178420302 | Unavailable/Unknown | RCAOper_UnknownCancer_v1r0
599128909 | What is the ICD code for the primary cancer site diagnosed on 'date of case ID'? | RCAOper_CancSiteICD_v1r0
129461711 | Which version of ICD coding did you use to report the primary cancer site? | RCAOper_ICDCodeVer_v1r0
457270069 | What is the preliminary stage information for the cancer diagnosed on 'date of case ID'? | RCAOper_PrelimCancStg_v1r0
772638539 | How many tumors were reported for the cancer diagnosed on 'date of case ID'? | RCAOper_TumorNum_v1r0

cunnaneaq commented 1 year ago

This is the current version of the JSON - https://nih.app.box.com/file/1312388711307 @mnataraj92 is updating the data dictionary now

JoeArmani commented 1 year ago

Hi @cunnaneaq, Thanks for the additional information. The doc answers most of my preliminary questions. Does the November scope include the need to trigger any surveys, or is it strictly receiving the data and putting it into the correct data structure?

cunnaneaq commented 1 year ago

@JoeArmani focusing on receiving data in the correct structure for now. Tentative timeline for the survey triggers is Feb-March '24.

mnataraj92 commented 1 year ago

dictionary is updated w/nesting in column AF

JoeArmani commented 1 year ago

Perfect, thanks @cunnaneaq and @mnataraj92.

(1) I understand there can be many 'occurrences' per participant over time. My question is whether more than one 'occurrence' can be included each time the site submits data for a participant.

(2) Documenting that none of these are state variables in the data dictionary. These new variables are part of a new data structure in the participant profile. Please correct me if any of them are state variables.

(3) Will this data come in flat from the sites, or will it be nested? Examples below. Either is completely fine. It just impacts how I design the operation server-side. EDIT/UPDATE 11/2: I was added to the email thread about this issue. Based on that conversation, the sites will send nested data. This fits with other data submitted to the same endpoint.

/*Site Submission Examples:*/

// Flat (all variables at the same level, we transform server-side):
data: {
    525972260: 353358909,
    637153953: [] /*site WOULD NOT submit the 'occurrence' cId, we would generate the structure server-side*/,
    345545422: Timestamp,
    599128909: ICD-O-3,
    740819233: [] /*site would not submit the 'primary site of cancer' cId, we would generate the structure server-side*/,
    135725957: 353358909 /*cancer types listed at same level as everything else*/,
    863246236: 104430631,
    // etc...
}

// Nested (site submits the data in the pre-designed nested structure):
// EDITED: 11/3 with clearer understanding of data structure. 
data: {
    525972260: 353358909,
    637153953: [ /*site would submit the 'occurrence' cId with the pre-structured data*/
        {
            345545422: Timestamp,
            599128909: ICD-O-3,
            740819233: { /*site would submit the 'primary site of cancer' cId with the pre-structured data*/
                135725957: 353358909 /*cancer types nested the way they will be in Firestore*/,
                863246236: 104430631,
                // etc...
            },  
            // etc...
        },
    ],
}
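
For reference, here is a minimal sketch of the server-side transform the flat option would have required. It is illustrative only: `nestFlatSubmission` and the constant names are hypothetical, `SITE_CIDS` is a subset of the cancer-site concept IDs from the table above, and a flat payload can only describe a single occurrence in this sketch.

```javascript
// Concept IDs taken from the examples in this thread.
const CANCER_FLAG = "525972260";  // Was the participant diagnosed with cancer?
const OCCURRENCES = "637153953";  // 'occurrence' array
const PRIMARY_SITE = "740819233"; // 'primary site of cancer' object

// Subset of primary cancer site concept IDs (assumption: the real list is longer).
const SITE_CIDS = new Set(["939782495", "135725957", "863246236", "990319383", "200837530"]);

// Hypothetical helper: move a flat submission into the nested shape shown above.
function nestFlatSubmission(flat) {
  const occurrence = { [PRIMARY_SITE]: {} };
  const nested = {};
  for (const [cid, value] of Object.entries(flat)) {
    // The structural concept IDs are generated here, not copied from the input.
    if (cid === OCCURRENCES || cid === PRIMARY_SITE) continue;
    if (cid === CANCER_FLAG) {
      nested[cid] = value;                   // stays at the top level
    } else if (SITE_CIDS.has(cid)) {
      occurrence[PRIMARY_SITE][cid] = value; // cancer types go under 'primary site'
    } else {
      occurrence[cid] = value;               // everything else belongs to the occurrence
    }
  }
  nested[OCCURRENCES] = [occurrence];
  return nested;
}
```

Because the nested option lets a site send the array directly, none of this transform is needed when sites submit nested data.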

(4) Regarding 'no' values: Do we want to include the 'no' value explicitly, or do we want that data point to be null? Example: if a participant has a 'yes' value for melanoma and a 'no' value for lymphoma, do we want option A or B? Option A probably provides more complete data for the analytics team. We do not need to require sites to submit 'no' values in either case. EDIT/UPDATE 11/2: I will plan on 'Option A' for this data unless I hear otherwise. This seems to be in line with other data in our system (explicit 'no' values are preferred).

/*Handling 'No' Values:*/

Option A:
990319383: 353358909,
200837530: 104430631,
// etc...all no values are explicitly listed in this option.

Option B:  
990319383: 353358909,
// all 'no' values are omitted (null) in this option.
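
If Option A is chosen without requiring sites to send 'no' values, the explicit 'no' entries would have to be filled in server-side. A minimal sketch, assuming 353358909 means 'yes' and 104430631 means 'no' as in the examples above; `fillExplicitNo` is a hypothetical helper and `SITE_CIDS` is only a subset of the cancer-site concept IDs:

```javascript
const YES = 353358909; // 'yes' concept ID used in the examples above
const NO = 104430631;  // 'no' concept ID used in the examples above

// Subset of cancer-site concept IDs, for illustration only.
const SITE_CIDS = ["990319383", "200837530", "135725957", "863246236"];

// Hypothetical helper: keep every submitted value and add an explicit 'no'
// for each cancer site the submission left out (Option A).
function fillExplicitNo(submittedSites) {
  const filled = {};
  for (const cid of SITE_CIDS) {
    filled[cid] = cid in submittedSites ? submittedSites[cid] : NO;
  }
  return filled;
}

// Usage: only melanoma was submitted as 'yes'; the other sites become 'no'.
const optionA = fillExplicitNo({ 990319383: YES });
```

Option B corresponds to storing the submitted object unchanged.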

JoeArmani commented 12 months ago

Updates:
• 131461026 has been removed.
• If that count is needed, we can get that value from the number of 'occurrence' objects.

cunnaneaq commented 12 months ago

Hi Joe, I'll do my best to answer your questions but also welcome input/clarification from @mnataraj92 and @Davinkjohnson

  1. Yes, we'd like to prepare for this scenario. I created a test plan for the sites to send data in different scenarios and think it may be a helpful reference. We think that a participant could have 2 occurrences in the same API push either with the same date of case ID or with different dates of case ID.

  2. @Davinkjohnson can you confirm Joe's statement that none of the variables submitted will be state variables?

  3. Nested, please!

  4. I think @mnataraj92 was working on confirming this.

JoeArmani commented 12 months ago

Thanks @cunnaneaq!

mnataraj92 commented 12 months ago

Hi @JoeArmani! for #4 regarding "no" values, I agree that explicit 'no' values are preferred. I was just wondering if any of these were going to be default variables? If so, I can indicate that in the dictionary.

JoeArmani commented 11 months ago

Hi @mnataraj92, We already discussed this on Teams - thanks for answering my questions about default variables. Please disregard. I'm just adding here for documentation purposes.

EDITED 11/7: This should be the complete list of cancer sites. These variables will NOT be included unless specified in the POST request. Missing values in the POST request will be null in the participant profile. "939782495": 104430631, "135725957": 104430631, "518416174": 104430631, "847945207": 104430631, "283025574": 104430631, "942970912": 104430631, "596122041": 104430631, "489400183": 104430631, "863246236": 104430631, "607793249": 104430631, "532172400": 104430631, "754745617": 104430631, "665036297": 104430631, "200837530": 104430631, "990319383": 104430631, "487917585": 104430631, "603181162": 104430631, "482225200": 104430631, "295976386": 104430631, "764891959": 104430631, "248374037": 104430631, "139822395": 104430631, "723614811": 104430631, "807835037": 104430631, "178420302": 104430631

JoeArmani commented 11 months ago

IMPORTANT UPDATE: We have discussed this implementation and reached consensus that it is better suited to the updateParticipantData API. The submitParticipantData API is meant for single use (new participants). Development will happen on the 'update' API.

JoeArmani commented 11 months ago

Update per @cunnaneaq: 599128909, 129461711, and 772638539 have been omitted from this API update as of this morning, 11/8.

JoeArmani commented 11 months ago

Update: PR submitted for review https://github.com/episphere/connectFaas/pull/458

sonyekere commented 11 months ago

Per Michelle and based on the above, Aileen and Madhuri will be leading the dev and stage testing.

cunnaneaq commented 11 months ago

Decisions, documented here for confirmation (@JoeArmani):
• The API push will pass without stage information.
• If stage information is included, only text with fewer than 800 characters will pass.
• The API push will fail if primary cancer information is not provided.
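
The stage rule can be sketched as a small check. This is an illustration under assumptions, not the code in the PR: `isStageInfoAcceptable` is a hypothetical name, the stage text is assumed to sit on the occurrence object under concept ID 457270069, and an empty string is treated as acceptable because the decision only sets an upper bound.

```javascript
const STAGE_CID = "457270069"; // preliminary stage information
const MAX_STAGE_LENGTH = 800;  // stage text must be shorter than this

// Hypothetical check: stage is optional, but when present it must be
// a string with fewer than 800 characters.
function isStageInfoAcceptable(occurrence) {
  if (!(STAGE_CID in occurrence)) return true; // no stage information: pass
  const stage = occurrence[STAGE_CID];
  return typeof stage === "string" && stage.length < MAX_STAGE_LENGTH;
}
```

The primary cancer information requirement is a separate check and is not covered by this sketch.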

cunnaneaq commented 11 months ago

@jacobmpeters @KELSEYDOWLING7 Tagging you here for info on the RCA to prepare for QC rules

jacobmpeters commented 11 months ago

Thanks, Aileen.

jacobmpeters commented 11 months ago

I updated the flattening of the participants table in dev. All of the new variables that have been tested here should now be available in the FlatConnect.participants_JP table in dev.

Gbarra9 commented 11 months ago

PR review status - Reviewing

Gbarra9 commented 11 months ago

PR review status - Approved, left comments

JoeArmani commented 11 months ago

Here's the initial documentation for RCA Variables:
List of the variables with descriptions (Excel): https://nih.app.box.com/file/1367482173536
Doc with passing and failing examples for developers at the sites: https://nih.app.box.com/file/1367487664518

JoeArmani commented 11 months ago

This has been tested in dev. It's ready for stage.

FrogGirl1123 commented 11 months ago

> Hi @mnataraj92, We already discussed this on Teams - thanks for answering my questions about default variables. Please disregard. I'm just adding here for documentation purposes.
>
> EDITED 11/7: This should be the complete list of cancer sites. These variables will NOT be included unless specified in the POST request. Missing values in the POST request will be null in the participant profile. "939782495": 104430631, "135725957": 104430631, "518416174": 104430631, "847945207": 104430631, "283025574": 104430631, "942970912": 104430631, "596122041": 104430631, "489400183": 104430631, "863246236": 104430631, "607793249": 104430631, "532172400": 104430631, "754745617": 104430631, "665036297": 104430631, "200837530": 104430631, "990319383": 104430631, "487917585": 104430631, "603181162": 104430631, "482225200": 104430631, "295976386": 104430631, "764891959": 104430631, "248374037": 104430631, "139822395": 104430631, "723614811": 104430631, "807835037": 104430631, "178420302": 104430631

@JoeArmani could you please clarify "Missing values in the POST request will be null in the participant profile"? Based on prior posts, don't you mean "No" instead of "Null"?

FrogGirl1123 commented 11 months ago

As decided on the 11/29 punch list call, the nesting structure for the RCA data will be changed due to issues flattening arrays in the participants table. As a result of needing to make these changes, RCA won't be released until Jan.

@JoeArmani please reach out to Nicole when you return from PTO to review the changes to the data structure.

JoeArmani commented 11 months ago

@FrogGirl1123 Re: null primary cancer site variables - after a conversation with Davin, the decision was not to set missing primary cancer site (yes/no) variables so they will be missing (null) in Firestore if not sent in the POST request.

Davin's reasoning was twofold: (1) 'not yes' is not equivalent to 'no', and (2) this saves on data usage, and null values can still be interpreted as 'no' or 'not yes' if desired.

So, in its current form, data is validated for at least one yes primary cancer site value OR a non-empty string in CID 868006655. If one of those conditions is met -> pass. Else fail. Additionally, timestamp 345545422 is required. If missing or badly formatted, the request fails.
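
The rule above can be sketched roughly as follows. This is a hedged illustration, not the code in the PR: `validateOccurrence` is a hypothetical name, the function takes one occurrence's fields as a single flat object (the real nesting may differ), and the timestamp is assumed to be an ISO 8601 string, which the thread does not specify.

```javascript
const YES = 353358909;          // 'yes' concept ID
const CANCER_DATE = "345545422"; // date of case ID (timestamp)
const OTHER_DESC = "868006655";  // free-text description of another cancer type

// Primary cancer site concept IDs, as listed earlier in this thread.
const SITE_CIDS = [
  "939782495", "135725957", "518416174", "847945207", "283025574",
  "942970912", "596122041", "489400183", "863246236", "607793249",
  "532172400", "754745617", "665036297", "200837530", "990319383",
  "487917585", "603181162", "482225200", "295976386", "764891959",
  "248374037", "139822395", "723614811", "807835037", "178420302",
];

// Hypothetical validator: returns a list of error messages; an empty list means pass.
function validateOccurrence(fields) {
  const errors = [];

  // Condition 1: at least one 'yes' site OR a non-empty description string.
  const hasYesSite = SITE_CIDS.some((cid) => fields[cid] === YES);
  const description = fields[OTHER_DESC];
  const hasDescription = typeof description === "string" && description.length > 0;
  if (!hasYesSite && !hasDescription) {
    errors.push("primary cancer site is required");
  }

  // Condition 2: the timestamp is required and must parse (assumed ISO 8601).
  const timestamp = fields[CANCER_DATE];
  if (typeof timestamp !== "string" || Number.isNaN(Date.parse(timestamp))) {
    errors.push("timestamp 345545422 is missing or badly formatted");
  }

  return errors;
}
```

Site values that are absent are simply ignored here, which matches the decision to leave them null rather than default them to 'no'.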


Re: updated data structure. I'm back and nearly caught up - I'll reach out to you shortly, thanks.

JoeArmani commented 11 months ago

Update: we met today to discuss data issues experienced when flattening data to BQ. We'll follow up and finalize data structure decisions next week after a discovery period - I'll update documentation at that time if changes are made.

brotzmanmj commented 10 months ago

@JoeArmani Hi Joe, can you update with the latest status and the final decision for data structure? Thanks!

JoeArmani commented 10 months ago

Hi @brotzmanmj, definitely. We're meeting at noon with the goal of finalizing this data structure, I hope to have that update posted by EOD, thanks.

brotzmanmj commented 10 months ago

thanks!

JoeArmani commented 10 months ago

@brotzmanmj Hi Michelle, I’m about to write up the final details. We have one final question for you below.

Overview:
• This data will come in from sites through the updateParticipantData API and will be stored in a new Firestore collection ‘cancerOccurrence’.
• The participant’s Connect ID and the sequential number of the occurrence will be stored with this data (i.e. occurrence 1, 2, 3, etc.) alongside the detailed occurrence data.
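
To make the overview concrete, here is a rough sketch of how submitted occurrences might be turned into ‘cancerOccurrence’ documents. The function name and the field names `connectId` and `occurrenceNumber` are hypothetical placeholders; the actual field names and concept IDs are in the linked documentation, not in this thread.

```javascript
// Hypothetical helper: attach the participant's Connect ID and a sequential
// occurrence number to each occurrence submitted in one API push.
// existingCount is the number of occurrences already stored for the participant.
function buildOccurrenceDocs(connectId, occurrences, existingCount = 0) {
  return occurrences.map((occurrence, index) => ({
    ...occurrence,                              // detailed occurrence data
    connectId,                                  // placeholder field name
    occurrenceNumber: existingCount + index + 1, // 1, 2, 3, ...
  }));
}
```

Each returned object would then be written as its own document in the collection, so the table stays one row per occurrence instead of an array on the participant.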

Question: There is one item we’re requesting your decision on:

Concept ID 525972260 is a yes/no value with the question: “Was the connect participant diagnosed with cancer?”

Originally, this was scheduled as a data point for the participant profile - to identify a positive ‘this patient has been diagnosed with cancer’ value.

The new data structure in cancer occurrences implies this value, but we’re not sure what future use-cases this might have.

So the question is how you’d like us to use this value. Do you want us to include 525972260:
- in the cancerOccurrence data?
- in the participant data?
- in both?
- in neither?

Thanks!

brotzmanmj commented 10 months ago

Hi @JoeArmani, It's a good question. So perhaps this decision changed today but I think when we discussed last week, the Occurrences table was going to be built to be adaptable to include other illness and not just cancer, in which case the question might not be implied. Was it decided today that the table would just be CancerOccurrences? That might impact the answer to your question.

JoeArmani commented 10 months ago

Hi @brotzmanmj, That makes sense. It was decided that this collection will specifically be cancerOccurrences. I believe this is due to the data team's BQ needs.

My understanding is the following: Since the data points in the cancer occurrence object are specific to cancer occurrences, adding other types of occurrence data with different specifications would complicate the tables the data team needs to work with. Each additional occurrence type would widen the table and leave many fields unpopulated. I think it benefits the data team to have separate tables for each occurrence category in the future.

I'm going to request input from @jacobmpeters on this as well.

brotzmanmj commented 10 months ago

That makes sense to me. In that case, the sites will only send data to the CancerOccurrence table for participants with cancer occurrences known to the site, so it would make sense to me that the answer is implied and that they do not need to send 525972260 ("Was the Connect participant diagnosed with cancer?") as part of the cancer occurrence table data. However, the list of variables requested for RCA was developed by Marie Josephe, so I will defer to her. I'll email her and copy you. As far as the Participants table is concerned, I think it's not necessary to have this variable there either, because in the next few months we will change the SMDB to interact with the CancerOccurrences table.

jacobmpeters commented 10 months ago

@brotzmanmj @jacobmpeters

@FrogGirl1123 communicated the suggestion to make the collection a generic "occurrences" collection, so that future types of occurrences other than cancer diagnoses could be included in it. I mentioned that if new variables were added for each type of occurrence, this would lead to an increasingly wide table with empty cells.

After some discussion, we landed on the decision to just have a "cancerOccurrence" collection for this use case.

JoeArmani commented 10 months ago

Perfect, thanks @brotzmanmj @jacobmpeters @FrogGirl1123.

Marie Josephe, I'm not sure how to @ you here. I'll be documenting this today and early tomorrow. Please let me know if there are any changes/clarifications/etc. needed, and I'm available on Teams. Thanks.

FrogGirl1123 commented 10 months ago

From Marie Josephe in email yesterday "Mia envisioned this as a way to confirm that the transmitted data is indeed on a cancer diagnosis, and not on a preliminary report of suspected cancer. It is duplicative, and will not be informative on our end; it is a memory aid for the sites. "

So 525972260 will remain as a question and in its current location in the cancerOccurrence table documented here: https://nih.app.box.com/file/1393159438842?s=3pemvj2oqylx4c1eez3n8oszqfmilf73.

JoeArmani commented 10 months ago

Perfect. Thanks @FrogGirl1123.

JoeArmani commented 10 months ago

Update: Development is moving forward as of 12/19/23.

Here's the link to the updated details and data structure: https://nih.app.box.com/file/1393159438842?s=3pemvj2oqylx4c1eez3n8oszqfmilf73. It details the POST request from sites and the resulting 'cancerOccurrence' data. I'll provide detailed documentation for sites upon completion.

Thanks to everyone for working together to get the updated requirements confirmed.

JoeArmani commented 10 months ago

PR ready for review: https://github.com/episphere/connectFaas/pull/496

Updated documentation:
General technical info: https://nih.app.box.com/file/1393159438842
Examples for Sites: https://nih.app.box.com/file/1404701245622

jhflorey commented 9 months ago

Approved PR https://github.com/episphere/connectFaas/pull/496

anthonypetersen commented 9 months ago

Reviewed & Approved

JoeArmani commented 9 months ago

This is in dev and ready for testing. @cunnaneaq will be reaching out to sites for testing shortly.

cunnaneaq commented 9 months ago

Update on decisions made 1/18/24 via email (@FrogGirl1123 @brotzmanmj @mnataraj92 @JoeArmani)

@JoeArmani will add the technical specs when ready

JoeArmani commented 9 months ago

New requirements PR ready for review here: https://github.com/episphere/connectFaas/pull/509 I'll update the documentation shortly.

brotzmanmj commented 9 months ago

@JoeArmani @mnataraj92 @cunnaneaq Decision made 1/24/24 by Mia and Nico: Eliminate preliminary cancer stage (CID 457270069, RCAOcc_PrelimCancStg_v1r0) from RCA for the current release. (Further detail: We will add related variables to the RCA (including stage and death) in the next few release cycles after we are able to better understand the development of “suspense file” and how to capture the evolution of stage and likely other characteristics of the cancer).

JoeArmani commented 9 months ago

@brotzmanmj @mnataraj92 @cunnaneaq I'll have a PR in with these changes in a bit, will update when it's in dev and ready for testing.

JoeArmani commented 9 months ago

PR for review: https://github.com/episphere/connectFaas/pull/518

JoeArmani commented 9 months ago

This is in dev and ready for testing.

Updated documentation:
General technical info: https://nih.app.box.com/file/1393159438842
Examples for Sites: https://nih.app.box.com/file/1404701245622

brotzmanmj commented 9 months ago

@cunnaneaq is moving forward with dev testing with Vijay.

For future reference, in addition to considerations of vital status and cancer stage mentioned above, we will need a decision on whether we need a variable for manual review to confirm that the participant has been informed of their diagnosis. This will be for a later prod release, not the current one.

cunnaneaq commented 9 months ago

Tracking the additional RCA variables under consideration for a future release here: https://github.com/episphere/connect/issues/866