AMP-SCZ / lochness

Download your data to a data lake.

Recap data pull discussion #10

Open kcho opened 3 years ago

kcho commented 3 years ago

lochness.redcap pulls all available data from REDCap into a json file

Problems

1. A daily pull of the data for all subjects may put too much load on the REDCap server

2. Extensive work is required on the logbook side to select and extract the data from the json dump for visualization in DPDash

Solutions

tashrifbillah commented 3 years ago

Hi @kcho and @sbouix , let's continue the discussion here.

By Kevin (edited by Tashrif):

I found a “Data Entry Trigger” function in REDCap. Whenever a record is modified or updated, it sends a POST signal with a bunch of information to a dedicated server. If the major problem in pulling all the data on a daily basis is overloading the REDCap server, do you think implementing the “Data Entry Trigger” and connecting it to lochness would be a solution (or overkill)?

[screenshot: the "Data Entry Trigger" option under REDCap's "Enable optional modules and customizations"]

Suggested workflow:

This would solve the REDCap server problem and we would be able to keep all of the up-to-date REDCap data in lochness.

tashrifbillah commented 3 years ago

Okay, here is my modified workflow:

The last three steps could be done by a cron-like bot.

sbouix commented 3 years ago

To add to the agenda: the ability to detect tags for particular variables.

kcho commented 3 years ago

Thanks for this @tashrifbillah

Could you set up a URL under https://predict.bwh.harvard.edu/ so it can catch the POST signal from the REDCap Data Entry Trigger, please?

Or, if we have any other publicly open ports among the PNL servers, please let me know. I'll test receiving the signal.

sbouix commented 3 years ago

The only two externally facing servers I know of are hcpep-xnat and our web server. Predict is behind the firewall.

tashrifbillah commented 3 years ago

Hi Kevin, do you know of a tutorial that I can go through to learn to upload a file to REDCap? I need to be able to upload, trigger, and listen independently to be able to set up such a thing. Also, where did you get the screenshot? If writing is hard, an MS Teams call works for me.

tashrifbillah commented 3 years ago

Is this the function I need?

kcho commented 3 years ago

> Hi Kevin, do you know of a tutorial that I can go through to learn to upload a file to REDCap? I need to be able to upload, trigger, and listen independently to be able to set up such a thing.

I have not uploaded a file before, but I would suggest looking at the API playground and trying the "Import File" API method. The API doc is here: https://redcap.partners.org/redcap/api/help
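
For reference, a minimal sketch of what that "Import File" call could look like with the requests library; the token, record ID, and field name below are placeholders, not values from this project.

```python
# Hedged sketch of REDCap's "Import File" API method via requests.
# The token, record ID, and field name are placeholders.
import requests

API_URL = 'https://redcap.partners.org/redcap/api/'

def import_file(token: str, record: str, field: str, path: str) -> None:
    data = {
        'token': token,        # project-specific API token
        'content': 'file',
        'action': 'import',
        'record': record,      # record to attach the file to
        'field': field,        # name of the file-upload field
        'returnFormat': 'json',
    }
    with open(path, 'rb') as f:
        response = requests.post(API_URL, data=data, files={'file': f})
    response.raise_for_status()

import_file('MY_API_TOKEN', '100111111', 'upload_field', 'example.pdf')
```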

> Also, where did you get the screenshot?

The screenshot is from REDCap: "Project Setup" -> "Enable optional modules and customizations".

kcho commented 3 years ago

Quickly tested to see if REDCap sends the signal to an open server. The parameters below are sent to the server. I think it can act as a very useful logging system.

I'll bring this up in our next meeting, so we can discuss how we can include this.

redcap_url=https%3A%2F%2Fredcap.partners.org%2Fredcap%2F&project_url=https%3A%2F%2Fredcap.partners.org%2Fredcap%2Fredcap_v10.0.30%2Findex.php%3Fpid%3D26709&project_id=26709&username=kc244&record=100111111&instrument=adverse_events_ae&adverse_events_ae_complete=0
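
For illustration, a minimal sketch (not the actual lochness code) of a server that could catch this Data Entry Trigger POST and decode the form-encoded parameters shown above; the host and port are placeholders.

```python
# Sketch of a listener for the Data Entry Trigger POST signal;
# not the actual lochness implementation, host/port are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

class DETHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length).decode()
        # form-encoded payload: project_id, username, record, instrument, ...
        params = {k: v[0] for k, v in parse_qs(body).items()}
        print(params)
        self.send_response(200)
        self.end_headers()

HTTPServer(('0.0.0.0', 8080), DETHandler).serve_forever()
```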

tashrifbillah commented 3 years ago

> 2. Extensive work is required on the logbook side to select and extract the data from the json dump for visualization in DPDash

how many fields are there?

The HCP-EP survey I am working with has 915 fields in each of the six instruments, a.k.a. surveys.

will the fields be changed at any point in the study?

The fields are the same across the six instruments, so they should be consistent across the study.

kcho commented 3 years ago

@sbouix @tashrifbillah I thought about the architecture below for what we discussed yesterday about the REDCap data pulling. I think there were two main problems we discussed yesterday: one is PII and the other is server overloading. Below is my suggestion, please let me know what you think. I'll start working on them soon.

Proposed REDCap pulling architecture

PII part

  1. lochness.redcap pulls all data from the REDCap server to PROTECTED/survey/raw/ABCD01.json

  2. Save a json - data free of PII

    • lochness.redcap (or predict_pii.redcap or logbook.redcap)
    • from PROTECTED/survey/raw/ABCD01.json remove all PII fields
      • using the REDCap "PII" tags (need to review how we can pull this information)
    • and save it in GENERAL/survey/raw/ABCD01.json
  3. Save another json - data with the PII values replaced with pseudo-random strings

    • lochness.redcap (or predict_pii.redcap or logbook.redcap)
    • process PII fields in PROTECTED/survey/raw/ABCD01.json and save it in PROTECTED/survey/processed/ABCD01.json
    • copy PROTECTED/survey/processed/ABCD01.json to GENERAL/survey/processed/ABCD01.json (a sketch of steps 2 and 3 follows below)
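
A rough sketch of steps 2 and 3, assuming a flat {field: value} json per subject and an already-known list of PII field names (both assumptions; the real PII list would come from the REDCap tags or a lookup table):

```python
# Sketch of PII handling steps 2 and 3; "pii_fields" and the flat
# {field: value} json layout are assumptions, not the project's schema.
import json
import secrets

pii_fields = ['name', 'dob', 'phone']  # hypothetical PII field list

with open('PROTECTED/survey/raw/ABCD01.json') as f:
    raw = json.load(f)

# Step 2: remove PII fields entirely and save under GENERAL/survey/raw
no_pii = {k: v for k, v in raw.items() if k not in pii_fields}
with open('GENERAL/survey/raw/ABCD01.json', 'w') as f:
    json.dump(no_pii, f)

# Step 3: replace PII values with pseudo-random strings and save the
# processed copy
masked = {k: (secrets.token_hex(8) if k in pii_fields else v)
          for k, v in raw.items()}
with open('PROTECTED/survey/processed/ABCD01.json', 'w') as f:
    json.dump(masked, f)
```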

REDCap server overloading problem part

  1. before pulling any data from REDCap, lochness.redcap checks for files under PROTECTED/survey/raw

    • if there is ABCD01.json already
      • check the db, which is updated live by listening to the POST signal from the REDCap Data Entry Trigger
      • if ABCD01 is in the db, execute the download
      • if ABCD01 is not in the db, skip the download
  2. repeat PII part above

  3. in the lochness-to-lochness transfer, changes to ABCD01.json should be detected by sha1 / other hash methods so that only the updated data is pulled (see the sketch below)
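
A small sketch of the hash-based change detection mentioned in step 3 (sha1 here; the paths are illustrative):

```python
# Sketch of sha1-based change detection for the lochness-to-lochness
# transfer; paths are illustrative.
import hashlib

def sha1_of(path: str) -> str:
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

# Pull only when the source copy differs from the local mirror.
if sha1_of('source/ABCD01.json') != sha1_of('mirror/ABCD01.json'):
    print('ABCD01.json changed; pull the update')
```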

tashrifbillah commented 3 years ago

What is the distinction between points 2 and 3 under PII Part?

kcho commented 3 years ago

> What is the distinction between points 2 and 3 under PII Part?

Sorry - edited it a bit. Point 2 is for saving a json in GENERAL with the PII fields removed entirely. Point 3 is for saving a json in GENERAL with the PII fields replaced by pseudo-random strings.

sbouix commented 3 years ago

Let's concentrate on REDCap server overloading first.

The PII masking is more complex: some variables can be deleted (e.g. name), others replaced by another variable (e.g. birthdate -> age in years). I am not sure we should have two copies of pretty much the same thing (raw vs processed). Also, because I would like to import the anonymized data into MGB REDCap, we should figure out how that will be affected by (2) vs (3). Finally, we may be better off having a table with a list of PII variables as input rather than trying to extract the tags from REDCap.
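
For concreteness, the two masking styles could look like this (field names illustrative, not the project's variable names):

```python
# Sketch of the two masking styles: delete outright vs replace by a
# derived variable; field names are illustrative.
from datetime import date

record = {'name': 'Jane Doe', 'birthdate': '1990-05-01', 'score': 42}

record.pop('name', None)                  # delete (e.g. name)
bd = date.fromisoformat(record.pop('birthdate'))
today = date.today()
# replace birthdate by age in years
record['age'] = (today.year - bd.year
                 - ((today.month, today.day) < (bd.month, bd.day)))
```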

sbouix commented 3 years ago

For the lochness-to-lochness transfer, I also think DataLad might be useful. Something to discuss with Chris and Mathias on Friday.

tashrifbillah commented 3 years ago

Hi @kcho , did you try making a workstation listen to the REDCap signal yet? If you haven't, I can try that for my entertainment outside of the DPDash crisscross ;)

kcho commented 3 years ago

> Hi @kcho , did you try making a workstation listen to the REDCap signal yet? If you haven't, I can try that for my entertainment outside of the DPDash crisscross ;)

I haven't tried it on the workstation yet, but I've drafted a command-line tool and a module in lochness.redcap for listening to the POST signal from the REDCap server: https://github.com/PREDICT-DPACC/lochness/blob/devel/kcho/redcap_new_arch/scripts/listen_to_redcap.py

kcho commented 3 years ago

> Let's concentrate on REDCap server overloading first.

The model shown below has been uploaded to devel/kcho/redcap_new_arch: https://github.com/PREDICT-DPACC/lochness/compare/master...PREDICT-DPACC:devel/kcho/redcap_new_arch

To do

Figure

[image: proposed Data Entry Trigger architecture]

Summary

1. Make a database from the POST signals from the REDCap Data Entry Trigger

timestamp          project_id  redcap_username  record       instrument
1617823322.701979  26709       kc244            subject0002  inclusionexclusion
1617823322.711633  26709       kc244            subject0001  inclusionexclusion

2. lochness.redcap checks for any updates in the Data Entry Trigger database before executing the data pull
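
One possible shape for step 1, appending each captured POST signal to the DET-DB csv (a hypothetical helper the listener sketch above could call; it assumes the csv's header row already exists):

```python
# Hypothetical helper the DET listener could call for each POST signal;
# column order follows the table above, header row assumed to exist.
# Note the POST payload key is "username"; the DET-DB column is
# "redcap_username".
import csv
import time

def record_signal(params: dict, db_path: str = 'det_db.csv') -> None:
    with open(db_path, 'a', newline='') as f:
        csv.writer(f).writerow([
            time.time(),                    # timestamp of the POST signal
            params.get('project_id', ''),
            params.get('username', ''),
            params.get('record', ''),
            params.get('instrument', ''),
        ])
```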

tashrifbillah commented 3 years ago

In the "check DET-DB" -> "recent update" step, do you plan to compare checksums like mediaflux does? Here are nipype ways of computing a checksum:

kcho commented 3 years ago

> In the "check DET-DB" -> "recent update" step, do you plan to compare checksums like mediaflux does? Here are nipype ways of computing a checksum:

Since the Data Entry Trigger Database (DET-DB) is a CSV file containing all the REDCap field updates and the timestamp of each POST signal, I compare the last-modified date of the already existing json file against the last update captured in the DET-DB for each subject (if the subject exists in the DET-DB).
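
Roughly, that comparison could look like the sketch below (pandas-based; the paths and subject ID are illustrative):

```python
# Sketch of the DET-DB check: pull only if the DET-DB shows an update
# newer than the local json. Paths and subject ID are illustrative.
import os
import pandas as pd

# columns: timestamp, project_id, redcap_username, record, instrument
det_db = pd.read_csv('det_db.csv')
subject = 'ABCD01'
json_path = f'PROTECTED/survey/raw/{subject}.json'

rows = det_db[det_db['record'] == subject]
if not os.path.exists(json_path):
    pull = True        # nothing downloaded yet; pull everything
elif rows.empty:
    pull = False       # no DET activity for this subject; skip
else:
    # compare the newest DET timestamp to the json's last-modified time
    pull = rows['timestamp'].max() > os.path.getmtime(json_path)
```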

tashrifbillah commented 3 years ago

Hi @kcho , is it expecting an empty csv file?

kcho commented 3 years ago

> Hi @kcho , is it expecting an empty csv file?

It's expecting the path of the DET-DB csv file. If the csv already exists, the live-capture server will append new information to the existing csv file.

tashrifbillah commented 3 years ago

Currently, how is it being programmed -- listen_to_redcap.py running sync.py --source redcap, sort of?

kcho commented 3 years ago

Currently, the two python scripts have to be executed separately. I just realized it could be useful to design it following your comment.

> listen_to_redcap.py running sync.py --source redcap, sort of?

Any downside to doing this? Programmatically, how would you spin out sync.py to run continuously while also continuously running listen_to_redcap.py from a single execution? The multiprocessing module?
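
One possible answer, sketched with the multiprocessing module: the script names come from this thread, and the sync interval is an assumption.

```python
# Sketch of running the listener and a periodic sync from one entry
# point; script names come from this thread, the interval is assumed.
import subprocess
import time
from multiprocessing import Process

def run_listener():
    subprocess.run(['python', 'listen_to_redcap.py'])

def run_sync_loop(interval_sec: int = 3600):
    while True:
        subprocess.run(['python', 'sync.py', '--source', 'redcap'])
        time.sleep(interval_sec)

if __name__ == '__main__':
    Process(target=run_listener).start()
    Process(target=run_sync_loop).start()
```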

tashrifbillah commented 3 years ago

> The multiprocessing module?

It should be a chained process -- the trigger comes first and then the pull. We shall discuss more during our Monday brainstorming session.

By the way, do we have access to @sbouix 's presentation on what data reside on which platforms? I am trying to understand which platforms should trigger data entry signals. I understand that for PRoNET it would be REDCap. What would that be for PRESCIENT?

sbouix commented 3 years ago

The primary database system for PRESCIENT will be RPMS (Research Project Management System). It is custom-built by the Orygen team and doesn't have the extensive documentation or API functionality of REDCap. We're working to get access to their IT infrastructure to set up a development environment and start developing the Lochness RPMS module.