AMP-SCZ / lochness

Download your data to a data lake.

Recap data pull discussion #10

Open kcho opened 3 years ago

kcho commented 3 years ago

lochness.redcap pulls all available data from REDCap into a json file

Problems

1. A daily pull of the data for all subjects may put too much load on the REDCap server

2. Extensive work is required on the logbook side to select and extract the data from the json dump for visualization in DPDash

Solutions

tashrifbillah commented 3 years ago

Hi @kcho and @sbouix , let's continue the discussion here.

By Kevin (edited by Tashrif):

I found a “Data Entry Trigger” function in REDCap. Whenever a record is modified or updated, it sends a POST signal with a bunch of information to a dedicated server. If the major problem in pulling all the data on a daily basis is overloading the REDCap server, do you think implementing the “Data Entry Trigger” and connecting it to lochness would be a solution (or overkill)?

[screenshot: the "Data Entry Trigger" option under REDCap's "Enable optional modules and customizations"]

Suggested workflow:

This would solve the REDCap server problem and we would be able to keep all of the up-to-date REDCap data in lochness.

tashrifbillah commented 3 years ago

Okay, here is my modified workflow:

The last three steps could be done by a cron-like bot.

sbouix commented 3 years ago

To add to the agenda: the ability to detect tags for particular variables.

kcho commented 3 years ago

Thanks for this @tashrifbillah

Could you set up a URL under https://predict.bwh.harvard.edu/ so it can catch the POST signal from the REDCap Data Entry Trigger, please?

Or, if we have any other publicly open ports among the PNL servers, please let me know. I'll test receiving the signal.

sbouix commented 3 years ago

The only two externally facing servers I know of are hcpep-xnat and our web server. Predict is behind the firewall.

tashrifbillah commented 3 years ago

Hi Kevin, do you know of a tutorial that I can go through to learn to upload a file to REDCap? I need to be able to upload, trigger, and listen independently to be able to set up such a thing. Also, where did you get the screenshot? If writing is hard, an MS Teams call works for me.

tashrifbillah commented 3 years ago

Is this the function I need?

kcho commented 3 years ago

> Hi Kevin, do you know of a tutorial that I can go through to learn to upload a file to REDCap? I need to be able to upload, trigger, and listen independently to be able to set up such a thing.

I have not uploaded a file before, but I would suggest looking at the API playground and trying the "Import File" API method. The API doc is here: https://redcap.partners.org/redcap/api/help
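
For reference, a minimal sketch of what that "Import File" call could look like with the requests library; the token, record ID, and field name below are placeholders, not values from this project.

```python
# Hedged sketch of REDCap's "Import File" API method via requests.
# The token, record ID, and field name are placeholders.
import requests

API_URL = 'https://redcap.partners.org/redcap/api/'

def import_file(token: str, record: str, field: str, path: str) -> None:
    data = {
        'token': token,        # project-specific API token
        'content': 'file',
        'action': 'import',
        'record': record,      # record to attach the file to
        'field': field,        # name of the file-upload field
        'returnFormat': 'json',
    }
    with open(path, 'rb') as f:
        response = requests.post(API_URL, data=data, files={'file': f})
    response.raise_for_status()

import_file('MY_API_TOKEN', '100111111', 'upload_field', 'example.pdf')
```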

> Also, where did you get the screenshot?

The screenshot is from REDCap: "Project Setup" -> "Enable optional modules and customizations".

kcho commented 3 years ago

Quickly tested to see if REDCap sends the signal to an open server. The parameters below are sent to the server. I think it can act as a very useful logging system.

I'll bring this up in our next meeting, so we can discuss how we can include this.

redcap_url=https%3A%2F%2Fredcap.partners.org%2Fredcap%2F&project_url=https%3A%2F%2Fredcap.partners.org%2Fredcap%2Fredcap_v10.0.30%2Findex.php%3Fpid%3D26709&project_id=26709&username=kc244&record=100111111&instrument=adverse_events_ae&adverse_events_ae_complete=0
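
For illustration, a minimal sketch (not the actual lochness code) of a server that could catch this Data Entry Trigger POST and decode the form-encoded parameters shown above; the host and port are placeholders.

```python
# Sketch of a listener for the Data Entry Trigger POST signal;
# not the actual lochness implementation, host/port are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

class DETHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length).decode()
        # form-encoded payload: project_id, username, record, instrument, ...
        params = {k: v[0] for k, v in parse_qs(body).items()}
        print(params)
        self.send_response(200)
        self.end_headers()

HTTPServer(('0.0.0.0', 8080), DETHandler).serve_forever()
```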

tashrifbillah commented 3 years ago

> 2. Extensive work is required on the logbook side to select and extract the data from the json dump for visualization in DPDash

how many fields are there?

The HCP-EP survey I am working with has 915 fields in each of the six instruments, a.k.a. surveys.

will the fields be changed at any point in the study?

The fields are the same across the six instruments, so they should be consistent across the study.

kcho commented 3 years ago

@sbouix @tashrifbillah I thought about the architecture below for what we discussed yesterday about the REDCap data pulling. I think there were two main problems we discussed yesterday: one is PII and the other is server overloading. Below is my suggestion, please let me know what you think. I'll start working on them soon.

Proposed REDCap pulling architecture

PII part

  1. lochness.redcap pulls all data from the REDCap server to PROTECTED/survey/raw/ABCD01.json

  2. Save a json - data free of PII

    • lochness.redcap (or predict_pii.redcap or logbook.redcap)
    • from PROTECTED/survey/raw/ABCD01.json remove all PII fields
      • using the REDCap "PII" tags (need to review how we can pull this information)
    • and save it in GENERAL/survey/raw/ABCD01.json
  3. Save another json - data with the PII values replaced with pseudo-random strings

    • lochness.redcap (or predict_pii.redcap or logbook.redcap)
    • process PII fields in PROTECTED/survey/raw/ABCD01.json and save it in PROTECTED/survey/processed/ABCD01.json
    • copy PROTECTED/survey/processed/ABCD01.json to GENERAL/survey/processed/ABCD01.json (a sketch of steps 2 and 3 follows below)
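
A rough sketch of steps 2 and 3, assuming a flat {field: value} json per subject and an already-known list of PII field names (both assumptions; the real PII list would come from the REDCap tags or a lookup table):

```python
# Sketch of PII handling steps 2 and 3; "pii_fields" and the flat
# {field: value} json layout are assumptions, not the project's schema.
import json
import secrets

pii_fields = ['name', 'dob', 'phone']  # hypothetical PII field list

with open('PROTECTED/survey/raw/ABCD01.json') as f:
    raw = json.load(f)

# Step 2: remove PII fields entirely and save under GENERAL/survey/raw
no_pii = {k: v for k, v in raw.items() if k not in pii_fields}
with open('GENERAL/survey/raw/ABCD01.json', 'w') as f:
    json.dump(no_pii, f)

# Step 3: replace PII values with pseudo-random strings and save the
# processed copy
masked = {k: (secrets.token_hex(8) if k in pii_fields else v)
          for k, v in raw.items()}
with open('PROTECTED/survey/processed/ABCD01.json', 'w') as f:
    json.dump(masked, f)
```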

REDCap server overloading problem part

  1. before pulling any data from REDCap, lochness.redcap checks for files under PROTECTED/survey/raw

    • if there is ABCD01.json already
      • check the db, which is updated live by listening to the POST signal from the REDCap Data Entry Trigger
      • if ABCD01 is in the db, execute the download
      • if ABCD01 is not in the db, skip the download
  2. repeat PII part above

  3. in the lochness-to-lochness transfer, changes to ABCD01.json should be detected by sha1 / other hash methods so that only the updated data is pulled (see the sketch below)
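
A small sketch of the hash-based change detection mentioned in step 3 (sha1 here; the paths are illustrative):

```python
# Sketch of sha1-based change detection for the lochness-to-lochness
# transfer; paths are illustrative.
import hashlib

def sha1_of(path: str) -> str:
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

# Pull only when the source copy differs from the local mirror.
if sha1_of('source/ABCD01.json') != sha1_of('mirror/ABCD01.json'):
    print('ABCD01.json changed; pull the update')
```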

tashrifbillah commented 3 years ago

What is the distinction between points 2 and 3 under PII Part?

kcho commented 3 years ago

> What is the distinction between points 2 and 3 under PII Part?

Sorry - edited it a bit. Point 2 is for saving a json in GENERAL with the PII fields removed entirely. Point 3 is for saving a json in GENERAL with the PII fields replaced by pseudo-random strings.

sbouix commented 3 years ago

Let's concentrate on REDCap server overloading first.

The PII masking is more complex: some variables can be deleted (e.g. name), others replaced by another variable (e.g. birthdate -> age in years). I am not sure we should have two copies of pretty much the same thing (raw vs processed). Also, because I would like to import the anonymized data into MGB REDCap, we should figure out how that will be affected by (2) vs (3). Finally, we may be better off having a table with a list of PII variables as input rather than trying to extract the tags from REDCap.
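
For concreteness, the two masking styles could look like this (field names illustrative, not the project's variable names):

```python
# Sketch of the two masking styles: delete outright vs replace by a
# derived variable; field names are illustrative.
from datetime import date

record = {'name': 'Jane Doe', 'birthdate': '1990-05-01', 'score': 42}

record.pop('name', None)                  # delete (e.g. name)
bd = date.fromisoformat(record.pop('birthdate'))
today = date.today()
# replace birthdate by age in years
record['age'] = (today.year - bd.year
                 - ((today.month, today.day) < (bd.month, bd.day)))
```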

sbouix commented 3 years ago

For the lochness-to-lochness transfer, I also think DataLad might be useful. Something to discuss with Chris and Mathias on Friday.

tashrifbillah commented 3 years ago

Hi @kcho , did you try making a workstation listen to the REDCap signal yet? If you haven't, I can try that for my entertainment outside of the DPDash crisscross ;)

kcho commented 3 years ago

> Hi @kcho , did you try making a workstation listen to the REDCap signal yet? If you haven't, I can try that for my entertainment outside of the DPDash crisscross ;)

I haven't tried it on the workstation yet, but I've drafted a command-line tool and a module in lochness.redcap for listening to the POST signal from the REDCap server: https://github.com/PREDICT-DPACC/lochness/blob/devel/kcho/redcap_new_arch/scripts/listen_to_redcap.py

kcho commented 3 years ago

> Let's concentrate on REDCap server overloading first.

The model shown below has been uploaded to devel/kcho/redcap_new_arch: https://github.com/PREDICT-DPACC/lochness/compare/master...PREDICT-DPACC:devel/kcho/redcap_new_arch

To do

Figure

[image: proposed Data Entry Trigger architecture]

Summary

1. Make a database from the POST signals from the REDCap Data Entry Trigger

timestamp          project_id  redcap_username  record       instrument
1617823322.701979  26709       kc244            subject0002  inclusionexclusion
1617823322.711633  26709       kc244            subject0001  inclusionexclusion

2. lochness.redcap checks for any updates in the Data Entry Trigger database before executing the data pull
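
One possible shape for step 1, appending each captured POST signal to the DET-DB csv (a hypothetical helper the listener sketch above could call; it assumes the csv's header row already exists):

```python
# Hypothetical helper the DET listener could call for each POST signal;
# column order follows the table above, header row assumed to exist.
# Note the POST payload key is "username"; the DET-DB column is
# "redcap_username".
import csv
import time

def record_signal(params: dict, db_path: str = 'det_db.csv') -> None:
    with open(db_path, 'a', newline='') as f:
        csv.writer(f).writerow([
            time.time(),                    # timestamp of the POST signal
            params.get('project_id', ''),
            params.get('username', ''),
            params.get('record', ''),
            params.get('instrument', ''),
        ])
```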

tashrifbillah commented 3 years ago

In the "check DET-DB" -> "recent update" step, do you plan to compare checksums like mediaflux does? Here are nipype ways of computing a checksum:

kcho commented 3 years ago

> In the "check DET-DB" -> "recent update" step, do you plan to compare checksums like mediaflux does? Here are nipype ways of computing a checksum:

Since the Data Entry Trigger Database (DET-DB) is a CSV file containing all the REDCap field updates and the timestamp of each POST signal, I compare the last-modified date of the already existing json file against the last update captured in the DET-DB for each subject (if the subject exists in the DET-DB).
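
Roughly, that comparison could look like the sketch below (pandas-based; the paths and subject ID are illustrative):

```python
# Sketch of the DET-DB check: pull only if the DET-DB shows an update
# newer than the local json. Paths and subject ID are illustrative.
import os
import pandas as pd

# columns: timestamp, project_id, redcap_username, record, instrument
det_db = pd.read_csv('det_db.csv')
subject = 'ABCD01'
json_path = f'PROTECTED/survey/raw/{subject}.json'

rows = det_db[det_db['record'] == subject]
if not os.path.exists(json_path):
    pull = True        # nothing downloaded yet; pull everything
elif rows.empty:
    pull = False       # no DET activity for this subject; skip
else:
    # compare the newest DET timestamp to the json's last-modified time
    pull = rows['timestamp'].max() > os.path.getmtime(json_path)
```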

tashrifbillah commented 3 years ago

Hi @kcho , is it expecting an empty csv file?

kcho commented 3 years ago

> Hi @kcho , is it expecting an empty csv file?

It's expecting the path of the DET-DB csv file. If the csv already exists, the live-capture server will append new information to the existing csv file.

tashrifbillah commented 3 years ago

Currently, how is it being programmed -- listen_to_redcap.py running sync.py --source redcap, sort of?

kcho commented 3 years ago

Currently, the two python scripts have to be executed separately. I just realized it could be useful to design it following your comment.

> listen_to_redcap.py running sync.py --source redcap, sort of?

Any downside to doing this? Programmatically, how would you spin out sync.py to run continuously while also continuously running listen_to_redcap.py from a single execution? The multiprocessing module?
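
One possible answer, sketched with the multiprocessing module: the script names come from this thread, and the sync interval is an assumption.

```python
# Sketch of running the listener and a periodic sync from one entry
# point; script names come from this thread, the interval is assumed.
import subprocess
import time
from multiprocessing import Process

def run_listener():
    subprocess.run(['python', 'listen_to_redcap.py'])

def run_sync_loop(interval_sec: int = 3600):
    while True:
        subprocess.run(['python', 'sync.py', '--source', 'redcap'])
        time.sleep(interval_sec)

if __name__ == '__main__':
    Process(target=run_listener).start()
    Process(target=run_sync_loop).start()
```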

tashrifbillah commented 3 years ago

> The multiprocessing module?

It should be a chained process -- the trigger comes first and then the pull. We shall discuss more during our Monday brainstorming session.

By the way, do we have access to @sbouix 's presentation on what data reside on which platforms? I am trying to understand which platforms should trigger data entry signals. I understand that for PRoNET it would be REDCap. What would that be for PRESCIENT?

sbouix commented 3 years ago

The primary database system for PRESCIENT will be RPMS (Research Project Management System). It is custom-built by the Orygen team and doesn't have the extensive documentation or API functionality of REDCap. We're working to get access to their IT infrastructure to set up a development environment and start developing the Lochness RPMS module.