HealthRex / CDSS

Clinical decision support system by the HealthRex laboratory at Stanford University.
GNU General Public License v3.0

SHC_Core_2023 + LPCH_Core_2023 data updates #399

Closed: jonc101 closed this issue 3 months ago

jonc101 commented 7 months ago

Migrate over this data from Research IT to our secure compute databases.

- [ ] Rename the datasets/tables to remove the shc and lpch prefixes, so that the naming convention matches prior years (instead just store them in separate shc_core_2023 and lpch_core_2023 databases/datasets)
- [ ] Add UTC version of all datetimes
- [ ] Extract numerical values from flowsheets (see the sketch below)
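No code for the flowsheet step appears later in this thread, so the following is only an illustrative sketch of pulling numeric values out of a string-valued measurement column with REGEXP_EXTRACT and SAFE_CAST. The `flowsheet` table name and the `meas_value` column are hypothetical placeholders, not the actual schema.

```python
# Hypothetical sketch: derive a numeric column from a string-valued flowsheet
# measurement column. Table and column names are placeholders, not the real schema.
from google.cloud import bigquery

project_id = 'som-nero-phi-jonc101'
dataset_name = 'shc_core_2023'

client = bigquery.Client(project=project_id)

query = rf"""
CREATE OR REPLACE TABLE `{project_id}.{dataset_name}.flowsheet` AS
SELECT
    *,
    -- REGEXP_EXTRACT strips surrounding text (e.g. units); SAFE_CAST returns
    -- NULL instead of erroring when no numeric value can be parsed.
    SAFE_CAST(REGEXP_EXTRACT(meas_value, r'[-+]?\d*\.?\d+') AS FLOAT64) AS meas_value_numeric
FROM `{project_id}.{dataset_name}.flowsheet`
"""

client.query(query).result()  # wait for the table rewrite to finish
```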

jyx-su commented 7 months ago

Just finished the first renaming. Here's the code I used for generating the SQL commands:

```python
# List the shc-prefixed tables in the dataset (notebook shell escape), then
# generate one ALTER TABLE ... RENAME TO statement per table.
!bq ls shc_core2023 | grep '^ shc' > tables_to_rename.txt

prefix = 'shc_'

with open('tables_to_rename.txt', 'r') as f:
    for line in f:
        table_name = line.strip().split(' ')[0].strip()
        print(table_name)
        print(f'ALTER TABLE shc_core_2023.{table_name} RENAME TO {table_name[len(prefix):]};')
```
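The snippet above only prints the rename statements. One way to actually apply them, sketched here under the assumption of application-default credentials and the google-cloud-bigquery client (not necessarily how it was run), is to execute each statement from the same loop:

```python
# Sketch: run the generated rename statements directly instead of printing them.
from google.cloud import bigquery

client = bigquery.Client(project='som-nero-phi-jonc101')
prefix = 'shc_'

with open('tables_to_rename.txt', 'r') as f:
    for line in f:
        table_name = line.strip().split(' ')[0].strip()
        if not table_name:
            continue
        statement = (
            f'ALTER TABLE shc_core_2023.{table_name} '
            f'RENAME TO {table_name[len(prefix):]}'
        )
        client.query(statement).result()  # block until the rename completes
        print(f'Renamed {table_name} -> {table_name[len(prefix):]}')
```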
fatemeh91 commented 7 months ago

The second renaming is finished. The code snippet used for this step has been added as Item 2 in

https://github.com/HealthRex/CDSS/blob/master/setup/BigQueryDataUpdateGuide.MD

jonc101 commented 7 months ago

Great. Rather than put all the code in the issue, keep it in the actual code base and just link to the files from here.

That will help us find and collate the code when we need to do this again in the next iteration.

Thanks,

Jonathan

From: Fatemeh Amrollahi
Sent: Friday, February 9, 2024 7:55 PM
Subject: Re: [HealthRex/CDSS] SHC_Core_2023 + LPCH_Core_2023 data updates (Issue #399)

The second renaming is finished. Here is the code snippet that I used.

```python
from google.cloud import bigquery
from google.cloud.bigquery import dbapi
import subprocess
import json
import pdb
import sys


def All_TimeStamp_Columns(table_name, dataset_name='shc_core_2023', project_id='som-nero-phi-jonc101'):
    """Extract all the columns of the table that have a DATETIME type."""
    command = f'bq show --schema --format=prettyjson {project_id}:{dataset_name}.{table_name}'
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()

    # command was successful
    if process.returncode == 0:
        datetimescols = []
        schema = json.loads(stdout)
        for field in schema:
            if field['type'] == 'DATETIME':
                datetimescols.append(field['name'])
        return datetimescols
    else:
        print(f'Error running bq command: {stderr.decode("utf-8")}')
        return None


def All_Available_tables(dataset_name='shc_core_2023', project_id='som-nero-phi-jonc101'):
    """Get all the table names available in the dataset."""
    command = f'bq ls --format=json {project_id}:{dataset_name}'
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()

    # command was successful
    if process.returncode == 0:
        table_names = []
        schema = json.loads(stdout)
        for field in schema:
            if field['type'] == 'TABLE':
                table_names.append(field['tableReference']['tableId'])
        return table_names
    else:
        print(f'Error running bq command: {stderr.decode("utf-8")}')
        return None


def Time_to_UTC(columns, table_name, dataset_name='shc_core_2023', project_id='som-nero-phi-jonc101'):
    """Rewrite the table in place, adding a *_utc TIMESTAMP column for each DATETIME column."""

    def Auth(project_id='som-nero-phi-jonc101'):
        client = bigquery.Client(project_id)
        return client

    client = Auth()
    conn = dbapi.connect(client)
    cursor = conn.cursor()
    new_table_name = table_name

    datetime_col = ','.join(map(str, columns))

    query = f'CREATE OR REPLACE TABLE {project_id}.{dataset_name}.{new_table_name} AS SELECT * '

    # add a UTC column for each DATETIME column
    for col in columns:
        new_col = col + '_utc'
        query = query + f",\n timestamp(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S',timestamp({col})), 'America/Los_Angeles') as {new_col}"

    query += f'\n from {project_id}.{dataset_name}.{table_name};'

    cursor.execute(query)
    results = cursor.fetchall()
    for row in results:
        print(row)
    cursor.close()
    conn.close()


if __name__ == '__main__':
    project_id = 'som-nero-phi-jonc101'
    dataset_name = 'shc_core_2023'
    table_names = All_Available_tables(dataset_name, project_id)

    for table in table_names:
        datetimecolumns = All_TimeStamp_Columns(table_name=table, dataset_name=dataset_name, project_id=project_id)
        if len(datetimecolumns) > 0:
            Time_to_UTC(columns=datetimecolumns, table_name=table, dataset_name=dataset_name, project_id=project_id)
```
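As a side note (not part of the original snippet), the same DATETIME-column lookup could be done with the BigQuery Python client instead of shelling out to the bq CLI; a minimal sketch:

```python
# Sketch: client-library equivalent of All_TimeStamp_Columns / All_Available_tables,
# avoiding the subprocess calls to the bq CLI.
from google.cloud import bigquery


def datetime_columns(client, dataset_name, table_name):
    # Fetch the table schema and keep only DATETIME fields.
    table = client.get_table(f'{client.project}.{dataset_name}.{table_name}')
    return [field.name for field in table.schema if field.field_type == 'DATETIME']


def available_tables(client, dataset_name):
    # List every table id in the dataset, skipping views.
    return [t.table_id for t in client.list_tables(dataset_name) if t.table_type == 'TABLE']


if __name__ == '__main__':
    client = bigquery.Client(project='som-nero-phi-jonc101')
    for table_name in available_tables(client, 'shc_core_2023'):
        print(table_name, datetime_columns(client, 'shc_core_2023', table_name))
```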


jonc101 commented 3 months ago

- [ ] Separate out lpch_core2023

Looks like a bunch of "lpch*" tables are currently under the shc_core2023 dataset/database. These are the Lucile Packard Children's Hospital data and should be separate (just like they were labeled as shc* tables before).

Move these out to a separate "lpch_core_2023" dataset/database to match.
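One possible way to carry out the move, sketched with the Python client under the assumptions that the lpch_core_2023 dataset already exists, that the lpch_ prefix should be dropped in the process (per the earlier checklist item), and that copy-then-delete is acceptable:

```python
# Sketch: copy every table whose id starts with "lpch" from shc_core_2023 into
# lpch_core_2023 (dropping the "lpch_" prefix), then delete the original.
from google.cloud import bigquery

project_id = 'som-nero-phi-jonc101'
src_dataset = 'shc_core_2023'
dst_dataset = 'lpch_core_2023'  # assumed to exist already
prefix = 'lpch_'

client = bigquery.Client(project=project_id)

for item in client.list_tables(src_dataset):
    if not item.table_id.startswith('lpch'):
        continue
    src = f'{project_id}.{src_dataset}.{item.table_id}'
    new_name = item.table_id[len(prefix):] if item.table_id.startswith(prefix) else item.table_id
    dst = f'{project_id}.{dst_dataset}.{new_name}'
    client.copy_table(src, dst).result()  # copy job, blocks until done
    client.delete_table(src)              # remove the original from shc_core_2023
    print(f'Moved {src} -> {dst}')
```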