Google-Health / genomics-research

ml-based-copd preprocessing code #9

Closed: yudaleng closed this issue 10 months ago

yudaleng commented 10 months ago

Hi, can you provide the following preprocessing code for ml-based-copd? Thanks!

justincosentino commented 10 months ago

Hi @yudaleng, thanks for reaching out. Are you asking about the code to preprocess raw UKB spirometry data into flow-volume curves?

We have an example of converting the demo UKB spirometry blow showcased in field 3066 here: https://github.com/Google-Health/genomics-research/blob/main/regle/generate_ukb_3066_demo_dataset.py. This spirometry exhalation volume curve example is publicly available and can be downloaded following the instructions on this page: https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=3. Accessing the full dataset requires an application to and approval from the UK Biobank.

Note that the final formats for REGLE differ slightly from those used in ML-based COPD (npy files vs. a pickled pandas dataframe, respectively). The dataframe returned by derive_input_representations contains inputs in the expected ML-based COPD format:

trimmed_records_df = trim_records(UKB_3066_RECORDS)
base_curve_dfs = derive_base_curves(trimmed_records_df)
blow_curve_derived_df = derive_input_representations(base_curve_dfs)

For training, one would also need to concatenate additional label columns to that dataframe. Separate train, validation, and test pickles should then be written for each split according to the expected filepaths outlined in https://github.com/Google-Health/genomics-research/blob/7d8a8aff79b490657dfe66845ae60f4a9c2a000d/ml-based-copd/learning/dataset_util.py#L82.
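If helpful, here is a rough sketch of that step. To be clear, this is not our pipeline code: labels_df, the split column, and work_dir are hypothetical placeholders, and the actual output paths must match those expected by dataset_util.py.

import os

import pandas as pd

# Hypothetical sketch: join label columns onto the derived inputs and write
# one pickle per split. `labels_df` (indexed by eid and holding the label and
# split-assignment columns) and `work_dir` are placeholders.
dataset_df = blow_curve_derived_df.join(labels_df, on='eid')

work_dir = '/path/to/work_dir'
for split in ('train', 'validation', 'test'):
  split_df = dataset_df[dataset_df['split'] == split]
  split_df.to_pickle(os.path.join(work_dir, f'{split}.pkl'))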

Does that answer your question?

yudaleng commented 10 months ago

Yes, I have UKB access. Thank you very much for your reply! I previously used the generate_ukb_3066_demo_dataset.py script you provided to process field 3066. I now believe there is an issue with my COPD label processing; could you provide the processing files for the label columns? Have a nice day!

justincosentino commented 10 months ago

Great, I'm glad that generate_ukb_3066_demo_dataset.py worked for preprocessing the spirograms!

Can you describe the issues you are running into when generating your COPD labels in more detail? Unfortunately, our label generation code is tightly integrated with our internal infrastructure for ingesting and storing raw UKB data, so much of it won't be immediately useful if open-sourced. Depending on which columns you are having trouble with, I might be able to pull out a relevant or useful snippet for reference, or describe how the column was derived in more detail.

We have our binary COPD label definitions outlined in Supplementary Table 1 (web link, direct pdf link). I've reproduced it here for reference:

| Label | Definition | Usage |
| --- | --- | --- |
| Self report | Code 6 in field 6152 (medical conditions in touch screen questionnaires) or codes 1112, 1113, or 1472 in field 20002 (medical conditions in verbal interview). | Definition of the medical-record-based COPD label below, used in training. |
| Primary care | ICD-10 codes J41, J42, J43, or J44 in field 42040 (GP clinical event records), after the Read v2 and v3 codes in the records are mapped to ICD-10 codes. | Definition of the medical-record-based COPD label below, used in training. |
| Training hospitalization | Includes primary or secondary causes of hospitalization. ICD-9 codes 491, 492, or 496 in field 41271 (Diagnoses - ICD9), or ICD-10 codes J41, J42, J43, or J44 in field 41270 (Diagnoses - ICD10). | Definition of the medical-record-based COPD label below, used in training. |
| Medical-record-based | A COPD case in at least one of the "self report", "primary care", and "training hospitalization" COPD labels. | Training of ML models. |
| Evaluation medical-record-based | Logical OR of the "self report", "primary care", and "training hospitalization" labels, defined only when all three sources exist for an individual. | Evaluation of ML models and GWAS hits. Having all three sources increases the likelihood of a correct COPD label, which is preferred for evaluation. |
| Future hospitalization | Only includes cases with COPD as the primary cause of hospitalization after the spirometry test date. ICD-10 codes J41, J42, J43, or J44 in field 41234 (records in HES inpatient diagnoses dataset), after converting ICD-9 codes to ICD-10 codes. | Evaluation of ML models. |
| Death | ICD-10 codes J41, J42, J43, or J44 in field 40001 (primary cause of death). | Evaluation of ML models and GWAS hits. |
| Hospitalization | Similar to "future hospitalization" but also includes cases before the spirometry test date. | Evaluation of GWAS hits. |
| Proxy-GOLD | Mirrors moderate or worse GOLD grading for a single blow without bronchodilation: FEV1/FVC < 0.7 and FEV1 %predicted < 80%. | Evaluating the noisiness of the "medical-record-based" COPD training labels. Evaluation of binarized ML-based COPD liability. |
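As an illustration of the last row, proxy-GOLD can be computed along the following lines. This is a minimal sketch rather than our internal code; the fev1, fvc, and fev1_pct_pred column names are placeholders.

import pandas as pd

def proxy_gold(df: pd.DataFrame) -> pd.Series:
  # Moderate-or-worse GOLD grading for a single blow without bronchodilation:
  # FEV1/FVC < 0.7 and FEV1 % predicted < 80%. Returns 1.0/0.0/NaN.
  fev1_fvc_ratio = df['fev1'] / df['fvc']
  result = ((fev1_fvc_ratio < 0.7) & (df['fev1_pct_pred'] < 80.0)).astype(float)
  # Keep the label missing where either input is missing.
  missing = fev1_fvc_ratio.isna() | df['fev1_pct_pred'].isna()
  return result.mask(missing)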

Assuming you have extracted self report, primary care, and hospitalization label columns for all individuals into a copd_binary_df pandas dataframe as copd_sr_src, copd_gp_src, and copd_hesin_src, respectively, we then define the medical-record-based COPD and evaluation medical-record-based COPD as follows:

import numpy as np
import pandas as pd

def float_logical_or(df: pd.DataFrame, columns: list[str]) -> pd.Series:
  # Assumes the values are 0.0, 1.0, or NaN. If all columns are NaN,
  # the result is NaN; otherwise, it is the max of non-NaN values, which is
  # equivalent to the logical or for 0/1 values.
  return df[columns].max(axis=1)

# Define medical-record-based COPD as the OR of self report, primary care, and
# hospitalization labels.
copd_binary_df['copd_mrb'] = float_logical_or(
    copd_binary_df,
    [
        'copd_sr_src',
        'copd_gp_src',
        'copd_hesin_src',
    ],
)

# Define evaluation medical-record-based COPD as the OR of self report, primary
# care, and hospitalization labels over individuals with non-missing values for
# all three sources.
has_all_srcs_mask = (
    copd_binary_df[[
        'copd_sr_src',
        'copd_gp_src',
        'copd_hesin_src',
    ]]
    .notna()
    .all(axis=1)
)
copd_binary_df['copd_mrb_eval'] = np.where(
    has_all_srcs_mask,
    copd_binary_df['copd_mrb'],
    np.nan,
)

yudaleng commented 10 months ago

Sincerely thank you again for your reply!

This is my code to process the COPD label according to the table above; after processing, the label is unified in the copd column. Could you tell me whether I have processed it correctly:

import numpy as np
import pandas as pd

# Initialize the label source columns and the final label column.
data['copd_sr_src'] = 0
data['copd_gp_src'] = 0
data['copd_hesin_src'] = 0
data['copd'] = 0

# Self report
cols_6152 = data.filter(like='6152').columns
data.loc[data[cols_6152].eq(6).any(axis=1), 'copd_sr_src'] = 1

cols_20002 = data.filter(like='20002').columns
data.loc[data[cols_20002].isin([1112, 1113, 1472]).any(axis=1), 'copd_sr_src'] = 1

# Primary Care
cols_42040 = data.filter(like='42040').columns
data.loc[data[cols_42040].isin(['J41', 'J42', 'J43', 'J44']).any(axis=1), 'copd_gp_src'] = 1

# Training Hospitalization
cols_41271 = data.filter(like='41271').columns
cols_41270 = data.filter(like='41270').columns

data.loc[
    data[cols_41271].isin(['491', '492', '496']).any(axis=1)
    | data[cols_41270].isin(['J41', 'J42', 'J43', 'J44']).any(axis=1),
    'copd_hesin_src',
] = 1

def float_logical_or(df: pd.DataFrame, columns: list[str]) -> pd.Series:
    # Assumes the values are 0.0, 1.0, or NaN. If all columns are NaN,
    # the result is NaN; otherwise, it is the max of non-NaN values, which is
    # equivalent to the logical or for 0/1 values.
    return df[columns].max(axis=1)

# Define medical-record-based COPD as the OR of self report, primary care, and
# hospitalization labels.
data['copd_mrb'] = float_logical_or(
    data,
    [
        'copd_sr_src',
        'copd_gp_src',
        'copd_hesin_src',
    ],
)

# Define evaluation medical-record-based COPD as the OR of self report, primary
# care, and hospitalization labels over individuals with non-missing values for
# all three sources.
has_all_srcs_mask = (
    data[[
        'copd_sr_src',
        'copd_gp_src',
        'copd_hesin_src',
    ]]
    .notna()
    .all(axis=1)
)
data['copd'] = np.where(
    has_all_srcs_mask,
    data['copd_mrb'],
    np.nan,
)

Also, I would like to ask you a question. Following your paper, I took the raw data (n = 453,558), processed it, and discarded some of the data:

from tqdm import tqdm

# Discard a record if one of the FEV1, FVC, or PEF values is in the extreme
# tail (bottom 0.5%) or top 0.5% of the observations.
low_quantile = 0.005  # bottom 0.5%
high_quantile = 0.995  # top 0.5%
times_list = ['0.0', '0.1', '0.2']
for time in tqdm(times_list):
    fvc_name = '3062-' + time
    fev1_name = '3063-' + time
    pef_name = '3064-' + time

    fev1_low, fev1_high = data[fev1_name].quantile([low_quantile, high_quantile])
    fvc_low, fvc_high = data[fvc_name].quantile([low_quantile, high_quantile])
    pef_low, pef_high = data[pef_name].quantile([low_quantile, high_quantile])
    mask = (
        (data[fev1_name] < fev1_low) | (data[fev1_name] > fev1_high)
        | (data[fvc_name] < fvc_low) | (data[fvc_name] > fvc_high)
        | (data[pef_name] < pef_low) | (data[pef_name] > pef_high)
    )
    data.loc[mask, 'eid'] = np.nan
data = data.dropna(subset=['eid'])

# we deem a blow valid if the value recorded in Field 3061 is 0 or 32
field_id_list = ['3061']
for field_id in field_id_list:
    for time in tqdm(times_list):
        handle_name = field_id + '-' + time
        data_handle = data[['eid', handle_name]]
        data_handle = data_handle.dropna(subset=[handle_name])
        data_handle[handle_name] = pd.to_numeric(data_handle[handle_name], errors='coerce')
        discard_eid_list = data_handle.loc[~data_handle[handle_name].isin([0, 32]), 'eid'].tolist()
        for eid in discard_eid_list:
            change_name = "3066" + "-" + time
            data.loc[data['eid'] == eid, change_name] = np.nan
data = data.dropna(subset=['eid'])

I set the problematic data to NaN and uniformly remove the NaN rows at the end. I ended up with the processed data (n = 361,344).

After doing that, I did the final label processing with the code I gave you earlier (labels 0: 355,287; labels 1: 6,057) together with generate_ukb_3066_demo_dataset.py. At this point all I have left in my dataframe is eid, time, volume_pad_last, flow_pad_zero, flow_volume_pad_zero, and copd. I divided it into an 80% training set (n = 289,075) and a 20% test set (n = 72,269) via train_test_split(data, test_size=0.2, stratify=data['copd'], random_state=42). Finally, I packaged both the training set and the test set into pkl files.
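Concretely, the last step looks like this (the output file names here are placeholders):

from sklearn.model_selection import train_test_split

# Stratified 80/20 split on the copd label, then one pickle per split.
train_df, test_df = train_test_split(
    data, test_size=0.2, stratify=data['copd'], random_state=42
)
train_df.to_pickle('train.pkl')
test_df.to_pickle('test.pkl')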

I used time, volume_pad_last, and flow_pad_zero for training, and modified resnet18_fv_copd.py accordingly:

config.dataset_config = ml_collections.ConfigDict({
        'inputs': {
            'eid',
            'time',
            'volume_pad_last',
            'flow_pad_zero',
        },
        # The rest of the code remains unchanged
})

config.backbone_config = ml_collections.ConfigDict({
        'class_name': 'ResNet18',
        'kwargs': {
            'model_name': 'ResNet18',
            'input_names': 'time,volume_pad_last,flow_pad_zero',
            # The rest of the code is unchanged
        }
})

I started overfitting at about 70 rounds of training and got poor predictions. I know it must be my problem, but I don't really know which part I am doing wrong, and I would appreciate it if you could correct me!

justincosentino commented 10 months ago

> This is my code to process the COPD label according to the table above; after processing, the label is unified in the copd column. Could you tell me whether I have processed it correctly?

While the combination of sources looks correct, it's challenging to debug the definition for each source without knowing how your data dataframe is structured. As a quick sanity check, does your final label prevalence match that from Supplementary Table 36 (web link, direct pdf link)?

| Dataset | Split | n | MRB | Eval. MRB | GOLD | Hospitalization | Death |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Modeling | Train | 259746 | 0.0383 | 0.0473 | 0.0725 | 0.0075 | 0.0007 |
| Modeling | Validation | 65281 | 0.0388 | 0.0479 | 0.0720 | 0.0072 | 0.0007 |
| Fold 1 | Train | 128739 | 0.0384 | 0.0486 | 0.0711 | 0.0076 | 0.0007 |
| Fold 1 | Validation | 32310 | 0.0391 | 0.0477 | 0.0719 | 0.0073 | 0.0008 |
| Fold 2 | Train | 129691 | 0.0379 | 0.0456 | 0.0734 | 0.0072 | 0.0007 |
| Fold 2 | Validation | 32637 | 0.0381 | 0.0480 | 0.0714 | 0.0069 | 0.0006 |
| PRS | Holdout | 110739 | 0.0640 | 0.0748 | - | 0.0053 | 0.0021 |

Similarly, can you compare your distribution of COPD ICD-10 codes to Supplementary Table 35? Based on your reported label counts (355,287 controls and 6,057 cases), it looks like your prevalence is lower than expected. If those do not match, it's worth checking that you are capturing subcodes (i.e., 491.*, 492.*, and 496.* for ICD-9) and that you have mapped the UKB gp_clinical Read codes using the TRUD mappings (https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1834 and https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1835).
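For example (this is an illustration rather than our internal code), prefix matching along the following lines captures subcodes; it reuses the cols_41270 and cols_41271 selections from your snippet:

import pandas as pd

ICD9_COPD = ('491', '492', '496')
ICD10_COPD = ('J41', 'J42', 'J43', 'J44')

def any_code_startswith(df: pd.DataFrame, cols, prefixes) -> pd.Series:
  # Match code prefixes so that subcodes (e.g., 'J449' or '4912') are
  # captured, rather than exact `isin` matches. Tuple support in
  # Series.str.startswith requires a reasonably recent pandas.
  matched = df[cols].apply(lambda c: c.astype(str).str.startswith(prefixes))
  return matched.any(axis=1)

hesin_mask = (
    any_code_startswith(data, cols_41271, ICD9_COPD)
    | any_code_startswith(data, cols_41270, ICD10_COPD)
)
data.loc[hesin_mask, 'copd_hesin_src'] = 1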

I've included a breakdown of sample size and prevalence per label source below:

| label | split | #cases | #samples | prevalence | % samples w/label |
| --- | --- | --- | --- | --- | --- |
| copd_mrb | ALL | 13104 | 351152 | 0.0373 | 1.0000 |
| copd_sr_src | ALL | 5638 | 351040 | 0.0161 | 0.9997 |
| copd_gp_src | ALL | 2934 | 161154 | 0.0182 | 0.4589 |
| copd_hesin_src | ALL | 8512 | 289280 | 0.0294 | 0.8238 |
| copd_mrb_eval | ALL | 6227 | 133987 | 0.0465 | 0.3816 |
| copd_mrb | TRAIN | 9936 | 259746 | 0.0383 | 1.0000 |
| copd_sr_src | TRAIN | 4285 | 259707 | 0.0165 | 0.9998 |
| copd_gp_src | TRAIN | 2261 | 120425 | 0.0188 | 0.4636 |
| copd_hesin_src | TRAIN | 6477 | 214743 | 0.0302 | 0.8267 |
| copd_mrb_eval | TRAIN | 4749 | 100505 | 0.0473 | 0.3869 |
| copd_mrb | VAL | 2533 | 65281 | 0.0388 | 1.0000 |
| copd_sr_src | VAL | 1075 | 65276 | 0.0165 | 0.9999 |
| copd_gp_src | VAL | 571 | 30373 | 0.0188 | 0.4653 |
| copd_hesin_src | VAL | 1657 | 53776 | 0.0308 | 0.8238 |
| copd_mrb_eval | VAL | 1212 | 25281 | 0.0479 | 0.3873 |

Here, label is the target label column, split is the dataset split, #cases is the total number of cases, #samples is the total number of samples with a non-missing label, prevalence is the prevalence of the label, and % samples w/label is the fraction of individuals with a non-missing label (i.e., either a definitive case or control label is present). Note that the ALL split contains all individuals with valid blows regardless of ancestry, while the TRAIN and VAL splits contain only European samples.
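For reference, a 0/1/NaN label column can be summarized into those statistics with something like this minimal sketch (the helper name is just for illustration):

import pandas as pd

def label_summary(df: pd.DataFrame, label: str) -> dict:
  # Summarize a 0.0/1.0/NaN label column as in the table above.
  col = df[label]
  n_samples = int(col.notna().sum())  # Samples with a non-missing label.
  n_cases = int((col == 1.0).sum())  # Total number of cases.
  return {
      'label': label,
      '#cases': n_cases,
      '#samples': n_samples,
      'prevalence': n_cases / n_samples,
      '% samples w/label': n_samples / len(df),
  }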

Your code snippet for discarding tail values looks reasonable. Here is what we used:

import numpy as np
import pandas as pd

# The upper and lower bounds for extreme values.
LOWERBOUND_PERCENTAGE = 0.5
UPPERBOUND_PERCENTAGE = 99.5

def discard_extreme_tail_values(
    df: pd.DataFrame,
    measures: list[str],
) -> pd.DataFrame:
  """Discards extreme tail values for `measures`.

  This function assumes that each measure has three indices corresponding to
  each blow attempt (e.g., 'fev1_0', 'fev1_1', and 'fev1_2' are the FEV1
  measures for an individual across the three blow attempts from the first
  visit).

  Args:
    df: The dataframe to discard tail values for.
    measures: The measures to discard tail values for.

  Returns:
    The dataframe with extreme tail values discarded.
  """
  df = df.copy()
  lb, ub = {}, {}
  for measure in measures:
    values = []
    for idx in [0, 1, 2]:
      values.extend(list(df[f'{measure}_{idx}'].dropna().values))
    lb[measure], ub[measure] = np.percentile(
        values, q=[LOWERBOUND_PERCENTAGE, UPPERBOUND_PERCENTAGE]
    )
  for idx in [0, 1, 2]:
    na_mask = pd.Series(False, index=df.index)
    columns = []
    for measure in measures:
      column = f'{measure}_{idx}'
      columns.append(column)
      na_mask |= (df[column] < lb[measure]) | (df[column] > ub[measure])
    df.loc[na_mask, columns] = np.nan
  return df

spirometry_df = discard_extreme_tail_values(
    spirometry_df,
    ['fev1', 'fvc', 'pef'],
)

# Remove unacceptable values according to Field 3061:
# https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=3061.
for idx in [0, 1, 2]:
  unacceptable_mask = (spirometry_df[f'qc_{idx}'] != 0) & (
      spirometry_df[f'qc_{idx}'] != 32
  )
  spirometry_df.loc[
      unacceptable_mask, [f'fvc_{idx}', f'fev1_{idx}', f'pef_{idx}']
  ] = np.nan

Your splitting procedure sounds reasonable. For configs, you want to only train using the flow-volume curve:

config.dataset_config = ml_collections.ConfigDict({
        'inputs': {
            'eid',
            'flow_volume_pad_zero',
        },
        # The rest of the code remains unchanged
})

config.backbone_config = ml_collections.ConfigDict({
        'class_name': 'ResNet18',
        'kwargs': {
            'model_name': 'ResNet18',
            'input_names': 'flow_volume_pad_zero',
            # The rest of the code is unchanged
        }
})

For 70 rounds of training, is that 70 iterations/steps/batches or 70 epochs? I'd expect overfitting to occur well before 70 epochs and at around 15-20k steps.

yudaleng commented 10 months ago

Thanks for your reply!

> (https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1834 and https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1835)

Can you provide the label handling code for Training Hospitalization, or give me a simple example? I downloaded the TRUD mappings you mentioned and would like to know how to use them: how did you capture the subcodes and process them using the TRUD mappings? Have a nice day!

justincosentino commented 10 months ago

This is unfortunately one of the things that is challenging to extract cleanly from our infra, and it depends on how you've stored your UKB datasets. At a high level, you want to do the following:

  1. Download the Read2 (coding1834 from [1]) and Read3 (coding1835 from [2]) TRUD coding mappings.
  2. Parse these to create coding->meaning maps (where coding is the Read2 or Read3 code and meaning is the ICD-10 code). For example, there are a number of Read3 codes ("H310.", "H3101", "H310z", etc.) that map to the COPD "J41" code. You want to do this for all four ICD-10 codes ('J41', 'J42', 'J43', and 'J44').
  3. Load your GP clinical events data TSV [3].
  4. Iterate over each line in the GP clinical data TSV (which contains EID, data provider, date, Read2, Read3, etc. columns), checking whether each event corresponds to one of the COPD-related ICD-10 codes. Build up an {eid: has_copd_gp_src} mapping denoting whether each EID has a COPD-related event.
  5. Convert this mapping into your copd_gp_src column (see the sketch below).

Individuals who have COPD events are considered cases, individuals who have only non-COPD events are considered controls, and individuals with no events are considered missing (i.e., nan). Hopefully this helps.
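To make those steps concrete, here is a rough sketch. Again, this is not our internal code: the file paths and TSV column names are assumptions about how you've downloaded the coding files and GP clinical records (the read_2 and read_3 columns follow the record table in [3]).

import csv

import pandas as pd

COPD_ICD10_PREFIXES = ('J41', 'J42', 'J43', 'J44')

def load_read_to_icd10(coding_tsv: str) -> dict:
  # Parse a TRUD coding TSV into a {read_code: icd10_code} map. The 'coding'
  # and 'meaning' column names follow the UKB coding file layout.
  with open(coding_tsv) as f:
    reader = csv.DictReader(f, delimiter='\t')
    return {row['coding']: row['meaning'] for row in reader}

read2_to_icd10 = load_read_to_icd10('coding1834.tsv')  # Placeholder path.
read3_to_icd10 = load_read_to_icd10('coding1835.tsv')  # Placeholder path.

def is_copd_event(read_2, read_3) -> bool:
  # Map the event's Read2/Read3 code to ICD-10 and prefix-match so that
  # subcodes (e.g., 'J449') are captured.
  icd10 = read2_to_icd10.get(read_2) or read3_to_icd10.get(read_3) or ''
  return icd10.startswith(COPD_ICD10_PREFIXES)

# Individuals with at least one COPD event become cases, individuals with
# only non-COPD events become controls, and individuals absent from the GP
# clinical table remain missing.
gp_df = pd.read_csv('gp_clinical.tsv', sep='\t', dtype=str)  # Placeholder.
has_copd = {}
for row in gp_df.itertuples():
  has_copd[row.eid] = has_copd.get(row.eid, False) or is_copd_event(
      row.read_2, row.read_3
  )
copd_gp_src = pd.Series(has_copd, dtype=float)  # 1.0 = case, 0.0 = control.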

[1] https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1834 [2] https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1835 [3] https://biobank.ndph.ox.ac.uk/ukb/rectab.cgi?id=1060

justincosentino commented 10 months ago

Hi @yudaleng, I've summarized the details from this discussion in the README: https://github.com/Google-Health/genomics-research/tree/main/ml-based-copd#dataset-preprocessing. I also uploaded the spirometry preprocessing code from the REGLE repository to a separate library here: https://github.com/Google-Health/genomics-research/blob/main/ml-based-copd/learning/ukb_3066_demo_preprocessing.py. I'm going to mark this as resolved for now. Please feel free to open another issue and assign it to me if you have further questions on label derivation.

yudaleng commented 10 months ago

Thank you very much for your help!