Closed yudaleng closed 10 months ago
Hi @yudaleng, thanks for reaching out. Are you asking about the code to preprocess raw UKB spirometry data into flow-volume curves?
We have an example of converting the UKB demo spirometry blow showcased in field 3066 here: https://github.com/Google-Health/genomics-research/blob/main/regle/generate_ukb_3066_demo_dataset.py. This spirometry exhalation volume curve example is publicly available and can be downloaded following instructions on this page: https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=3. Accessing the full dataset requires application and approval from the UK Biobank.
Note that the final formats for REGLE differ slightly from those used in ML-based COPD (npy files v.s. a pickled pandas dataframe, respectively). The dataframe returned by derive_input_representations
contains inputs in the expected ML-based COPD format:
trimmed_records_df = trim_records(UKB_3066_RECORDS)
base_curve_dfs = derive_base_curves(trimmed_records_df)
blow_curve_derived_df = derive_input_representations(base_curve_dfs)
If doing training, one would need to concatenate additional label columns to that dataframe. Separate train, validation, and test pickles should then be written for each split according to the expected filepaths outlined in https://github.com/Google-Health/genomics-research/blob/7d8a8aff79b490657dfe66845ae60f4a9c2a000d/ml-based-copd/learning/dataset_util.py#L82.
Does that answer your question?
Yes, I have UKB access. Thank you very much for your reply!
I have previously used the generate_ukb_3066_demo_dataset.py
you provided to process the 3066 field. I now believe that there is an issue with my label (copd) processing and would like to ask if you can provide the processing files for the label columns?
Have a nice day
Great, I'm glad that generate_ukb_3066_demo_dataset.py
worked for preprocessing the spirograms!
Can you describe the issues you are running into when generating your COPD labels in more detail? Unfortunately our label generation code is tightly integrated with our internal infrastructure for ingesting and storing raw UKB data so much of it won't be immediately useful if open-sourced. Depending on which columns you are having trouble with, I might be able to pull out a relevant or useful snippet for reference or describe how the column was derived in more detail.
We have our binary COPD label definitions outlined in Supplementary Table 1 (web link, direct pdf link). I've reproduced it here for reference:
Label | Definition | Usage |
---|---|---|
Self report | Code 6 in field 6152 (medical conditions in touch screen questionnaires) or codes 1112, 1113, or 1472 in field 20002 (medical conditions in verbal interview). | Definition of medical-record-based COPD labels below used in training. |
Primary Care | ICD-10 codes J41, J42, J43, or J44 in field 42040 (GP clinical event records), after Read v2 and v3 codes in the records mapped into ICD-10 codes. | Definition of medical-record-based COPD labels below used in training. |
Training Hospitalization | Includes primary or secondary causes of hospitalization. ICD-9 codes 491, 492, or 496 in field 41271 (Diagnoses - ICD9), or ICD-10 codes J41, J42, J43, or J44 in field 41270 (Diagnoses - ICD10). | Definition of medical-record-based COPD labels below used in training. |
Medical-record-based | If a COPD case in at least one of "self report", "primary care", and "training hospitalization" COPD labels. | Training of ML models. |
Evaluation medical-record-based | Logical OR of "self report", "primary care", and "training hospitalization" labels only when all three sources exist for an individual. | Evaluation of ML models and GWAS hits. Having all three sources increases the likelihood of a correct COPD label which is preferred for evaluation. |
Future hospitalization | Only includes cases with COPD as primary cause of hospitalization after the spirometry test date. ICD-10 codes J41, J42, J43, or J44 in field 41234 (records in HES inpatient diagnoses dataset) after converting ICD-9 codes to ICD-10 codes. | Evaluation of ML models. |
Death | ICD-10 J41, J42, J43, or J44 codes in field 40001 (primary cause of death). | Evaluation of ML models and GWAS hits. |
Hospitalization | Similar to "future hospitalization" but also includes cases before the spirometry test date. | Evaluation of GWAS hits. |
Proxy-GOLD | Mirrors moderate or worse GOLD grading for a single blow without bronchodilation: FEV1/FVC<0.7 and FEV1%pred<80% . |
Evaluating the noisiness of training labels, "medical-record-based" COPD. Evaluation of binarized ML-based COPD liability. |
Assuming you have extracted self report, primary care, and hospitalization label columns for all individuals into a copd_binary_df
pandas dataframe as copd_sr_src
, copd_gp_src
, and copd_hesin_src
, respectively, we then define the medical-record-based COPD and evaluation medical-record-based COPD as follows:
import numpy as np
import pandas as pd
def float_logical_or(df: pd.DataFrame, columns: list[str]) -> pd.Series:
# Assumes the values are 0.0, 1.0, or NaN. If all columns are NaN,
# the result is NaN; otherwise, it is the max of non-NaN values, which is
# equivalent to the logical or for 0/1 values.
return df[columns].max(axis=1)
# Define medical-record-based COPD as the OR of self report, primary care, and
# hospitalization labels.
copd_binary_df['copd_mrb'] = float_logical_or(
copd_binary_df,
[
'copd_sr_src',
'copd_gp_src',
'copd_hesin_src',
],
)
# Define evaluation medical-record-based COPD as the OR of self report, primary
# care, and hospitalization labels over individuals with non-missing values for
# all three sources.
has_all_srcs_mask = (
copd_binary_df[[
'copd_sr_src',
'copd_gp_src',
'copd_hesin_src',
]]
.notna()
.all(axis=1)
)
copd_binary_df['copd_mrb_eval'] = np.where(
has_all_srcs_mask,
copd_binary_df['copd_mrb'],
np.nan,
)
Sincerely thank you again for your reply!
This is my code to process the copd label, according to the form, after processing the label will be unified in the copd column. I would like to ask you if I have processed it correctly:
data['copd_sr_src'] = 0
data['copd_gp_src'] = 0
data['copd_hesin_src'] = 0
data['copd'] = 0
# Self report
cols_6152 = data.filter(like='6152').columns
data.loc[data[cols_6152].eq(6).any(axis=1), 'copd_sr_src'] = 1
cols_20002 = data.filter(like='20002').columns
data.loc[data[cols_20002].isin([1112, 1113, 1472]).any(axis=1), 'copd_sr_src'] = 1
# Primary Care
cols_42040 = data.filter(like='42040').columns
data.loc[data[cols_42040].isin(['J41', 'J42', 'J43', 'J44']).any(axis=1), 'copd_gp_src'] = 1
# Training Hospitalization
cols_41271 = data.filter(like='41271').columns
cols_41270 = data.filter(like='41270').columns
data.loc[
data[cols_41271].isin(['491', '492', '496']).any(axis=1) | data[cols_41270].isin(['J41', 'J42', 'J43', 'J44']).any(
axis=1), 'copd_hesin_src'] = 1
def float_logical_or(df: pd.DataFrame, columns: list[str]) -> pd.Series:
# Assumes the values are 0.0, 1.0, or NaN. If all columns are NaN,
# the result is NaN; otherwise, it is the max of non-NaN values, which is
# equivalent to the logical or for 0/1 values.
return df[columns].max(axis=1)
# Define medical-record-based COPD as the OR of self report, primary care, and
# hospitalization labels.
data['copd_mrb'] = float_logical_or(
data,
[
'copd_sr_src',
'copd_gp_src',
'copd_hesin_src',
],
)
# Define evaluation medical-record-based COPD as the OR of self report, primary
# care, and hospitalization labels over individuals with non-missing values for
# all three sources.
has_all_srcs_mask = (
data[[
'copd_sr_src',
'copd_gp_src',
'copd_hesin_src',
]]
.notna()
.all(axis=1)
)
data['copd'] = np.where(
has_all_srcs_mask,
data['copd_mrb'],
np.nan,
)
Also, I would like to ask you a question. Following your paper, I took the raw data (n=453,558) and processed it and discarded some of the data:
# if one of the FEV1, FVC, and PEF values is in the extreme tail/top 0.5% of the observations
low_quantile = 0.005 # tail 0.5%
high_quantile = 0.995 # top 0.5%
times_list = ['0.0', '0.1', '0.2']
for time in tqdm(times_list):
fvc_name = '3062-' + time
fev1_name = '3063-' + time
pef_name = '3064-' + time
fev1_low, fev1_high = data[fev1_name].quantile([low_quantile, high_quantile])
fvc_low, fvc_high = data[fvc_name].quantile([low_quantile, high_quantile])
pef_low, pef_high = data[pef_name].quantile([low_quantile, high_quantile])
mask = (
(data[fev1_name] < fev1_low) | (data[fev1_name] > fev1_high) |
(data[fvc_name] < fvc_low) | (data[fvc_name] > fvc_high) |
(data[pef_name] < pef_low) | (data[pef_name] > pef_high)
)
data.loc[mask, 'eid'] = np.nan
data = data.dropna(subset=['eid'])
# we deem a blow valid if the value recorded in Field 3061 is 0 or 32
field_id_list = ['3061']
for field_id in field_id_list:
for time in tqdm(times_list):
handle_name = field_id + '-' + time
data_handle = data[['eid', handle_name]]
data_handle = data_handle.dropna(subset=[handle_name])
data_handle[handle_name] = pd.to_numeric(data_handle[handle_name], errors='coerce')
discard_eid_list = data_handle.loc[~data_handle[handle_name].isin([0, 32]), 'eid'].tolist()
for eid in discard_eid_list:
change_name = "3066" + "-" + time
data.loc[data['eid'] == eid, change_name] = np.nan
data = data.dropna(subset=['eid'])
I set the problematic data to nan and uniformly remove the nan data at the end. I ended up with the processed data (n=361344)
After doing that, I did the final processing with the code I gave you earlier for processing copd labels(labels0:355287 ;labels1:6057 ) and generate_ukb_3066_demo_dataset.py
.
At this point all I have left in my dataframe is eid
, time
, volume_pad_last
, flow_pad_zero
, flow_volume_pad_zero
, copd
. I divided it into 80% training set(n=289075) as well as 20% testing set(n=72269) by train_test_split(data, test_size=0.2, stratify=data['copd'], random_state=42)
. Finally I packaged both the training set and test set data into pkl files.
I used time
, volume_pad_last
, flow_pad_zero
for training and modified resnet18_fv_copd.py
.
config.dataset_config = ml_collections.ConfigDict({
'inputs': {
'eid',
'time',
'volume_pad_last',
'flow_pad_zero',
},
# The rest of the code remains unchanged
})
config.backbone_config = ml_collections.ConfigDict({
'class_name': 'ResNet18',
'kwargs': {
'model_name': 'ResNet18',
'input_names': 'time,volume_pad_last,flow_pad_zero',
# The rest of the code is unchanged
}
})
I got overfitting when I got to about 70 rounds of training and ran poor predictions, I know it must be my problem but I don't really know what part I am doing wrong and I would appreciate if you could correct me!
This is my code to process the copd label, according to the form, after processing the label will be unified in the copd column. I would like to ask you if I have processed it correctly
While the combination of sources looks correct, it's challenging to debug the definition for each source without knowing how your data
dataframe is structured. As a quick sanity check, does your final label prevalence match that from
Supplementary Table 36 (web link, direct pdf link)?
Dataset | Split | $n$ | MRB | Eval. MRB | GOLD | Hospitalization | Death |
---|---|---|---|---|---|---|---|
Modeling | Train | 259746 | 0.0383 | 0.0473 | 0.0725 | 0.0075 | 0.0007 |
Modeling | Validation | 65281 | 0.0388 | 0.0479 | 0.0720 | 0.0072 | 0.0007 |
Fold 1 | Train | 128739 | 0.0384 | 0.0486 | 0.0711 | 0.0076 | 0.0007 |
Fold 1 | Validation | 32310 | 0.0391 | 0.0477 | 0.0719 | 0.0073 | 0.0008 |
Fold 2 | Train | 129691 | 0.0379 | 0.0456 | 0.0734 | 0.0072 | 0.0007 |
Fold 2 | Validation | 32637 | 0.0381 | 0.0480 | 0.0714 | 0.0069 | 0.0006 |
PRS | Holdout | 110739 | 0.0640 | 0.0748 | - | 0.0053 | 0.0021 |
Similarly, can you compare your distribution of COPD ICD10 codes to Supplementary Table 35? Based on "labels(labels0:355287 ;labels1:6057 )", it looks like your prevalence is lower than expected. If those do not match, it's worth checking that you are capturing subcodes (i.e., 491.*
, 492.*
, 496.*
for ICD 9) and that you have mapped the UKB gp_clinical
reads using the TRUD mappings (https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1834 and https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1835).
I've included a breakdown down of sample size and prevalence per label source below:
label | split | #cases | #samples | prevalence | % samples w/label |
---|---|---|---|---|---|
copd_mrb | ALL | 13104 | 351152 | 0.0373 | 1.0000 |
copd_sr_src | ALL | 5638 | 351040 | 0.0161 | 0.9997 |
copd_gp_src | ALL | 2934 | 161154 | 0.0182 | 0.4589 |
copd_hesin_src | ALL | 8512 | 289280 | 0.0294 | 0.8238 |
copd_mrb_eval | ALL | 6227 | 133987 | 0.0465 | 0.3816 |
copd_mrb | TRAIN | 9936 | 259746 | 0.0383 | 1.0000 |
copd_sr_src | TRAIN | 4285 | 259707 | 0.0165 | 0.9998 |
copd_gp_src | TRAIN | 2261 | 120425 | 0.0188 | 0.4636 |
copd_hesin_src | TRAIN | 6477 | 214743 | 0.0302 | 0.8267 |
copd_mrb_eval | TRAIN | 4749 | 100505 | 0.0473 | 0.3869 |
copd_mrb | VAL | 2533 | 65281 | 0.0388 | 1.0000 |
copd_sr_src | VAL | 1075 | 65276 | 0.0165 | 0.9999 |
copd_gp_src | VAL | 571 | 30373 | 0.0188 | 0.4653 |
copd_hesin_src | VAL | 1657 | 53776 | 0.0308 | 0.8238 |
copd_mrb_eval | VAL | 1212 | 25281 | 0.0479 | 0.3873 |
Here, label
is the target label column, split
is the dataset split, #case
is the total number of cases, #samples
is the total number of samples with a non-missing label, prevalence
is the prevalence of the label, and % samples w/label
is the fraction of individuals with a non-missing label (i.e., either a definitive case/control label is present). Note that the ALL
split contains all individuals with valid blows regardless of ancestry while the TRAIN
and VAL
splits contain only European samples.
Your code snippet for discarding tail values looks reasonable. Here is what we used:
import numpy as np
import pandas as pd
# The upper and lower bounds for extreme values.
LOWERBOUND_PERCENTAGE = 0.5
UPPERBOUND_PERCENTAGE = 99.5
def discard_extreme_tail_values(
df: pd.DataFrame,
measures: list[str],
) -> pd.DataFrame:
"""Discards extreme tail values for `measures`.
This function assumes that each measure has three indices corresponding to
each blow attempt (e.g., 'fev1_0', 'fev1_1', annd 'fev1_2' are the FEV1
measures for an individual across the three blow attempts from the first
visit).
Args:
df: The dataframe to discard tail values for.
measures: The measures to discard tail values for.
Returns:
The dataframe with extreme tail values discarded.
"""
df = df.copy()
lb, ub = {}, {}
for measure in measures:
values = []
for idx in [0, 1, 2]:
values.extend(list(df[f'{measure}_{idx}'].dropna().values))
lb[measure], ub[measure] = np.percentile(
values, q=[LOWERBOUND_PERCENTAGE, UPPERBOUND_PERCENTAGE]
)
for idx in [0, 1, 2]:
na_mask = pd.Series(False, index=df.index)
columns = []
for measure in measures:
column = f'{measure}_{idx}'
columns.append(column)
na_mask |= (df[column] < lb[measure]) | (df[column] > ub[measure])
df.loc[na_mask, columns] = np.nan
return df
spirometry_df = discard_extreme_tail_values(
spirometry_df,
['fev1', 'fvc', 'pef'],
)
# Remove unacceptable values according to Field 3061:
# https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=3061.
for idx in [0, 1, 2]:
unacceptable_mask = (spirometry_df[f'qc_{idx}'] != 0) & (
spirometry_df[f'qc_{idx}'] != 32
)
spirometry_df.loc[
unacceptable_mask, [f'fvc_{idx}', f'fev1_{idx}', f'pef_{idx}']
] = np.nan
Your splitting procedure sounds reasonable. For configs, you want to only train using the flow-volume curve:
config.dataset_config = ml_collections.ConfigDict({
'inputs': {
'eid',
'flow_volume_pad_zero',
},
# The rest of the code remains unchanged
})
config.backbone_config = ml_collections.ConfigDict({
'class_name': 'ResNet18',
'kwargs': {
'model_name': 'ResNet18',
'input_names': 'flow_volume_pad_zero',
# The rest of the code is unchanged
}
})
For 70 rounds of training, is that 70 iterations/steps/batches or 70 epochs? I'd expect overfitting to occur well before 70 epochs and at around 15-20k steps.
Thanks for your reply !
(https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1834 and https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1835)
Can you provide the label handling code for Training Hospitalization? Or give me a simple example? I downloaded the TRUD mappings you told me about and would like to know how to use it I would like to know how you captured the subcode and processed it using TRUD mapping. Have a nice day !
This is unfortunately one of the things that is challenging to extract cleanly from our infra and is dependent on how you've stored your UKB datasets. At a high level, you want to do the following:
coding1834
from [1]) and Read3 ( coding1835
from [2]) coding TRUD mappings.{eid: has_copd_gp_src}
mapping denoting whether each EID has a COPD-related event.copd_gp_src
column. Individuals who have COPD events are considered cases, individuals who have only non-COPD events are considered controls, and individuals with no events are considered missing (i.e., nan
). Hopefully this helps.
[1] https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1834 [2] https://biobank.ndph.ox.ac.uk/ukb/coding.cgi?id=1835 [3] https://biobank.ndph.ox.ac.uk/ukb/rectab.cgi?id=1060
Hi @yudaleng, I've summarized the details from this discussion in the README: https://github.com/Google-Health/genomics-research/tree/main/ml-based-copd#dataset-preprocessing. I also uploaded the spirometry preprocessing code from the REGLE repository to a separate library here: https://github.com/Google-Health/genomics-research/blob/main/ml-based-copd/learning/ukb_3066_demo_preprocessing.py. I'm going to mark this as resolved for now---please feel free to open another issue and assign it to me if you have further questions on label derivation.
Thank you very much for your help!
Hi, can you provide the following preprocessing code for ml-based-copd? Thanks!