Run a phenome scan (pheWAS, Mendelian randomisation (MR)-pheWAS etc.) in UK Biobank.
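There are three components in this project:

1. Phenome scan (the `WAS` folder).
2. Results processing (the `resultsProcessing` folder).
3. Visualisation of the results (the `PHESANT-viz` folder).

R is needed for parts 1 and 2 above. Tested with R-3.3.1-ATLAS. The phenome scan requires the R packages optparse (V1.3.2), MASS (V7.3-45), lmtest (V0.9-34), nnet (V7.3-12), forestplot (V1.7) and data.table (V1.10.4).
Java is needed for part 3 above. Tested with jdk-1.8.0-66.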
Please cite:
Millard LAC, et al. Software Application Profile: PHESANT: a tool for performing automated phenome scans in UK Biobank. International Journal of Epidemiology (2017).
A phenome scan is run using `WAS/phenomeScan.r`. The script is ready to run as supplied. One amendment you may wish to make before running PHESANT is to the `TRAIT_OF_INTEREST` column in the variable information file (see below).
The PHESANT phenome scan processing pipeline is illustrated in the figure here, and described in detail in the paper above.
The phenome scan is run with the following command:
```bash
cd WAS/
Rscript phenomeScan.r \
--phenofile=<phenotypesFilePath> \
--traitofinterestfile=<traitOfInterestFilePath> \
--variablelistfile="../variable-info/outcome-info.tsv" \
--datacodingfile="../variable-info/data-coding-ordinal-info.csv" \
--traitofinterest=<traitOfInterestName> \
--resDir=<resultsDirectoryPath> \
--userId=<userIdFieldName>
```
The following example runs part 1 of 20 of a sensitivity analysis phenome scan (adjusting for age, sex and assessment centre, see below), using a non-genetic trait of interest:
```bash
cd WAS/
Rscript phenomeScan.r \
--phenofile=<phenotypesFilePath> \
--traitofinterestfile=<traitOfInterestFilePath> \
--variablelistfile="../variable-info/outcome-info.tsv" \
--datacodingfile="../variable-info/data-coding-ordinal-info.csv" \
--traitofinterest=<traitOfInterestName> \
--resDir=<resultsDirectoryPath> \
--userId=<userIdFieldName> \
--sensitivity \
--genetic=FALSE \
--partIdx=1 \
--numParts=20
```
Required arguments:

Arg | Description |
---|---|
phenofile | Comma separated file containing phenotypes. Each row is a participant, the first column contains the participant ID and the remaining columns are phenotypes. Where there are multiple columns for a phenotype, these must be adjacent in the file. Specifically, for a given field in UK Biobank the instances should be adjacent, and within each instance the arrays should be adjacent. Each variable name needs to be changed to the format `x[varid]_[instance]_[array]` (we use the prefix 'x' so that the variable names are valid in R). |
variablelistfile | Tab separated file containing information about each phenotype, that is used to process them (see below). |
datacodingfile | Comma separated file containing information about data codings (see below). |
traitofinterest | Variable name as in traitofinterestfile. |
resDir | Directory where you want the results to be stored. |
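
The renaming described for `phenofile` above can be sketched as follows. This is only an illustration, written in Python rather than R, and it assumes the downloaded column headers look like `[varid]-[instance].[array]` (adjust the pattern if yours differ). The function name `to_phesant_name` is hypothetical and not part of PHESANT.

```python
import re

# Assumed input header style: "<varid>-<instance>.<array>", e.g. "21022-0.0".
_HEADER = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def to_phesant_name(header: str) -> str:
    """Return 'x[varid]_[instance]_[array]' for a matching header.

    Headers that do not match the assumed pattern (such as the
    participant id column) are returned unchanged.
    """
    match = _HEADER.match(header)
    if match is None:
        return header
    varid, instance, array = match.groups()
    return f"x{varid}_{instance}_{array}"


# Toy header row: the id column followed by three phenotype columns.
headers = ["userId", "21022-0.0", "20002-0.1", "20002-1.0"]
renamed = [to_phesant_name(h) for h in headers]
print(renamed)
```

Only the header row needs rewriting; the column order, and so the adjacency of instances and arrays, is left untouched.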
Optional arguments:

Arg | Description |
---|---|
traitofinterestfile | Comma separated file containing the trait of interest (e.g. a SNP, genetic risk score or observed phenotype). Each row is a participant and there should be two columns: the user ID and the trait of interest. Where this argument is not supplied, the trait of interest should be a column in the phenofile. |
confounderfile | Comma separated file containing the confounders, so that you can choose what confounders to use in the phenome scan. |
userId | User id column as in the traitofinterestfile and the phenofile (default: userId). |
partIdx | Subset of phenotypes you want to run (for parallelising). |
numParts | Number of subsets you are using (for parallelising). |
sensitivity | By default, analyses are adjusted for age (field 21022), sex (field 31) and, if the `genetic` argument is set to `TRUE`, genotype chip (a binary variable derived from field 22000). If the `sensitivity` argument is used (by including `--sensitivity` when running PHESANT), analyses additionally adjust for assessment centre (field 54) and, if the `genetic` argument is set to `TRUE`, the first 10 genetic principal components (fields 22009_0_1 to 22009_0_10). If you wish to choose your own confounders to use in the phenome scan, you can use the `confounderfile` option (described above). |
genetic | By default `genetic=TRUE`, and we assume the trait of interest is a genetic variable (e.g. a SNP or genetic risk score). If this is not the case (e.g. you are running an environment-wide association study) then set this flag to `FALSE`. This option determines which variables are controlled for in analyses; see the `sensitivity` argument above. |
save | Instead of running the phenome scan, the generated phenotypes are stored to file, in `resDir`. If this option is used then the `traitofinterest` argument is not required. |
confidenceintervals | By default `confidenceintervals=TRUE`, but specifying `confidenceintervals=FALSE` means that PHESANT doesn't calculate the association confidence intervals (which may speed up PHESANT). |
standardise | By default `standardise=TRUE`, but specifying `standardise=FALSE` means that PHESANT will not standardise the exposure variable. E.g. use this option for binary exposure variables. |
tab | By default the phenotype file (`phenofile`) is comma separated, but `tab=TRUE` can be specified when your file is tab delimited (e.g. using the `r` option for UK Biobank's ukbconv utility). |
mincase | Minimum size of phenotype categories (default is 10). |
The `numParts` and `partIdx` arguments are used together to parallelise the phenome scan. E.g. setting `numParts` to 5 will divide the set of phenotypes into 5 roughly equal parts, and `partIdx` can then be used to run the phenome scan on a specific part (1-5).
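
As a rough sketch of how the parts might be launched, the Python snippet below builds one command line per part, which could then be submitted to a job scheduler or run in separate shells. The helper `part_commands`, the phenotype file name, the trait name and the results directory are all hypothetical; only the argument names come from this README.

```python
import shlex


def part_commands(num_parts, base_args):
    """Build one phenomeScan.r command line per part (partIdx runs from 1 to num_parts)."""
    commands = []
    for part_idx in range(1, num_parts + 1):
        args = ["Rscript", "phenomeScan.r", *base_args,
                f"--partIdx={part_idx}", f"--numParts={num_parts}"]
        # Quote each argument so paths with spaces survive the shell.
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


# Hypothetical file, trait and directory names, for illustration only.
base_args = [
    "--phenofile=phenotypes.csv",
    "--variablelistfile=../variable-info/outcome-info.tsv",
    "--datacodingfile=../variable-info/data-coding-ordinal-info.csv",
    "--traitofinterest=exposure",
    "--resDir=results/",
]

commands = part_commands(5, base_args)
for command in commands:
    print(command)
```

Each part is independent of the others, so the commands can be run in any order.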
Data codes define a set of values that can be assigned to a given field. A data code can be assigned to more than one variable, which is why we use a separate file describing the necessary information for each data code. For example, there are several fields about diet that have data code 100009.
The columns of the data coding file include:

- `default_value`: the value assigned to participants who have no value for a field with this data code but do have a value in the field given in the `default_value_related_field` column below. This is used where a category is not explicitly stated in the field but instead needs to be determined by looking at whether another field has a value. Typically, this occurs where there is no category for 'none' in a questionnaire field, because participants were told they did not have to mark 'none' but could instead leave it blank (see for example section 5.3 in the 24 hour diet questionnaire manual). Hence, we assume that if they completed the questionnaire and have not ticked a value, then the value is 'none'. See the default value example below.
- `default_value_related_field`: the field used together with `default_value`.

Default value example: in the data code information file we specify `default_value=0` and `default_value_related_field=20080` for data code 100006. Field 100200, for example, has data code 100006. Therefore all participants with a value for field 20080, but with no value in field 100200, are assigned value 0 for field 100200. Intuitively, all participants who have answered the 24-hour recall diet questionnaire have a value in field 20080, and of these, we assume that those with no value for field 100200 have opted for 'none' implicitly, by not ticking any option.
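
The default value rule can be illustrated with a minimal Python sketch. It is not PHESANT's own R code; the function `apply_default` and the toy participant records are hypothetical, and `None` stands for "no value".

```python
def apply_default(value, related_value, default_value):
    """Return default_value when value is missing but the related field has a value.

    None represents a missing value. In every other case the original
    value (possibly still None) is returned unchanged.
    """
    if value is None and related_value is not None:
        return default_value
    return value


# Toy records keyed by field ID (values are made up for illustration).
participants = [
    {"100200": 2, "20080": 1},        # ticked an option: keeps own value
    {"100200": None, "20080": 1},     # completed questionnaire, nothing ticked: gets 0
    {"100200": None, "20080": None},  # did not complete questionnaire: stays missing
]

# default_value=0 and default_value_related_field=20080, as for data code 100006.
results = [apply_default(p["100200"], p["20080"], 0) for p in participants]
print(results)
```

The third record shows why the related field matters: without a value in field 20080 there is no evidence that the participant saw the question, so the value stays missing rather than becoming 0.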
The variable information file was initially the UK Biobank data dictionary, which can be downloaded from the UK Biobank website. This data dictionary provides information about fields that is used in this phenome scan tool.
The variable information file also has columns that we have added, to provide additional information used in the phenome scan. One of these specifies, for a categorical multiple field, which participants are used as negative examples. The positive examples for a value `v` in this categorical multiple field are simply the people with this particular value. However, the negative examples can be determined in three ways:

1. Every participant without value `v` is assigned `FALSE`, except those with a value denoting missingness (i.e. value is <0).
2. Assign `FALSE` to any participant with at least one value for this field, where these values do not include value `v`, and also do not include a value denoting missingness (i.e. value is <0).
3. Specify a `fieldID`: assign `FALSE` to any participant without value `v` and without a value denoting missingness (i.e. value is <0) and with a value in this other field with ID `fieldID`.

In the directory specified with the `resDir` argument, the following files will be created:
Where the phenome scan is run in a parallel setup, each parallel part will have one of each of the above files, with 'all' in each filename replaced with `[partIdx]-[numParts]`.
See testWAS/README.md for an example with test data.
If the `save` option is used, instead of producing results files, PHESANT will create files in `resDir` such as `data-linear-all.txt` and `data-log-all.txt`.
The `resultsProcessing` folder provides code to post-process the results, including adding information from the `variablelistfile` file to the results file.
The results processing is run with the following command:
```bash
cd resultsProcessing/
Rscript mainCombineResults.r \
--resDir=<resultsDirectoryPath> \
--variablelistfile="../variable-info/outcome-info.tsv"
```
Required arguments:

Arg | Description |
---|---|
resDir | Directory where the phenome scan results are stored. |
variablelistfile | Tab separated file containing information about each phenotype, that is used to process them. Same as variablelistfile used in main phenome scan. |
Optional arguments:

Arg | Description |
---|---|
numParts | Number of subsets (parts) you have used (for parallelising). |
See `testWAS/README.md` for an example with test data.
The QQ plot contains the following elements:
A phenome scan generates a large number of results. The aim of this visualisation is to help with interpretation, by allowing the researcher to view each result in the context of the results of related traits.
See the PHESANT-viz folder and README therein for more information.