MRCIEU / GwasDataImport

R package to upload data to GWAS database
https://mrcieu.github.io/GwasDataImport/

Cannot specify columns in the dataset #1

Open mightyphil2000 opened 3 years ago

mightyphil2000 commented 3 years ago

I'm getting this error message when I try to specify columns in the dataset:

x$determine_columns(list(chr_col="CHR", snp_col="rs", pos_col="BP", oa_col="other_allele", ea_col="eff_allele", eaf_col="Fctrl", beta_col="lnor", se_col="SE", pval_col="P",ncase_col = "ncases",ncontrol_col="nctrls","imp_info_col"="RsqAvg"))

Error in x$determine_columns(list(chr_col = "CHR", snp_col = "rs", pos_col = "BP", :
  all(is.numeric(out$beta)) is not TRUE
In addition: Warning message:
Unknown or uninitialised column: beta.

I've checked that the beta column is definitely all numeric.

If I specify the columns using column position I get a different error:

x$determine_columns(list(chr_col=3, snp_col=2, pos_col=4, oa_col=6, ea_col=5, eaf_col=12, beta_col=16, se_col=8, pval_col=9))
Error in .subset2(x, i, exact = exact) : subscript out of bounds

mvab commented 3 years ago

Hi,

I have the same issue: all(is.numeric(out$pval)) is not TRUE

This is how I specify the columns:

x$determine_columns(list(chr_col="CHR", 
                         snp_col="SNP", 
                         pos_col="BP",
                         oa_col="ALLELE0",
                         ea_col="ALLELE1", 
                         eaf_col="A1FREQ", 
                         beta_col="BETA", 
                         se_col="SE",
                         pval_col="P_BOLT_LMM_INF"))

Here it seems to be assigning the columns correctly:

Checking alleles are in A/C/T/G/D/I
0 variants with disallowed characters
Is this how the dataset should look?
tibble [100 × 9] (S3: tbl_df/tbl/data.frame)
 $ chr : int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
 $ pos : int [1:100] 10177 10352 11008 11012 13110 13116 13118 13273 14464 14599 ...
 $ ea  : chr [1:100] "A" "T" "C" "C" ...
 $ oa  : chr [1:100] "AC" "TA" "G" "G" ...
 $ beta: num [1:100] 0.003867 -0.000167 -0.003125 -0.003125 -0.001727 ...
 $ se  : num [1:100] 0.00408 0.00419 0.00701 0.00701 0.00929 ...
 $ pval: num [1:100] 0.34 0.97 0.66 0.66 0.85 0.79 0.79 0.12 0.79 0.95 ...
 $ snp : chr [1:100] "rs367896724" "rs201106462" "rs575272151" "rs544419019" ...
 $ eaf : num [1:100] 0.602 0.607 0.914 0.914 0.941 ...
NULL

I think something is happening with the column order in the format function. In my file, the column order is not the same as input arguments in determine_columns (understandably), so I specify them by column name (as above). This leads to the error. However, if I re-order the columns in my original file to match the order of the arguments in the format_dataset function and save it as a new file, and then try to run format_dataset on this file, it works fine.

Column order in the original file: "CHR", "BP", "SNP", "BETA", "SE", "ALLELE1", "ALLELE0", "A1FREQ", "P_BOLT_LMM_INF"
Reordered: "CHR", "SNP", "BP", "ALLELE0", "ALLELE1", "A1FREQ", "BETA", "SE", "P_BOLT_LMM_INF"

For both, I run the same x$determine_columns() call as above.

So reordering the file before trying to upload is a workaround for now.
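
For anyone who wants to try the same workaround, here is a minimal sketch of the reordering step in base R. The one-row data frame reuses the first row of the str() output above, and the output path is a temporary file chosen only for illustration; neither is part of GwasDataImport itself.

```r
# Toy data in the original column order described above.
gwas <- data.frame(CHR = 1L, BP = 10177L, SNP = "rs367896724",
                   BETA = 0.003867, SE = 0.00408,
                   ALLELE1 = "A", ALLELE0 = "AC",
                   A1FREQ = 0.602, P_BOLT_LMM_INF = 0.34,
                   stringsAsFactors = FALSE)

# Target order, matching the order of the *_col arguments in the call above.
new_order <- c("CHR", "SNP", "BP", "ALLELE0", "ALLELE1",
               "A1FREQ", "BETA", "SE", "P_BOLT_LMM_INF")

# Guard against typos: both sets of names must match exactly.
stopifnot(setequal(new_order, names(gwas)))

# Reorder by name, so the original column positions do not matter.
reordered <- gwas[, new_order]

# Save as a new tab-delimited file (temporary path, for illustration only).
out_file <- tempfile(fileext = ".txt")
write.table(reordered, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The new file can then be passed to the upload workflow in place of the original.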

mvab commented 3 years ago

Hi @explodecomputer,

I think I found what is causing this issue (ignore my above investigation).

In determine_columns(), files in the IEU GWAS pipeline output format are read correctly when rows=100 is specified (example 1). However, when rows=Inf is used (inside the format_dataset() function), the pval column is read as <chr> rather than <dbl> (example 2). I'm not sure why this happens.

(example 1) $ P_BOLT_LMM <dbl> 0.400, 0.940, 0.740, 0.740, 0.790, 0.960,

(example 2) $ P_BOLT_LMM <chr> "4.0E-01", "9.4E-01", "7.4E-01", "7.4E-01"

So the is.numeric() check fails.
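
One way this kind of mismatch can arise is sketched below in base R. This is an illustration under assumptions, not a confirmed diagnosis: read.table() stands in for whatever reader the package uses, and the "." token in the last row is made up purely to show how a single unparseable value further down a file can turn a whole column into character when every row is read.

```r
# Toy summary-statistics file; the "." in the last row is a hypothetical
# non-numeric token, used only to reproduce a character p-value column.
txt <- "SNP P
rs1 4.0E-01
rs2 9.4E-01
rs3 ."

# Reading only the first rows: every value parses, so P is numeric.
head_rows <- read.table(text = txt, header = TRUE, nrows = 2,
                        stringsAsFactors = FALSE)

# Reading all rows: the stray token makes the whole column character.
all_rows <- read.table(text = txt, header = TRUE,
                       stringsAsFactors = FALSE)

is.numeric(head_rows$P)  # TRUE
is.numeric(all_rows$P)   # FALSE

# Possible workaround: coerce explicitly; unparseable entries become NA.
all_rows$P <- suppressWarnings(as.numeric(all_rows$P))
is.numeric(all_rows$P)   # TRUE
```

Whether explicit coercion is appropriate depends on what the non-numeric entries actually are, so it is worth inspecting them before converting.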

My suggestions:

or