HannahVMeyer / PhenotypeSimulator

Read in data from getCausalSNPs() #10

Closed blairzhang126 closed 5 years ago

blairzhang126 commented 5 years ago

Hi, Hannah,

Sorry to bother you again. I encountered a problem when reading in data with getCausalSNPs(). My R version is 3.5.0. It seems that I must run the same command twice to get the output in the format I want. To give you an idea of what I have done:

library(PhenotypeSimulator)
set.seed(1234)
causalSNPsFromLines <- getCausalSNPs(10000, NrCausalSNPs=10, chr=22,
    genoFilePrefix="testtime",
    genoFileSuffix="_10000_case_maf0.01_ld0.2_100.recodeA.raw.transpose",
    format="delim", delimiter=" ")

First round output:

head(causalSNPsFromLines)
     X           rs470766_C rs2329553_T rs8136076_T rs56150635_C 22-36398040_A
ID_1 "id2_10000" "0"        "1"         "0"         "0"          "0"
ID_2 "id2_10001" "0"        "0"         "1"         "0"          "0"
ID_3 "id2_10002" "0"        "1"         "0"         "0"          "0"
ID_4 "id2_10003" "0"        "1"         "0"         "0"          "0"
ID_5 "id2_10004" "0"        "1"         "0"         "0"          "0"
ID_6 "id2_10005" "1"        "1"         "0"         "0"          "0"

Here I want neither the quotes nor the sample IDs appearing twice. If I run the exact same command again, I get the correct format, like this:

head(causalSNPsFromLines)
     rs481709_T rs2329553_T rs4821946_G rs5759481_G rs957648_C rs3876055_A
ID_1          1           1           0           0          1           0
ID_2          0           0           0           0          1           1
ID_3          2           1           0           0          0           2
ID_4          2           1           0           0          2           0
ID_5          0           1           0           1          1           2
ID_6          1           1           0           0          0           0

I would hate to have to run the exact same command twice to get what I want. I just wanted to know whether you have encountered this before. I have uploaded my data to GitHub in case you want to look at or test it: https://github.com/blairzhang126/phenosim-sampledata (it's a space-delimited file).

Best, Blair

blairzhang126 commented 5 years ago

Hi, Hannah, no worries! I found the problem!! I'm posting it here in case other people are wondering. (feel free to close the issue)

It turns out that only random seed 1234 causes the problem. I tried 123 and 12345, and both gave me the correct format. That is why I needed to run the command twice to get the correct format: the second time, the random seed is no longer 1234. The run with random seed 1234 will not work!
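For readers wondering why a second run behaves differently, here is a minimal base-R sketch. It assumes the second call was made without calling set.seed(1234) again, and it uses sample() only as a stand-in for whatever random sampling getCausalSNPs does internally.

```r
# Seed the generator, then draw twice without reseeding.
set.seed(1234)
first_draw  <- sample(1:1000, 10)  # uses the freshly seeded state
second_draw <- sample(1:1000, 10)  # continues from the advanced state

# Reseeding restores the original state, so the first draw is reproduced.
set.seed(1234)
repeat_draw <- sample(1:1000, 10)

identical(first_draw, repeat_draw)   # same seed, same draw
identical(first_draw, second_draw)   # no reseed, different draw
```

So a rerun without reseeding effectively samples a different set of rows from the file, which can hide a problem that only shows up for particular draws.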

I hope this is not confusing to people. Feel free to comment if you have any questions.

HannahVMeyer commented 5 years ago

Hi Blair,

thank you for pointing me to this issue and providing the sample data!

The problem occurred because getCausalSNPs was not designed to handle a header line when sampling from a delimited file:

If format== delim, the first column in each file needs to be the SNP_ID and files cannot contain a header. (from Details in ?getCausalSNPs)

The seed you chose happened to lead to sampling of the first row in the file, which was the header. I have now included an option to specify whether the file contains a header, and additional checks to make sure the right data is received when sampling from the genotypes file. This is available now in the current GitHub version (v0.3.2). I will keep this issue open until this fix is also up on CRAN.
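The effect described above can be reproduced with a toy file in base R. This is only an illustrative sketch and not the package's actual code: the file contents and the helper parse_lines() are made up for the example.

```r
# Toy space-delimited genotype file: one SNP per line, SNP ID in the first
# column, preceded by a header line holding the sample IDs.
geno_file <- tempfile(fileext = ".txt")
writeLines(c(
  "X id2_10000 id2_10001 id2_10002",  # header line
  "rs1_A 0 1 0",
  "rs2_C 1 0 2",
  "rs3_T 0 0 1"
), geno_file)
lines <- readLines(geno_file)

# Hypothetical helper: turn the chosen lines into a samples x SNPs matrix,
# using the first field of each line as the column name.
parse_lines <- function(lines, delimiter = " ") {
  fields <- strsplit(lines, delimiter, fixed = TRUE)
  values <- do.call(cbind, lapply(fields, function(f) f[-1]))
  colnames(values) <- vapply(fields, function(f) f[1], character(1))
  rownames(values) <- paste0("ID_", seq_len(nrow(values)))
  values
}

with_header    <- parse_lines(lines[c(1, 3)])  # line 1 is the header
without_header <- parse_lines(lines[c(2, 3)])  # SNP lines only

with_header     # has a column "X" full of sample IDs
without_header  # only genotype values

# Workaround for files with a header: never offer line 1 to the sampler.
snp_lines <- seq_along(lines)[-1]
sampled   <- sample(snp_lines, 2)
```

Whether line 1 is drawn depends on the state of the random number generator, which is why the problem appears with some seeds and not others.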

Thank you for raising this issue, Hannah

HannahVMeyer commented 5 years ago

Latest release including this fix on CRAN (v0.3.3), closing this now.