Input file formats for impute_runner.py

ccrobertson commented 4 years ago

Hi there,

Thanks for a great tool! I've run all the test examples successfully and am now working on getting this to run on my own data set.

In preparing the input data for impute_runner.py, I noticed some small inconsistencies in the input file formats expected by preprocess_data.py. The kinship file (--king) is expected to be tab-delimited, while the age and sex file (--agesex) is expected to be space-delimited. I was trying to use tab-delimited files for both and got the error "KeyError: IID" because the agesex file was not being parsed properly.

It might be worth stating in the documentation exactly what the requirements for these two files are, or changing both to accept either whitespace- or tab-delimited files.

I edited line 83 of preprocess_data.py as follows, and it now runs:

agesex = pd.read_csv(agesex_address, sep = " ") --> agesex = pd.read_csv(agesex_address, sep = "\t")
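For anyone hitting the same error, here is a minimal sketch of a reader that accepts both space- and tab-delimited files, which is one way to implement the "either format" suggestion. It is not the code in preprocess_data.py: the function name read_agesex and all column names other than IID are assumptions made for illustration.

```python
import io

import pandas as pd


def read_agesex(source):
    """Read an age/sex table delimited by spaces or tabs (hypothetical helper)."""
    # sep=r"\s+" treats any run of whitespace (spaces or tabs) as one delimiter.
    return pd.read_csv(source, sep=r"\s+")


# Hypothetical rows; only the IID column name comes from the error message.
tab_text = "FID\tIID\tsex\tage\nf1\ti1\tM\t40\nf1\ti2\tF\t38\n"
space_text = tab_text.replace("\t", " ")

from_tabs = read_agesex(io.StringIO(tab_text))
from_spaces = read_agesex(io.StringIO(space_text))

# With sep=" ", a tab-delimited file collapses into a single column,
# so there is no column named "IID" to look up.
single_column = pd.read_csv(io.StringIO(tab_text), sep=" ")
```

Both inputs parse to the same four columns here, whereas the sep=" " read of the tab-delimited text yields one column whose name is the entire header line, which would explain the KeyError.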

Also, could you specify which version of KING the program expects? I am using KING version 2.2, and running KING with the --related --degree 1 options produces a kinship file (.kin) with the following columns:

FID ID1 ID2 N_SNP Z0 Phi HetHet IBS0 HetConc HomIBS0 Kinship IBD1Seg IBD2Seg PropIBD InfType Error

This is different from the headers in the sample.king file in your example data sets. I used an awk command to reformat the file to match the example data, and it is now working (I think):

awk 'BEGIN {OFS="\t"; print "FID1","ID1","FID2","ID2","InfType"} NR>1 {print $1, $2, $1, $3, $15}' ${snipar_out}/${prefix}.kin > ${snipar_out}/${prefix}.king
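The same reformatting can be sketched in Python, selecting columns by header name rather than by position so that it does not silently pick the wrong field if a KING version adds or reorders columns. The function name kin_to_king and the sample row are hypothetical, and nothing in this thread confirms that the resulting five-column file is what impute_runner.py expects.

```python
# Columns of the reformatted file, taken from the awk command above.
OUT_COLUMNS = ["FID1", "ID1", "FID2", "ID2", "InfType"]


def kin_to_king(kin_lines):
    """Convert lines of a KING .kin file to the five-column layout (hypothetical helper)."""
    lines = iter(kin_lines)
    header = next(lines).split()
    col = {name: i for i, name in enumerate(header)}
    out = ["\t".join(OUT_COLUMNS)]
    for line in lines:
        fields = line.split()  # splits on any whitespace, like awk's default
        if not fields:
            continue  # skip blank lines
        # A .kin row has a single FID for both individuals, so it is used twice.
        fid = fields[col["FID"]]
        out.append("\t".join([fid, fields[col["ID1"]], fid,
                              fields[col["ID2"]], fields[col["InfType"]]]))
    return out


# Hypothetical input: the header reported above plus one made-up row.
kin_text = [
    "FID ID1 ID2 N_SNP Z0 Phi HetHet IBS0 HetConc HomIBS0 "
    "Kinship IBD1Seg IBD2Seg PropIBD InfType Error",
    "f1 i1 i2 1000 0.0 0.25 0.1 0.0 0.2 0.0 0.25 1.0 0.0 0.5 PO 0",
]
converted = kin_to_king(kin_text)
```

To use it on a real file, pass an open file handle as kin_lines and write the returned lines out joined by newlines.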

MoeenNehzati commented 4 years ago

Hi,

> In preparing the input data for impute_runner.py, I noticed some small inconsistencies in the input file formats expected by preprocess_data.py. The kinship file (--king) is expected to be tab-delimited, while the age and sex file (--agesex) is expected to be space-delimited. I was trying to use tab-delimited files for both and got the error "KeyError: IID" because the agesex file was not being parsed properly.

We wanted the format of the agesex file to be similar to that of pedigree files, which is why it is not tab-delimited. We may change it to tab-delimited in the future. You are right about the documentation; I'll add this as soon as possible.

> This is different from the headers in the sample.king file you have in the example data sets.

You should pass the file containing IBD segments to impute_runner.py as the IBD file. You can obtain that by running:

king -b [address of the bed file for whole genome]\
 --ibdseg \
 --degree 2

Hope that this solves the problem.

ccrobertson commented 4 years ago

Great, thanks.

Sorry, I thought there were two king output files supplied to impute_runner.py:

- IBD file (generated using king --ibdseg)
- kinship file (generated using king --related --degree 1)

The IBD file generated when I run KING matches the example data (sample.segments.gz file). But the kinship file generated when I run KING 2.2.5 does not match the example (sample.king).

MoeenNehzati commented 3 years ago

integrate with git

AlexTISYoung / snipar

Input file formats for impute_runner.py #6