bmvdgeijn / WASP

WASP: allele-specific pipeline for unbiased read mapping and molecular QTL discovery
Apache License 2.0

Incorrect matching during Step2 of CHT #89

Status: Closed (katiearacena closed 5 years ago)

katiearacena commented 5 years ago

Hello,

I am on Step 2 of CHT using snp2h5. All chromosomes except chr3-9 match correctly. However, for chromosomes 3-9 the following happens:

"chr3.phasing.impute2.gz pairs with chr2.phasing.impute2_haps.gz"
"chr4.phasing.impute2.gz pairs with chr2.phasing.impute2_haps.gz"
"chr5.phasing.impute2.gz pairs with chr2.phasing.impute2_haps.gz"
.........

The impute2.gz files for chromosomes 3-9 always pair with chr2.phasing.impute2_haps.gz. The files for chromosomes 1-2 and 10-22 pair correctly.

This incorrect matching also occurs during the "initializing HDF5 matrix" stage:

"guessing chromosome name from filename chr3.phasing.impute2.gz best matching chromosome: 2". This then fails with "ERROR: snp2h5.c:479 failed to create dataset".

Any insights into what might be going on would be greatly appreciated!!!!

gmcvicker commented 5 years ago

This is a bug that I fixed for the VCF parsing a while ago. It was more difficult to fix for the impute parsing and I didn't get around to fixing it there, but I should. I believe the issue is that it is matching the '2' in 'impute2' with the 'chr2' in the other filename. As a workaround, could you try removing the '2' from the 'impute2' part of your filenames? E.g. you could name them like chr3.phasing.impute.gz (dropping the '2').

katiearacena commented 5 years ago

Hi Graham,

Thanks for your quick response. I tried to drop the 2 from the file name but it appears that it requires the 'impute2' extension. I get the error:

"WARNING: snp2h5.c:637: ignoring file 'chr*.phasing.impute_haps.gz'. Expected extension '.impute2.gz' or 'impute2_haps.gz' done"

Is this something that I can fix on my end?

Thank you!

katiearacena commented 5 years ago

Hi Graham,

Do you have any other ideas on how I might circumvent this problem? Thanks!

Katie

gmcvicker commented 5 years ago

Hi Katie,

The problem is that the chromosome matching code has trouble when the chromosomes are named like '1', '2', '3', instead of 'chr1', 'chr2', 'chr3'. It is matching the '2' in the impute2 filename instead of the chromosome name.
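To illustrate the explanation above, here is a minimal Python sketch of one matching rule that would reproduce the reported symptoms. It is a hypothetical reconstruction, not the code in snp2h5.c: the function name guess_chrom, the chromosome list '1' to '22', and the "keep the first longest substring match" rule are all assumptions made for illustration.

```python
def guess_chrom(filename, chrom_names):
    """Return the first chromosome name that gives the longest substring match.

    Hypothetical rule, assumed for illustration; not taken from snp2h5.c.
    """
    best = None
    for name in chrom_names:
        # A later name only replaces the current best if it is strictly longer.
        if name in filename and (best is None or len(name) > len(best)):
            best = name
    return best


# Assumed chromosome naming: '1', '2', ..., '22' (no 'chr' prefix).
chrom_names = [str(i) for i in range(1, 23)]

if __name__ == "__main__":
    for n in (1, 2, 3, 9, 10, 22):
        fname = f"chr{n}.phasing.impute2.gz"
        print(fname, "->", guess_chrom(fname, chrom_names))
    # The same file with the '2' dropped from 'impute2':
    print(guess_chrom("chr3.phasing.impute.gz", chrom_names))
```

Under these assumptions, the '2' in 'impute2' is found before the single-digit names '3' to '9' and is never replaced, while two-digit names such as '10' win because they are longer. Dropping the '2' from the filename removes the competing match.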

I would like to fix this for the impute files the same way I fixed it for VCF parsing (by peeking in the files to determine the chromosome, rather than relying on the filename), but I don't have time to implement a fix for this right now due to a deadline.
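For readers unfamiliar with the "peek in the file" approach, here is a rough Python sketch of the idea for the VCF case only. It is not WASP's implementation; the function name peek_vcf_chrom is hypothetical. It relies only on the VCF convention that header lines start with '#' and that the first tab-separated column of a data line is CHROM. Whether an equivalent field is available in a given IMPUTE2 file is not established in this thread.

```python
import gzip


def peek_vcf_chrom(path):
    """Return the CHROM field of the first data line in a gzipped VCF.

    Returns None if the file has no data lines. Hypothetical helper,
    shown only to illustrate reading the chromosome from file contents.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip header lines ('##...' and '#CHROM...') and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            return line.split("\t", 1)[0]
    return None
```

Because the chromosome comes from the file contents, a digit elsewhere in the filename cannot influence the result.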

As a quick fix you could try the following:
(1) change the snp2h5 code to match impute.gz and impute_haps.gz files instead, by deleting the '2' on the following lines of code: https://github.com/bmvdgeijn/WASP/blob/36c0e5f8b523278ba2a1f1ffbe17a7785b79a6e6/snp2h5/snp2h5.c#L627-L631
(2) re-make snp2h5
(3) rename your input files as I suggested before, so they end in 'impute.gz' and 'impute_haps.gz'
(4) re-run snp2h5
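Step (3) can be scripted. The Python sketch below renames files ending in '.impute2.gz' and '.impute2_haps.gz' within one directory; the function names new_name and rename_impute_files and the dry_run default are assumptions for illustration, not part of WASP. Note that the renamed files are only accepted after snp2h5 has been edited and rebuilt as in steps (1) and (2).

```python
from pathlib import Path

# Old filename ending -> new filename ending, as described in step (3).
RENAMES = {
    ".impute2_haps.gz": ".impute_haps.gz",
    ".impute2.gz": ".impute.gz",
}


def new_name(filename):
    """Return the renamed filename, or None if the file is not an impute2 file."""
    for old, new in RENAMES.items():
        if filename.endswith(old):
            return filename[: -len(old)] + new
    return None


def rename_impute_files(directory, dry_run=True):
    """Rename impute2 files in `directory`; return the (old, new) name pairs.

    With dry_run=True (the default) nothing is renamed, so the planned
    changes can be inspected first.
    """
    planned = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).iterdir()):
        target = new_name(path.name)
        if target is None:
            continue
        planned.append((path.name, target))
        if not dry_run:
            path.rename(path.with_name(target))
    return planned
```

Calling rename_impute_files(".") first shows what would change; passing dry_run=False performs the renames.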

katiearacena commented 5 years ago

Hi Graham,

Thank you for your very clear quick-fix suggestion. I was able to resolve the issue with this workaround. Thank you!