LD Matrix File Format - GitHub Issues

1667857557 commented 1 year ago

Dear Dr. Yang,

Thank you for your exceptional work. While preparing the input file for summary-level GWAS fine-mapping, I encountered a problem. The input LD correlation matrix file in the CARMA package demo is in .txt.gz format, but the LD matrix files downloaded from polyfun for the UK10K data are in .gz and .npz formats. Could you please provide some guidance on how to convert these file types? Additionally, I would like to know whether the CARMA package can perform genome-wide fine-mapping in a single run, as analyzing each locus separately is somewhat inconvenient. Thanks in advance for your assistance.

Huang

ZikunY commented 1 year ago

Dear Huang,

Sorry for the delayed response, and thank you for your interest in CARMA.

For the input LD correlation matrix, I believe the file format (.txt.gz, .gz, or .npz) only affects how the data are loaded into R. As long as the LD matrix can be loaded properly in R and then stored in an R list, CARMA treats it the same way. Note that the LD matrix is assumed to contain Pearson correlations (r), not r-squared values, which are the default LD measure in many other tools, such as plink.

For now, CARMA can only be used at the locus level, where the risk loci are usually identified by the GWAS.
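One possible way to do the conversion is sketched below in Python, since .npz is a NumPy/SciPy format. This is a minimal sketch under assumptions that are not confirmed in this thread: that the .npz file was written with SciPy's sparse `save_npz`, and that it stores only one triangle of the correlation matrix (so the other triangle has to be mirrored). The helper names are hypothetical, and the toy matrix stands in for a real file.

```python
import os
import tempfile

import numpy as np


def load_sparse_ld(npz_path):
    """Read a SciPy sparse .npz file and return it as a dense array.

    Assumption: the file was written with scipy.sparse.save_npz.
    """
    from scipy import sparse  # imported here so the helpers below work without SciPy
    return sparse.load_npz(npz_path).toarray()


def to_full_correlation(R_half):
    """Mirror one stored triangle into a full symmetric matrix with unit diagonal.

    Assumption: only one triangle is stored. If the file already holds the
    full symmetric matrix, skip this step, because it would double the
    off-diagonal entries.
    """
    R_half = np.asarray(R_half, dtype=float)
    R = R_half + R_half.T
    np.fill_diagonal(R, 1.0)
    return R


def save_txt_gz(R, out_path):
    """Write the matrix as whitespace-delimited text.

    np.savetxt compresses with gzip automatically when the name ends in .gz.
    """
    np.savetxt(out_path, R, fmt="%.6f")


# Toy lower-triangular matrix standing in for the content of a real .npz file.
half = np.array([
    [1.0, 0.0, 0.0],
    [-0.4, 1.0, 0.0],
    [0.2, 0.7, 1.0],
])
R_full = to_full_correlation(half)

# Round trip through a temporary .txt.gz file.
tmp_path = os.path.join(tempfile.mkdtemp(), "toy_ld.txt.gz")
save_txt_gz(R_full, tmp_path)
R_back = np.loadtxt(tmp_path)

# With a real file, the calls would look like this (file name from this thread):
# R_full = to_full_correlation(load_sparse_ld("chr1_3000001_6000001.npz"))
# save_txt_gz(R_full, "chr1_3000001_6000001.txt.gz")
```

The resulting .txt.gz file is plain whitespace-delimited text, so it should be readable in R with a standard text reader such as `read.table`. The negative entry in the toy matrix is kept on purpose: signed values are what distinguish Pearson correlations from r-squared values.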

Please let me know if you have any questions.

Best, Zikun

1667857557 commented 1 year ago

Dear Zikun,

Thanks for your reply. The LD matrix files downloaded for UK10K cover fixed intervals whose start positions are 1,000,000 bp apart (judging by the file names, each interval spans 3,000,000 bp). Each interval is provided as a pair of files, one .gz and one .npz, such as "chr1_3000001_6000001.gz" and "chr1_3000001_6000001.npz", or "chr1_4000001_7000001.gz" and "chr1_4000001_7000001.npz". To clarify, the .npz file contains the LD matrix without column and row names, while the corresponding .gz file contains the rsid values, which serve as the column and row names for the .npz file.

My question is about building a pairwise LD matrix for a known locus. For instance, consider the BRCA1 gene at chr17:41196312-41277381. The available LD matrix from UK10K covers the broader region chr17:40000000-50000000. I'm unsure whether it is appropriate to use the UK10K LD matrix directly for this purpose, or whether there is a way to extract the pairwise matrix for just the locus of interest.

I appreciate your assistance in advance.

Sincerely, Huang

ZikunY commented 1 year ago

Hi Huang,

If you only need the LD for a subset of variants, you still need to load the full files, but you can subset the data after import. The .gz file is a variant table that lists all the variants in the LD matrix (the .npz file). The variants (.gz) are loaded into a data frame df_ld_snps, and the LD matrix is loaded into a data frame df_R. If you are interested in the first ten variants, for example, you can subset both data frames to extract the data for those ten variants:

df_ld_snps = df_ld_snps.iloc[:10]  # first ten rows of the variant table
df_R = df_R.iloc[:10, :10]  # matching 10 x 10 block of the LD matrix

This is the answer I got from my colleague, as I have not worked with this type of data before. I hope this helps.
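For a locus defined by genomic coordinates rather than by row order, the same idea can be sketched by filtering the variant table on position and taking the matching rows and columns of the LD matrix. This is only an illustration: the column names `SNP` and `BP`, the helper name `subset_ld_by_region`, and the toy values are assumptions, so the column names should be checked against the actual .gz variant table.

```python
import numpy as np
import pandas as pd


def subset_ld_by_region(df_ld_snps, df_R, start, end, pos_col="BP"):
    """Keep variants with start <= position <= end, plus the matching LD block.

    Assumes the rows of df_ld_snps are in the same order as the rows and
    columns of df_R.
    """
    keep = df_ld_snps[pos_col].between(start, end).to_numpy()
    idx = np.flatnonzero(keep)  # integer positions of the retained variants
    df_snps_sub = df_ld_snps.iloc[idx].reset_index(drop=True)
    df_R_sub = df_R.iloc[idx, idx]  # same positions on both axes
    return df_snps_sub, df_R_sub


# Toy variant table and LD matrix (hypothetical values).
df_ld_snps = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3", "rs4"],
    "BP": [100, 200, 300, 400],
})
R = np.array([
    [1.0, 0.5, 0.2, 0.1],
    [0.5, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
])
snp_ids = df_ld_snps["SNP"].tolist()
df_R = pd.DataFrame(R, index=snp_ids, columns=snp_ids)

# Keep only the variants located between positions 150 and 350.
snps_sub, R_sub = subset_ld_by_region(df_ld_snps, df_R, 150, 350)
```

For the BRCA1 example from the question, the call would use `start=41196312` and `end=41277381` on the chromosome 17 files. Selecting by integer position on both axes keeps the variant table and the LD matrix aligned even when the matrix carries no row or column names.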

Best, Zikun

1667857557 commented 1 year ago

I got it, thanks for your assistance!

ZikunY / CARMA

LD Matrix File Format #15