JonJala / mama

MIT License
13 stars 4 forks source link

LD Reference #14

Closed samkleeman1 closed 3 years ago

samkleeman1 commented 3 years ago

Hi,

Congratulations on the preprint and many thanks for making this available.

I am just about to make our own LD reference files according to your detailed tutorial, to cover EUR, AFR, CSA and EAS ancestries. I was wondering what your thoughts were about using the 1000 Genomes and Human Genome Diversity Project callset provided by gnomad for the input (MAF >1%, variants and samples passing their QC)? This would be instead of the phase 3 imputed data that you used in the preprint. Alternatively I wondered about using the Biobank data as input, as our summary statistics are all from UKB. Very grateful for any thoughts or suggestions on this matter, and I would be happy to make the reference files available for others in future.

Kind regards,

Sam Kleeman PhD Student, Cold Spring Harbor Laboratory

paturley commented 3 years ago

Hi Sam!

Thanks! We are excited to finally have it out for people to start using.

That's a great question. I can't think of a reason that you couldn't use any data set as a reference panel as long as it is a good match for the GWAS samples that are used and the genotyping is dense enough. That would be my main considerations with using the UKB. One one hand, if you are using a UKB GWAS, then the ancestry matching would be perfect. However, if the there are SNPs that aren't included in the UKB data (perhaps because they weren't genotyped or weren't imputed well enough) and if there are big LD differences between the genotyped and ungenotyped SNPs between ancestries, that could affect MAMA's performance.

Best,. Patrick

samkleeman1 commented 3 years ago

Hi Patrick,

Many thanks for your detailed response, much appreciated. The imputation point is an important one, the recent TOPMED paper seems to suggest that HRC based imputation (i.e. what UKB used) for non-European ancestries does not perform very well. This might be a reason to use whole genome sequencing, which is directly genotyped rather than imputed. I am rather in two minds as to whether to go for the 1000G/HGDP panel (approx 3000 subjects, with well characterized ancestry assignments) versus using HRC-imputed UKB data.

Sam

samkleeman1 commented 3 years ago

Also - I am good to include four different ancestry groups as the input to python3 ../mama_ldscores.py

paturley commented 3 years ago

You should be able to use as many different ancestry groups as you'd like (though we haven't really tested more than 3). If you run into problems with four or more, please let us know.

On Thu, May 6, 2021 at 2:38 PM samkleeman1 @.***> wrote:

Also - I am good to include four different ancestry groups as the input to python3 ../mama_ldscores.py

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JonJala/mama/issues/14#issuecomment-833765494, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFBUB5OAKFBVRTKCEEP6AXLTMLOZJANCNFSM44GDHU5A .

samkleeman1 commented 3 years ago

This line...

all_snps = pd.DataFrame(pd.concat(bims.values())[1].drop_duplicates()) ...is creating some issues for me - in the UK biobank dataset there are some "duplicated" SNPs e.g. a single rsid refers to two or more single nucleotide polymorphisms e.g. A-->T and A-->G. This means that the following commands lead to the following error:

Length of values (9572556) does not match length of index (9564716)

Shall I just remove all "duplicated" SNPs from the input plink files?

paturley commented 3 years ago

Hmm. This is from the tutorial? I think that the software should be robust to duplicates, but it's possible that the tutorial isn't. In the meantime, removing duplicates should probably solve this problem, but we will try to fix this in the longer run. @JonJala Can you check on this?

Thanks!

samkleeman1 commented 3 years ago

Yes this is from the tutorial.

On an unrelated note I think you could consider adding the following lines after 'clean the ancestry file' in https://github.com/JonJala/mama/blob/092297fe357349bb525602353ba3e3b2dc3b339e/legacy/mama_ldscore.py. Otherwise it does not correctly merge together the two pandas dataframes.

array_indivs.IDList["IID"] = pd.to_numeric(array_indivs.IDList["IID"])
df_ances.IID = pd.to_numeric(df_ances.IID)
samkleeman1 commented 3 years ago

This line in the tutorial creates the duplicate issue

all_snps = pd.DataFrame(pd.concat(bims.values())[1].drop_duplicates())

JonJala commented 3 years ago

Ah, ok. Thanks for pointing this stuff out. The tutorial is a pretty recent thing and not as robust as the MAMA software should hopefully be (especially the main method, after all the LD score creation is done), and clearly there are still some kinks to work out.

ggoldman1 commented 3 years ago

Hi @samkleeman1 -- I'm going to help out with this.

Yes, I am assuming that the user does not have duplicated SNP's in their .bim file(s). Have you tried dropping duplicate SNPs in your .bim files (prior to running 1.make_defn_files.ipynb?) I think that's probably the cleanest way to deal with this.

samkleeman1 commented 3 years ago

Yes that's what I did, and that fixed the problem. But I guess we are removing most/all common multi-allelic SNPs, there are only a few thousands though.

Also re above, adding these lines works better

array_indivs.IDList["IID"] = array_indivs.IDList["IID"].astype(str)
df_ances.IID = df_ances.IID.astype(str)
ggoldman1 commented 3 years ago

Ok cool, I'm glad you were able to resolve this! I'm just trying to think if there's a way to keep the multi-allelic SNPs without any issues. I'll need to revisit our LDSC construction and get back to you. Just out of curiosity, are you able to run the standard LDSC module with your data?

samkleeman1 commented 3 years ago

I would have to double check and get back to you re LDSC

ggoldman1 commented 3 years ago

No worries! Was just curious but no need to deal with that (unless you want to).

Thanks for that suggestion on the mama_ldscore.py script, by the way! The second suggestion looks good to me (was worried with the first that casting to numeric would be problematic for IID's that contain non-numeric characters).

I'm going to close this out and add a note about dropping duplicate rsID's (at least for now) and look into your suggestion with mama_ldscore.py. Thanks again for raising this! Let us know if anything else comes up.

Best, Grant