Closed chenyangjjj closed 1 year ago
and one similar problem: I am converting rsid to chrpos:
data_Sasayama_2017 = pd.read_csv("/Users/jiangxiaofan/Desktop/Alzheimentrum/Lianne/pQTL_Novelty_check/GWAS_database/inputs2/data_Sasayama_2017_test.csv")
data_Sasayama_2017 = data_Sasayama_2017.head(1)
data_Sasayama_2017_Hg19_gl = gl.Sumstats(data_Sasayama_2017,rsid="ID",
other = ["Name","Entrez Gene Symbol","FDR"])
data_Sasayama_2017_Hg19_gl.rsid_to_chrpos(path=gl.get_path("1kg_dbsnp151_hg38_auto"))
data_Sasayama_2017_Hg19_gl.head()
Hi,
Sorry for the errors.
For the first one, would you please check if you also downloaded the index file .tbi
for the vcf file in ~/.gwaslab/
.
For the second one, the error is caused by duplicated IDs in "1kg_dbsnp151_hg38_auto". I will fix this in v3.4.23.
Hi, I download "00-All.vcf.gz" using gl.download_ref("dbsnp_v151_hg19") in ~/.gwaslab/.
And then I manually downloaded GCF_000001405.25.gz.tbi, and it still doesn't work."Could not retrieve index file"
or it better to manually download vcf:https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz and tbi:https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.tbi. ?
Hi,
I think the reason is that the tbi doesn't match the vcf file.
You may need to download 00-All.vcf.gz.tbi
in https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/ instead of GCF_000001405.25.gz.tbi
.
Or simply download both GCF_000001405.25.gz
and GCF_000001405.25.gz.tbi
and pass the path of GCF_000001405.25.gz to ref_rsid_vcf
Actually, gwaslab should automatically download the tbi, but it seems it is not working. I will check it and also update this in the next version.
Hi Yunye, Thanks for the reply. I downloaded the GCF_000001405.25.gz and GCF_000001405.25.gz.tbi by hand. Now it works! (limited internet use before and not abel to download and check).
It seems that The "infering rsid" is based on the CHR:BP:EA:NEA information. Can we just infer rsid from CHR:BP?
I have a sumstats without rsid, but many variants with unmatched ref allele and alternative allele. So those variants failed to infer rsid. Is there any solution for this?
Thank you so much!
Hi,
You can use harmonize
to align alleles using a reference genome, and then assign the rsIDs:
mysumstats.harmonize(
basic_check=False,
ref_seq = "/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta"
)
https://cloufield.github.io/gwaslab/tutorial_3.4/#harmonization
Since using only CHR:BP will cause many problems especially for indels, gwaslab use CHR:BP:EA:NEA to ensure the correctness.
Thanks for solution!
I still have one question: How gwaslab handle with incomplete multiallelic variants (like inferring rsid)?
For example: one SNPID: 1:2432331:A:C, Hg19 in my data. But it is a multiallelic variant(C>A,G). I failed to infer rsid from the "1:2432331:A:C".
The below is my code:
data_Carlos2021_Hg19_gl.harmonize(
basic_check=True,
ref_seq = "/Users/jiangxiaofan/.gwaslab/human_g1k_v37.fasta")
data_Carlos2021_Hg19_gl_harmonized = data_Carlos2021_Hg19_gl
data_Carlos2021_Hg19_gl_harmonized.basic_check()
data_Carlos2021_Hg19_gl_harmonized.assign_rsid(
ref_rsid_tsv = gl.get_path("1kg_dbsnp151_hg19_auto"),
ref_rsid_vcf = "/Users/jiangxiaofan/.gwaslab/GCF_000001405.25.vcf.gz",
chr_dict = gl.get_number_to_NC(build="19"),
n_cores = 6)
Thanks again!
Hi,
for ref_rsid_tsv
, it is just a SNPID-rsID conversion table, when SNPID (in the format of CHR:POS:NEA:EA) is matched, rsID will be assigned.
for ref_rsid_vcf
, gwaslab extract variants in the positions (CHR:POS) using tabix, and compare if NEA/EA or EA/NEA matches REF/ALT. If matched, rsID will be assigned.
I tested it on my device and it worked well.
For multiallelic variants, the ALT alleles in vcf file are a list, gwaslab will check each component. For rs4648641(C>A,G), the information extracted from vcf file will be CHR=1, POS=2432331, REF="C", and ALT=["A","G"]. gwaslab will assign rsID when NEA==REF and EA in ALT
or EA==REF and NEA in ALT
.
BTW, the SNPID format in gwaslab should be CHR:POS:NEA:EA, or CHR:POS:REF:ALT.
That's a little weird, still report "-rsID Annotation for 1 need to be fixed!" and not able to infer rs4648641 for same variants
I used the below codes to harmonize,
data_Carlos2021_Hg19_gl = gl.Sumstats(data_Carlos_Hg19,chrom="CHR",pos="BP",snpid="SNP",ea="altAllele",nea= "refAllele",
other = ["Start","Protein","Gene","P_value"])
data_Carlos2021_Hg19_gl.fix_id(fixchrpos=True,
fixid=True,
fixsep=True,
forcefixid=True,
overwrite=True)
data_Carlos2021_Hg19_gl.harmonize(
basic_check=True,
ref_seq = "/Users/jiangxiaofan/.gwaslab/human_g1k_v37.fasta")
and then:
any clues for it?
Hi, After basic_check, the STATUS code should be changed. Based on your results, the STATUS code (9969999) shows that this SNP was not checked yet. Assign_rsid will work only for variants that have been checked. Would you please run basic_check again before assign_rsid? (basic_check is integrated into harmonize, i am not sure why it was not working in your case. Could you please also provide the log for previous steps. Thanks. )
Example (I manually added you SNP; please see how the status code changed after each step):
Hi, I realized I gave the wrong file name for the parameter ref_rsid_vcf, It should be GCF_000001405.25.gz in my devices, not GCF_000001405.25.vcf.gz
I'm truly sorry for taking up so much of your time. And Thanks for the patience and help!
Hi, I am using assign_rsid function to infer rsid from SNPID information, But got some error about the reference file. Is something wrong with the scripts? Thanks!
Error report: