Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0
151 stars 25 forks source link

Could not retrieve index file #47

Closed chenyangjjj closed 1 year ago

chenyangjjj commented 1 year ago

Hi, I am using assign_rsid function to infer rsid from SNPID information, But got some error about the reference file. Is something wrong with the scripts? Thanks!

data_Carlos_Hg19_gl = gl.Sumstats(data_Carlos_Hg19,chrom="CHR",pos="BP",snpid="SNP",
                                  nea="refAllele",ea="altAllele",)
data_Carlos_Hg19_gl.basic_check()
data_Carlos_Hg19_gl.assign_rsid(
                        ref_rsid_vcf = gl.get_path('dbsnp_v151_hg19'), 
                        ref_rsid_tsv = gl.get_path("1kg_dbsnp151_hg19_auto"),
                        chr_dict = gl.get_number_to_NC(build="19"),
                        n_cores = 4)`

Error report:

image
chenyangjjj commented 1 year ago

and one similar problem: I am converting rsid to chrpos:

data_Sasayama_2017 = pd.read_csv("/Users/jiangxiaofan/Desktop/Alzheimentrum/Lianne/pQTL_Novelty_check/GWAS_database/inputs2/data_Sasayama_2017_test.csv")
data_Sasayama_2017 = data_Sasayama_2017.head(1)
data_Sasayama_2017_Hg19_gl = gl.Sumstats(data_Sasayama_2017,rsid="ID",
                                       other = ["Name","Entrez Gene Symbol","FDR"])
data_Sasayama_2017_Hg19_gl.rsid_to_chrpos(path=gl.get_path("1kg_dbsnp151_hg38_auto"))
data_Sasayama_2017_Hg19_gl.head()
image image
Cloufield commented 1 year ago

Hi, Sorry for the errors. For the first one, would you please check if you also downloaded the index file .tbi for the vcf file in ~/.gwaslab/. For the second one, the error is caused by duplicated IDs in "1kg_dbsnp151_hg38_auto". I will fix this in v3.4.23.

chenyangjjj commented 1 year ago

Hi, I download "00-All.vcf.gz" using gl.download_ref("dbsnp_v151_hg19") in ~/.gwaslab/.

And then I manually downloaded GCF_000001405.25.gz.tbi, and it still doesn't work."Could not retrieve index file"

or it better to manually download vcf:https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz and tbi:https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.tbi. ?

Cloufield commented 1 year ago

Hi, I think the reason is that the tbi doesn't match the vcf file. You may need to download 00-All.vcf.gz.tbi in https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/ instead of GCF_000001405.25.gz.tbi. Or simply download both GCF_000001405.25.gz and GCF_000001405.25.gz.tbi and pass the path of GCF_000001405.25.gz to ref_rsid_vcf

Cloufield commented 1 year ago

Actually, gwaslab should automatically download the tbi, but it seems it is not working. I will check it and also update this in the next version.

chenyangjjj commented 1 year ago

Hi Yunye, Thanks for the reply. I downloaded the GCF_000001405.25.gz and GCF_000001405.25.gz.tbi by hand. Now it works! (limited internet use before and not abel to download and check).

It seems that The "infering rsid" is based on the CHR:BP:EA:NEA information. Can we just infer rsid from CHR:BP?

I have a sumstats without rsid, but many variants with unmatched ref allele and alternative allele. So those variants failed to infer rsid. Is there any solution for this?

Thank you so much!

Cloufield commented 1 year ago

Hi,

You can use harmonize to align alleles using a reference genome, and then assign the rsIDs:

mysumstats.harmonize(
                     basic_check=False,
                     ref_seq     = "/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta"
)

https://cloufield.github.io/gwaslab/tutorial_3.4/#harmonization

Since using only CHR:BP will cause many problems especially for indels, gwaslab use CHR:BP:EA:NEA to ensure the correctness.

chenyangjjj commented 1 year ago

Thanks for solution!

I still have one question: How gwaslab handle with incomplete multiallelic variants (like inferring rsid)?

For example: one SNPID: 1:2432331:A:C, Hg19 in my data. But it is a multiallelic variant(C>A,G). I failed to infer rsid from the "1:2432331:A:C".

The below is my code:

data_Carlos2021_Hg19_gl.harmonize(
                     basic_check=True,
                     ref_seq     = "/Users/jiangxiaofan/.gwaslab/human_g1k_v37.fasta")

data_Carlos2021_Hg19_gl_harmonized = data_Carlos2021_Hg19_gl

data_Carlos2021_Hg19_gl_harmonized.basic_check()

data_Carlos2021_Hg19_gl_harmonized.assign_rsid( 
                        ref_rsid_tsv = gl.get_path("1kg_dbsnp151_hg19_auto"),
                        ref_rsid_vcf = "/Users/jiangxiaofan/.gwaslab/GCF_000001405.25.vcf.gz",
                        chr_dict = gl.get_number_to_NC(build="19"),
                        n_cores = 6)

Thanks again!

Cloufield commented 1 year ago

Hi, for ref_rsid_tsv, it is just a SNPID-rsID conversion table, when SNPID (in the format of CHR:POS:NEA:EA) is matched, rsID will be assigned. for ref_rsid_vcf, gwaslab extract variants in the positions (CHR:POS) using tabix, and compare if NEA/EA or EA/NEA matches REF/ALT. If matched, rsID will be assigned.

I tested it on my device and it worked well. image

Cloufield commented 1 year ago

For multiallelic variants, the ALT alleles in vcf file are a list, gwaslab will check each component. For rs4648641(C>A,G), the information extracted from vcf file will be CHR=1, POS=2432331, REF="C", and ALT=["A","G"]. gwaslab will assign rsID when NEA==REF and EA in ALT or EA==REF and NEA in ALT.

Cloufield commented 1 year ago

BTW, the SNPID format in gwaslab should be CHR:POS:NEA:EA, or CHR:POS:REF:ALT.

chenyangjjj commented 1 year ago

That's a little weird, still report "-rsID Annotation for 1 need to be fixed!" and not able to infer rs4648641 for same variants

I used the below codes to harmonize,

data_Carlos2021_Hg19_gl = gl.Sumstats(data_Carlos_Hg19,chrom="CHR",pos="BP",snpid="SNP",ea="altAllele",nea= "refAllele",
                                      other = ["Start","Protein","Gene","P_value"])
data_Carlos2021_Hg19_gl.fix_id(fixchrpos=True,
                fixid=True,
                fixsep=True,
                forcefixid=True,
                overwrite=True)
data_Carlos2021_Hg19_gl.harmonize(
                     basic_check=True,
                     ref_seq     = "/Users/jiangxiaofan/.gwaslab/human_g1k_v37.fasta")

and then:

image

any clues for it?

Cloufield commented 1 year ago

Hi, After basic_check, the STATUS code should be changed. Based on your results, the STATUS code (9969999) shows that this SNP was not checked yet. Assign_rsid will work only for variants that have been checked. Would you please run basic_check again before assign_rsid? (basic_check is integrated into harmonize, i am not sure why it was not working in your case. Could you please also provide the log for previous steps. Thanks. )

Example (I manually added you SNP; please see how the status code changed after each step): image

chenyangjjj commented 1 year ago

Hi, I realized I gave the wrong file name for the parameter ref_rsid_vcf, It should be GCF_000001405.25.gz in my devices, not GCF_000001405.25.vcf.gz

I'm truly sorry for taking up so much of your time. And Thanks for the patience and help!