Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0

VCF export #55

Open gmauro opened 9 months ago

gmauro commented 9 months ago

Hi,

I'm trying to export cleaned sumstats in VCF format, but I get the following error message:

Fri Sep 22 11:18:36 2023  -Since num_hg19 << num_hg38, assigning genome build hg38...
Fri Sep 22 11:18:51 2023 Finished inferring genome build version using hapmap3 SNPs...
                       SNPID  CHR       POS  EA NEA       EAF      BETA        SE         P     N   STATUS
0             chr1:39601:T:C    1     39601   C   T  0.001573  0.272299  0.179760  0.129825  9219  3860099
1            chr1:39743:TA:T    1     39743   T  TA  0.001573  0.272299  0.179760  0.129825  9219  3860399
2             chr1:71693:G:A    1     71693   A   G  0.000651 -0.022955  0.279190  0.934473  9219  3860099
3            chr1:80105:T:TA    1     80105  TA   T  0.000597  0.031138  0.291589  0.914959  9219  3860399
4             chr1:87048:A:G    1     87048   G   A  0.001952 -0.198541  0.161401  0.218656  9219  3860099
...                      ...  ...       ...  ..  ..       ...       ...       ...       ...   ...      ...
16082711  chr22:50803084:C:G   22  50803084   G   C  0.001790  0.229451  0.168551  0.173412  9219  3860099
16082712  chr22:50803384:C:G   22  50803384   G   C  0.000976 -0.035205  0.228032  0.877305  9219  3860099
16082713  chr22:50803843:C:G   22  50803843   G   C  0.102343 -0.011162  0.024340  0.646529  9219  3860099
16082714  chr22:50804129:A:T   22  50804129   T   A  0.017789 -0.000490  0.054171  0.992777  9219  3860099
16082715  chr22:50806802:G:A   22  50806802   A   G  0.005912 -0.077148  0.093127  0.407439  9219  3860099

[16082716 rows x 11 columns]
Fri Sep 22 11:18:52 2023 Start to format the output sumstats in:  vcf  format
Fri Sep 22 11:18:52 2023  -Formatting statistics ...
Fri Sep 22 11:19:14 2023  - Float statistics formats:
Fri Sep 22 11:19:14 2023   - Columns: ['EAF', 'BETA', 'SE', 'P']
Fri Sep 22 11:19:14 2023   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
Fri Sep 22 11:19:14 2023  - Start outputting sumstats in vcf format...
Fri Sep 22 11:19:16 2023  -vcf format will be loaded...
Fri Sep 22 11:19:16 2023  -vcf format meta info:
Fri Sep 22 11:19:16 2023   - format_name  :  vcf
Fri Sep 22 11:19:16 2023   - format_source  :  https://github.com/MRCIEU/gwas-vcf-specification/tree/1.0.0
Fri Sep 22 11:19:16 2023   - format_version  :  20220923
Fri Sep 22 11:19:16 2023   - format_citation  :  Lyon, M.S., Andrews, S.J., Elsworth, B. et al. The variant call format provides efficient and robust storage of GWAS summary statistics. Genome Biol 22, 32 (2021). https://doi.org/10.1186/s13059-020-02248-0
Fri Sep 22 11:19:16 2023   - format_fixed  :  ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT']
Fri Sep 22 11:19:16 2023   - format_format  :  ['ID', 'SS', 'ES', 'SE', 'LP', 'SI', 'EZ']
Fri Sep 22 11:19:16 2023  -gwaslab to vcf format dictionary:
Fri Sep 22 11:19:16 2023   - gwaslab keys: rsID,CHR,POS,NEA,EA,N,EAF,BETA,SE,MLOG10P,INFO,Z,SNPID
Fri Sep 22 11:19:16 2023   - vcf values: ID,#CHROM,POS,REF,ALT,SS,AF,ES,SE,LP,SI,EZ,ID
Fri Sep 22 11:19:38 2023  -Output path: ./test.vcf
Fri Sep 22 11:19:38 2023  -vcf header contig build:38
Traceback (most recent call last):
  File "check_build.py", line 13, in <module>
    mysumstats.to_format("./test",fmt="vcf",bgzip=True,tabix=True, build="38")
  File "/home/gianmauro.cuccuru/mambaforge/envs/gwaslab/lib/python3.8/site-packages/gwaslab/Sumstats.py", line 714, in to_format
    tofmt(output,
  File "/home/gianmauro.cuccuru/mambaforge/envs/gwaslab/lib/python3.8/site-packages/gwaslab/to_formats.py", line 315, in tofmt
    if verbose: log.write(" -Output columns:"," ".join(meta_data["format_fixed"]+[meta["gwaslab"]["study_name"]]))
TypeError: sequence item 9: expected str instance, NoneType found

This is my code:

mysumstats = gl.Sumstats(path, fmt="auto")
mysumstats.basic_check()
mysumstats.infer_build()
print(mysumstats.data)
mysumstats.to_format("./test",fmt="vcf",bgzip=True,tabix=True, build="38")

I have installed v3.4.26

Thanks in advance for your help.

Cloufield commented 9 months ago

Hi, thanks for reporting the error.

The VCF format requires a study ID (the equivalent of the sample ID in ordinary VCF files). You can set it with the study option when loading the sumstats, e.g. mysumstats = gl.Sumstats(path, fmt="auto", study="your_study_name_here"), or change it after loading with mysumstats.meta["gwaslab"]["study_name"] = "your_study_name_here".

The default is None, which causes the error. I guess I will change the default value in the next version.
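
For reference, the failure can be reproduced without gwaslab. The sketch below is not gwaslab code: format_fixed and meta are plain stand-ins copied from the log and the snippet above, output_columns is a hypothetical helper, and "my_study" is a placeholder name. It only illustrates why a None study name breaks the string join in the traceback and why setting a string avoids it.

```python
# Stand-in for the nine fixed VCF columns listed under format_fixed in the log.
format_fixed = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]

# Stand-in for mysumstats.meta; the study name is None unless the user sets it.
meta = {"gwaslab": {"study_name": None}}


def output_columns(meta):
    """Hypothetical helper: join the fixed columns with the study name."""
    return " ".join(format_fixed + [meta["gwaslab"]["study_name"]])


# With the default None, str.join raises TypeError on the tenth item (index 9).
try:
    output_columns(meta)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Setting a study name (placeholder value) before exporting avoids the error.
meta["gwaslab"]["study_name"] = "my_study"
columns = output_columns(meta)
print(columns)
```

In a real session the same assignment would be made on mysumstats.meta before calling to_format, as described above.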

gmauro commented 9 months ago

Thanks for the prompt reply. It works.