NVIDIA-Genomics-Research / GEPSi

GEPSi simulates phenotypes for GWAS data based on a range of conditions and biological assumptions.
Apache License 2.0
11 stars 6 forks source link

Error converting raw/bim to h5/matrix_header files #12

Open saurabhbelsare opened 2 years ago

saurabhbelsare commented 2 years ago

Hi, I'm trying to run a toy example with GEPSi, but am running into an error at the first step, to convert my genotypes to h5 format the way GEPSi expects. Here is my toy vcf generated from an msprime simulation:

##fileformat=VCFv4.2
##source=tskit 0.4.1
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  tsk_0indv   tsk_1indv   tsk_2indv   tsk_3indv   tsk_4indv   tsk_5indv   tsk_6indv   tsk_7indv   tsk_8indv   tsk_9indv
1   409 .   T   A   .   PASS    .   GT  0|1 0|0 0|1 0|0 0|0 0|0 0|0 0|0 1|0 0|0
1   752 .   C   T   .   PASS    .   GT  0|0 1|1 0|0 1|1 1|1 1|0 1|1 1|0 0|0 0|1
1   1043    .   C   A   .   PASS    .   GT  0|0 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
1   1207    .   A   C   .   PASS    .   GT  0|0 1|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0
1   1952    .   C   A   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0 0|0
1   2521    .   T   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0
1   2542    .   C   T   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0 0|0
1   2707    .   C   T   .   PASS    .   GT  0|0 0|1 0|0 1|0 1|1 0|0 1|1 1|0 0|0 0|1
1   4352    .   C   T   .   PASS    .   GT  0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|1 0|1 1|0
1   5416    .   T   A   .   PASS    .   GT  0|0 1|1 0|0 1|1 1|1 1|0 1|1 1|0 0|0 0|1
1   6656    .   A   G   .   PASS    .   GT  1|1 0|0 1|1 0|0 0|0 0|1 0|0 0|1 1|1 1|0
1   7850    .   T   A   .   PASS    .   GT  0|0 0|0 1|0 0|0 0|0 0|0 0|0 0|1 0|1 1|0
1   9263    .   A   T   .   PASS    .   GT  0|0 1|1 0|0 1|1 1|1 1|0 1|1 1|0 0|0 0|1

I converted this vcf to .raw and .bim files using both, plink1.9 and plink2 with these commands:

plink --vcf output.vcf --recode A --out output
plink --vcf output.vcf --make-just-bim --out output

I try to convert this to the h5/matrix_header format using the following command:

gepsi genotype --data_path ./ --matrix_name output.raw --snplist_name output.bim --ignore_gene_map --features "NONE" --annotation_name NONE

However, I'm getting errors at this step. They are different errors for plink1.9 and plink2. The error while trying to convert raw/bim files generated from plink1.9 is

OverflowError: can't convert negative value to hsize_t
Exception ignored in: 'tables.utilsextension.malloc_dims'
Traceback (most recent call last):
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/carray.py", line 250, in _g_create_common
    self._v_objectid = self._create_carray(self._v_new_title)
OverflowError: can't convert negative value to hsize_t
OverflowError: can't convert negative value to hsize_t
Exception ignored in: 'tables.utilsextension.malloc_dims'
Traceback (most recent call last):
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/carray.py", line 250, in _g_create_common
    self._v_objectid = self._create_carray(self._v_new_title)
OverflowError: can't convert negative value to hsize_t
Traceback (most recent call last):
  File "/home/sbelsare/software/anaconda3/envs/gepsi/bin/gepsi", line 8, in <module>
    sys.exit(main())
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/main.py", line 22, in main
    data = genotype_sim.simulate_genotype()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 36, in simulate_genotype
    return self.generate_genotype_file()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 87, in generate_genotype_file
    sf, risk_alleles = self.script_result_clean()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 136, in script_result_clean
    array_c = f.create_earray(f.root, 'data', tables.IntCol(), (0,num_feat), expectedrows=self.num_patients,filters=tables.Filters(complib='zlib', complevel=1))
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/file.py", line 1380, in create_earray
    ptobj = EArray(parentnode, name,
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/earray.py", line 158, in __init__
    super(EArray, self).__init__(parentnode, name, atom, shape, title,
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/carray.py", line 219, in __init__
    super(Array, self).__init__(parentnode, name, new, filters,
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/leaf.py", line 286, in __init__
    super(Leaf, self).__init__(parentnode, name, _log)
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/node.py", line 264, in __init__
    self._v_objectid = self._g_create()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/earray.py", line 182, in _g_create
    return self._g_create_common(self._v_expectedrows)
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/carray.py", line 250, in _g_create_common
    self._v_objectid = self._create_carray(self._v_new_title)
  File "tables/hdf5extension.pyx", line 1366, in tables.hdf5extension.Array._create_carray
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5S.c", line 1507, in H5Screate_simple
    invalid dataspace information

End of HDF5 error back trace

Problems creating the EArray.
Closing remaining open files:./genotype_100k_NONE.h5...done

and the error while trying to convert raw/bim files generated from plink2 is:

Traceback (most recent call last):
  File "/home/sbelsare/software/anaconda3/envs/gepsi/bin/gepsi", line 8, in <module>
    sys.exit(main())
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/main.py", line 22, in main
    data = genotype_sim.simulate_genotype()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 36, in simulate_genotype
    return self.generate_genotype_file()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 87, in generate_genotype_file
    sf, risk_alleles = self.script_result_clean()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 139, in script_result_clean
    self.h5_append(df,f)
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 183, in h5_append
    f.root.data.append(a)
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/earray.py", line 219, in append
    self._check_shape_append(nparr)
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/earray.py", line 196, in _check_shape_append
    raise ValueError(("the shapes of the appended object and the "
ValueError: the shapes of the appended object and the ``/data`` EArray differ in non-enlargeable dimension 1
Closing remaining open files:./genotype_100k_NONE.h5...done

Which version of plink should I use, and how should I fix the corresponding error to get the h5/matrix_header files?

Thank you!

mahdikafi commented 2 years ago

Hi, I hope you are doing well. I have the same problem. I wanted to create phenotypes for sim1000g simulated genotypes.

avantikalal commented 2 years ago

@mahdikafi can you try the following.