lnalinaf closed this issue 1 year ago
Where did the vcf file come from? Can you paste in the first several lines of data from the vcf? Perhaps it's formatted in a way we don't expect.
From Nebula and smth like that. For example,
chr1 13813 rs1213979446 T G 229.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.807e+00;DB;DP=19;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQRankSum=-3.456e+00;QD=12.09;ReadPosRankSum=0.847;SOR=0.321;VQSLOD=1.94;culprit=FS GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:12,7:19:99:0|1:13813_T_G:258,0,460:13813
chr1 13838 rs28428499 C T 253.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=1.15;DB;DP=20;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQRankSum=-3.795e+00;QD=12.69;ReadPosRankSum=-1.930e-01;SOR=0.392;VQSLOD=2.36;culprit=FS GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:12,8:20:99:0|1:13813_T_G:282,0,459:13813
or
1 10009 1_10009 A AC 0 RefCall . GT 0/0
1 10015 1_10015 A G 0 RefCall . GT 0/0
1 10021 1_10021 A G 0 RefCall . GT 0/0
1 10027 1_10027 A G 0 RefCall . GT 0/0
1 10033 1_10033 A G 0 RefCall . GT 0/0
1 10039 1_10039 A G 0 RefCall . GT 0/0
1 10045 1_10045 A G 0 RefCall . GT 0/0
1 10051 rs1052373574 A G 0 RefCall . GT 0/0
I don't know what 'smth' means, is that a bioinformatics tool?
Is there only one sample in the vcf file? What are you trying to do exactly?
It's produced by the personal genomics service 'Nebula Genomics'; I don't know what kind of variant caller they use. Yes, there is only one sample in the VCF file. I'm trying to convert the VCF to HDF5 format to use it with Keras.
VCF files can come in all kinds of non-standard and wacky formats. The vcf_to_hdf5() converter inside ipyrad.analysis tools is guaranteed to work with ipyrad VCF files (which is what it was designed for), and it often works with VCF files from other sources (like STACKS), but it is not guaranteed to work with arbitrary VCF files from other tools, because the VCF format is highly flexible. You might be able to compare an ipyrad VCF file with the VCF you have, figure out the differences, and manipulate your VCF into the format that vcf_to_hdf5 expects, but I would guess that would be a pretty big job. It's also going to output an HDF5 file that is formatted for internal ipyrad.analysis use, so unless you're planning on figuring out the structure of our HDF5 file, the output isn't going to be very useful for you, I'm afraid.
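As a hedged illustration of that kind of manipulation: the converter's parsing step (astype(bytes).view(np.int8) in the traceback further down) appears to assume single-base REF/ALT alleles, so one plausible first step is pre-filtering the VCF to drop indel records such as the A/AC RefCall line above. This is a sketch, not a tested recipe; the function name and file paths are placeholders:

```python
import gzip

def filter_snp_sites(src, dst):
    """Copy only VCF records whose REF and every ALT are single bases.

    Hedged sketch: drops indel records (e.g. an A -> AC insertion) that a
    converter assuming one byte per allele cannot represent.
    """
    opener = gzip.open if src.endswith(".gz") else open
    with opener(src, "rt") as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith("#"):            # keep all header lines
                fout.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            ref, alt = fields[3], fields[4]     # VCF columns 4 and 5: REF, ALT
            if len(ref) == 1 and all(len(a) == 1 for a in alt.split(",")):
                fout.write(line)
```

Even after such a filter there is no guarantee vcf_to_hdf5 accepts the rest of the Nebula format, for the reasons above.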
Good luck.
Oh, I see. Thanks a lot!
Hi!
I installed ipyrad as follows:
conda create -n ipyrad_env python=3.8
conda activate ipyrad_env
conda install -c conda-forge -c bioconda ipyrad
And the launch:
import ipyrad.analysis as ipa
import pandas as pd
converter = ipa.vcf_to_hdf5( name="test_convert", data="/data.vcf" )
converter.run()
gave me the error:
Indexing VCF to HDF5 database file
VCF: 4768231 SNPs; 167 scaffolds
[ ] 0% 0:00:00 | converting VCF to HDF5
Traceback (most recent call last):
File "/home/user/prs/probe/convert_hd5.py", line 8, in <module>
converter.run(force=True)
File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/ipyrad/analysis/vcf_to_hdf5.py", line 82, in run
self.build_chunked_matrix()
File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/ipyrad/analysis/vcf_to_hdf5.py", line 204, in build_chunked_matrix
genos, snps = chunk_to_arrs(chunkdf, self.nsamples)
File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/ipyrad/analysis/vcf_to_hdf5.py", line 522, in chunk_to_arrs
ref = chunkdf.iloc[:, 3].astype(bytes).view(np.int8).values
File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/core/series.py", line 818, in view
res_ser = self._constructor(res_values, index=self.index)
File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/core/series.py", line 442, in __init__
com.require_length_match(data, index)
File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/core/common.py", line 557, in require_length_match
raise ValueError(
ValueError: Length of values (18100000) does not match length of index (100000)
I tried different VCF files, plain .vcf and .vcf.gz, and also tried Python 3.10.8 with pandas 1.4.1 and pandas 1.5.2, but the same error occurred.
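For what it's worth, the ValueError is consistent with the astype(bytes).view(np.int8) step in the traceback meeting multi-base alleles: NumPy pads a byte-string column to the widest value, so the int8 view yields itemsize × nrows values instead of one per row (181-byte-wide strings × 100000 rows gives exactly the 18100000 reported). A minimal NumPy sketch of the mechanism, as an illustration rather than ipyrad's actual code:

```python
import numpy as np

# With single-base alleles the fixed-width byte dtype is S1, so viewing
# the buffer as int8 yields exactly one value per record:
ref_ok = np.array(["T", "C", "A"], dtype=bytes)      # dtype '|S1'
assert ref_ok.view(np.int8).size == 3

# One multi-base allele (like the "AC" insertion above) widens the dtype
# for the whole column; the int8 view now has itemsize * nrows values,
# producing the same kind of length mismatch the traceback reports:
ref_bad = np.array(["A", "AC", "G"], dtype=bytes)    # dtype '|S2'
assert ref_bad.view(np.int8).size == 6
```

This would explain why swapping Python or pandas versions made no difference: the mismatch comes from the data (indel records), not the libraries.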