dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

ValueError: Length of values does not match length of index #495

Closed lnalinaf closed 1 year ago

lnalinaf commented 1 year ago

Hi!

I installed ipyrad as follows:

    conda create -n ipyrad_env python=3.8
    conda activate ipyrad_env
    conda install -c conda-forge -c bioconda ipyrad

And the launch:

    import ipyrad.analysis as ipa
    import pandas as pd

    converter = ipa.vcf_to_hdf5(
        name="test_convert",
        data="/data.vcf",
    )
    converter.run()

gave me the error:

    Indexing VCF to HDF5 database file
    VCF: 4768231 SNPs; 167 scaffolds
    [ ] 0% 0:00:00 | converting VCF to HDF5
    Traceback (most recent call last):
      File "/home/user/prs/probe/convert_hd5.py", line 8, in <module>
        converter.run(force=True)
      File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/ipyrad/analysis/vcf_to_hdf5.py", line 82, in run
        self.build_chunked_matrix()
      File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/ipyrad/analysis/vcf_to_hdf5.py", line 204, in build_chunked_matrix
        genos, snps = chunk_to_arrs(chunkdf, self.nsamples)
      File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/ipyrad/analysis/vcf_to_hdf5.py", line 522, in chunk_to_arrs
        ref = chunkdf.iloc[:, 3].astype(bytes).view(np.int8).values
      File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/core/series.py", line 818, in view
        res_ser = self._constructor(res_values, index=self.index)
      File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/core/series.py", line 442, in __init__
        com.require_length_match(data, index)
      File "/home/user/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/core/common.py", line 557, in require_length_match
        raise ValueError(
    ValueError: Length of values (18100000) does not match length of index (100000)

I tried different VCF files, both plain .vcf and .vcf.gz, and also tried Python 3.10.8 with pandas 1.4.1 and pandas 1.5.2, but the same error occurred each time.
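A note on the numbers in the traceback: the failing line takes column 3 of the chunk (REF in a standard VCF), casts it with `astype(bytes)`, and reinterprets it with `view(np.int8)`. In NumPy, a bytes array has a fixed width equal to its longest item, so the int8 view has `items * width` values rather than one value per item. The reported lengths (18100000 vs. 100000) differ by a factor of 181, which would be consistent with a chunk whose longest REF allele is 181 bases, i.e. the file contains indels. This is an inference from the traceback and is not confirmed by the maintainers in this thread. A minimal NumPy-only sketch of the effect, using made-up alleles:

```python
import numpy as np

# Hypothetical REF alleles: two single-base SNP sites and one 4-base deletion site.
ref = np.array(["A", "ACGT", "G"])

# Casting to bytes gives a fixed-width dtype; every item is padded
# to the width of the longest allele (4 bytes here).
as_bytes = ref.astype(bytes)

# Viewing as int8 yields one value per byte, so items * width values in total.
as_int8 = as_bytes.view(np.int8)

print("items:", ref.size)
print("bytes per item:", as_bytes.dtype.itemsize)
print("int8 values:", as_int8.size)

# When every allele is a single base, the two lengths agree.
snps_only = np.array(["A", "C", "G"]).astype(bytes).view(np.int8)
print("SNP-only int8 values:", snps_only.size)
```

If this reading is right, a file restricted to single-base REF and ALT alleles would get past this particular line, although nothing in the thread guarantees the rest of the conversion would succeed.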

isaacovercast commented 1 year ago

Where did the vcf file come from? Can you paste in the first several lines of data from the vcf? Perhaps it's formatted in a way we don't expect.

lnalinaf commented 1 year ago

From Nebula and smth like that. For example,

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NG1CM126FP
    chr1 13813 rs1213979446 T G 229.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-1.807e+00;DB;DP=19;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQRankSum=-3.456e+00;QD=12.09;ReadPosRankSum=0.847;SOR=0.321;VQSLOD=1.94;culprit=FS GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:12,7:19:99:0|1:13813_T_G:258,0,460:13813
    chr1 13838 rs28428499 C T 253.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=1.15;DB;DP=20;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQRankSum=-3.795e+00;QD=12.69;ReadPosRankSum=-1.930e-01;SOR=0.392;VQSLOD=2.36;culprit=FS GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:12,8:20:99:0|1:13813_T_G:282,0,459:13813

or

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT default
    1 10009 1_10009 A AC 0 RefCall . GT 0/0
    1 10015 1_10015 A G 0 RefCall . GT 0/0
    1 10021 1_10021 A G 0 RefCall . GT 0/0
    1 10027 1_10027 A G 0 RefCall . GT 0/0
    1 10033 1_10033 A G 0 RefCall . GT 0/0
    1 10039 1_10039 A G 0 RefCall . GT 0/0
    1 10045 1_10045 A G 0 RefCall . GT 0/0
    1 10051 rs1052373574 A G 0 RefCall . GT 0/0

isaacovercast commented 1 year ago

I don't know what 'smth' means, is that a bioinformatics tool?

Is there only one sample in the vcf file? What are you trying to do exactly?

lnalinaf commented 1 year ago

It's produced by the personal genomics service 'Nebula Genomics'; I don't know which variant caller they use. Yes, there is only one sample in the VCF file. I'm trying to convert the VCF to HDF5 format to use it with keras.

isaacovercast commented 1 year ago

VCF files come in all kinds of non-standard and wacky formats. The vcf_to_hdf5() converter in the ipyrad.analysis tools is guaranteed to work with ipyrad VCF files (which is what it was designed for), and it often works with other VCF files (like those from STACKS), but it is not guaranteed to work with arbitrary VCF files from other sources, because the VCF format is highly flexible. You might be able to compare an ipyrad VCF file with the one you have and work out the differences, in order to manipulate your VCF into the format that vcf_to_hdf5 expects, but I would guess that would be a pretty big job. The converter also outputs an HDF5 file formatted for internal use by the ipyrad.analysis tools, so unless you're planning on figuring out the structure of our HDF5 file, the output isn't going to be very useful to you, I'm afraid.

Good luck.
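For readers in the same position, who only need single-sample genotypes as numbers for a downstream model, one possible route that bypasses ipyrad entirely is to read the GT field directly. The sketch below is not part of ipyrad and was not suggested in this thread. The function name `genotype_counts` is hypothetical, the input is assumed to be tab-delimited as in standard VCF, and the example records are trimmed versions of the lines quoted above (INFO and FORMAT shortened). It uses only the Python standard library and skips indels and multi-allelic sites.

```python
import io


def genotype_counts(vcf_lines, sample_index=0):
    """Yield (chrom, pos, alt_count) for biallelic single-base records.

    alt_count is the number of non-reference alleles in the genotype
    (0, 1 or 2 for a diploid call), or None if the genotype is missing.
    Hypothetical helper for illustration; assumes GT is listed in FORMAT.
    """
    for line in vcf_lines:
        # Skip meta lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        ref, alt, fmt = fields[3], fields[4], fields[8]
        # Keep only single-base REF/ALT pairs (drops indels and multi-allelic sites).
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue
        # Locate GT within the colon-separated FORMAT keys.
        gt_position = fmt.split(":").index("GT")
        gt = fields[9 + sample_index].split(":")[gt_position]
        # Treat phased (|) and unphased (/) genotypes the same way.
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            yield chrom, pos, None
        else:
            yield chrom, pos, sum(allele != "0" for allele in alleles)


# Trimmed example records based on the lines quoted in this thread.
example = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tdefault\n"
    "1\t10009\t1_10009\tA\tAC\t0\tRefCall\t.\tGT\t0/0\n"
    "1\t10015\t1_10015\tA\tG\t0\tRefCall\t.\tGT\t0/0\n"
    "chr1\t13813\trs1213979446\tT\tG\t229.77\tPASS\t.\tGT:AD:DP\t0|1:12,7:19\n"
)
rows = list(genotype_counts(example))
for row in rows:
    print(row)
```

The resulting counts can be collected into whatever array type the model expects. Whether a 0/1/2 encoding is appropriate depends on the analysis and is an assumption of this sketch.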

lnalinaf commented 1 year ago

Oh, I see. Thanks a lot!