dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

vcf2hdf5 #449

Closed pengyan19 closed 3 years ago

pengyan19 commented 3 years ago

This is my code:

converter = ipa.vcf_to_hdf5(name="yanfen_LD20K", data=vcf, ld_block_size=20000)

But I got the following error, which may be caused by pandas. Can you help me fix it?

ValueError: Length of passed values is 2262938, index implies 20026.

isaacovercast commented 3 years ago

What version of ipyrad are you using?

pengyan19 commented 3 years ago

I am using ipyrad 0.9.63 with pandas 1.0.3, and I suspect the pandas version is affecting the result. I also tried installing the software under Python 2.7, but that failed at this step too, because that install did not include the module for converting VCF to HDF5.

isaacovercast commented 3 years ago

Python 2.7 is no longer supported, so please use python3 and try again. Also please use the most recent version of ipyrad, as it's possible we've already fixed this problem. Let me know how it goes.

pengyan19 commented 3 years ago

I have tried Python 3.7, but I got the same error. Could you recommend which pandas or numpy version I should install?

isaacovercast commented 3 years ago

Python 3.7 should work. If you install ipyrad in a clean conda environment then it will pull down all required libraries.

conda create -n ipyrad_env python=3.8
conda activate ipyrad_env
conda install -c conda-forge -c bioconda ipyrad

This will install the most recent version of ipyrad and all required libraries.
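To confirm which versions actually ended up in the new environment, a small check along these lines may help. It is only a sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_versions` is made up for illustration, and it assumes the packages expose standard install metadata, which conda packages usually do.

```python
from importlib import metadata


def installed_versions(packages=("ipyrad", "pandas", "numpy")):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package has no install metadata in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the environment activated, so that the interpreter being checked is the one ipyrad will use.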

pengyan19 commented 3 years ago

I also built a new environment and used Python 3.8 to install the software, but I suspect the newer pandas version no longer supports this usage.

isaacovercast commented 3 years ago

This is what I have in a working environment:

numpy   1.19.4   py38hf0fd68c_1   conda-forge
pandas  1.1.4    py38h0ef3d22_0   conda-forge

pengyan19 commented 3 years ago

I still have the same problem. I built a new environment with Python 3.8 and installed ipyrad 0.9.78 in it. Could you email me a VCF file that you have tested, so that I can try it? My email is 1300538321@qq.com.

Indexing VCF to HDF5 database file
VCF: 20026 SNPs; 251 scaffolds
[ ] 0% 0:00:00 | converting VCF to HDF5
Traceback (most recent call last):
  File "03.vcf2hdf5.py", line 21, in <module>
    converter.run()
  File "/public/home/pengyan/anaconda3/envs/ipyrad/lib/python3.8/site-packages/ipyrad/analysis/vcf_to_hdf5.py", line 82, in run
    self.build_chunked_matrix()
  File "/public/home/pengyan/anaconda3/envs/ipyrad/lib/python3.8/site-packages/ipyrad/analysis/vcf_to_hdf5.py", line 204, in build_chunked_matrix
    genos, snps = chunk_to_arrs(chunkdf, self.nsamples)
  File "/public/home/pengyan/anaconda3/envs/ipyrad/lib/python3.8/site-packages/ipyrad/analysis/vcf_to_hdf5.py", line 522, in chunk_to_arrs
    ref = chunkdf.iloc[:, 3].astype(bytes).view(np.int8).values
  File "/public/home/pengyan/anaconda3/envs/ipyrad/lib/python3.8/site-packages/pandas/core/series.py", line 667, in view
    return self._constructor(
  File "/public/home/pengyan/anaconda3/envs/ipyrad/lib/python3.8/site-packages/pandas/core/series.py", line 313, in __init__
    raise ValueError(
ValueError: Length of passed values is 2262938, index implies 20026.
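One possible reading of the numbers in this traceback, which nobody in the thread confirms: the failing line converts a column to bytes and reinterprets it as `int8`. NumPy's fixed-width byte-string dtype stores every element at the width of the longest one, so such a reinterpretation yields rows × width values instead of one value per row. The passed length here is exactly 113 × 20026, which would be consistent with a longest entry of 113 characters in that column. The sketch below emulates the effect with the standard library only; `fixed_width_buffer` is a hypothetical helper, not part of ipyrad, pandas, or NumPy.

```python
def fixed_width_buffer(values):
    """Pad every string to the longest one and concatenate the bytes.

    This mimics how a fixed-width byte-string array is laid out in memory:
    one slot per row, each slot as wide as the longest string.
    """
    width = max(len(v) for v in values)
    padded = (v.encode("ascii").ljust(width, b"\x00") for v in values)
    return b"".join(padded), width


# Single-character entries give one byte per row, as the code expects.
snp_only = ["A", "C", "G", "T"]
buf_ok, width_ok = fixed_width_buffer(snp_only)

# One longer entry widens every slot, so the byte count no longer
# equals the row count.
with_long_entry = ["A", "C", "G", "ACGTACGT"]
buf_bad, width_bad = fixed_width_buffer(with_long_entry)

print(len(snp_only), width_ok, len(buf_ok))
print(len(with_long_entry), width_bad, len(buf_bad))

# The two lengths from the error message divide evenly.
width_in_error, remainder = divmod(2262938, 20026)
print(width_in_error, remainder)
```

If this reading is right, the mismatch comes from the data (one unusually long string in the column) rather than from a specific pandas release, which would fit with reinstalling not making a difference.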

isaacovercast commented 3 years ago

I emailed a vcf to test.

pengyan19 commented 3 years ago

OK, I found the cause of my problem: I have too many chromosomes. If I change the chromosome names to numbers, it works well.

I have another question about treemix. I want to use all of the SNPs, but ipyrad filtered them as shown below. Also, do I need to choose the best edge according to the likelihood? We don't know which tree explains more of the variation. Do you know how to calculate this?

Samples: 221
Sites before filtering: 19366
Filtered (indels): 0
Filtered (bi-allel): 0
Filtered (mincov): 0
Filtered (minmap): 19246
Filtered (subsample invariant): 2611
Filtered (minor allele frequency): 0
Filtered (combined): 19264
Sites after filtering: 102
Sites containing missing values: 52 (50.98%)
Missing values in SNP matrix: 615 (2.73%)
SNPs (total): 102
SNPs (unlinked): 98
subsampled 98 unlinked SNPs
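For anyone hitting the same conversion error, the workaround mentioned at the start of this comment could be sketched roughly as follows. This assumes that "changing to numbers" means replacing each name in the CHROM column with an integer; the thread does not show the exact steps used. The function name `rename_chroms_to_numbers` is hypothetical, and header lines (including any `##contig` entries) are passed through unchanged, so a real conversion may need to update those as well.

```python
def rename_chroms_to_numbers(lines):
    """Replace CHROM names with 1, 2, 3, ... in order of first appearance.

    Returns (new_lines, mapping). Header and blank lines are not modified.
    """
    mapping = {}
    new_lines = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            new_lines.append(line)
            continue
        # CHROM is the first tab-separated field of a VCF data line.
        chrom, sep, rest = line.partition("\t")
        if chrom not in mapping:
            mapping[chrom] = str(len(mapping) + 1)
        new_lines.append(mapping[chrom] + sep + rest)
    return new_lines, mapping


# Tiny made-up VCF, for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "scaffold_A\t10\t.\tA\tG\n",
    "scaffold_B\t25\t.\tC\tT\n",
    "scaffold_A\t40\t.\tG\tA\n",
]
renamed, mapping = rename_chroms_to_numbers(vcf_lines)
print(mapping)
print("".join(renamed), end="")
```

Keeping the returned mapping makes it possible to translate results back to the original scaffold names later.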

isaacovercast commented 3 years ago

Ah good, glad you figured out the vcf conversion problem. I will close this issue now that the original problem has been resolved.

As for the treemix question, this is less an ipyrad issue than it is a question about usage, which is more appropriate for the gitter channel: https://gitter.im/dereneaton/ipyrad

I am not sure I understand your question exactly. Can you please try restating your question and posting it to the gitter channel? Thanks!