dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

vcf_to_hdf5 and tetrad: 'str' object has no attribute 'decode' #451

Closed casparbein closed 3 years ago

casparbein commented 3 years ago

Hi,

To run a tetrad analysis on several SNP datasets, I took VCF files that were originally produced as output of an ipyrad assembly, filtered them with vcftools, and converted them into hdf5 files. I used the following command from the ipyrad analysis toolkit cookbook:

import ipyrad.analysis as ipa

converter = ipa.vcf_to_hdf5(
    name="0miss",
    data="~/0miss.recode.vcf.gz")
converter.run()

This runs without a problem. However, when I try to use the resulting file in a tetrad analysis with the following command:

tet = ipa.tetrad(
    name="octo",
    data="~/analysis-vcf2hdf5/0miss.snps.hdf5",
    nquartets=1e6,
    nboots=16,
)

I get the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-42-9df7aceb61a7> in <module>
      3     data="/data/home/wolfproj/wolfproj-03/analysis-vcf2hdf5/0miss.snps.hdf5",
      4     nquartets=1e6,
----> 5     nboots=16,
      6 )

/opt/miniconda3/envs/ipyrad/lib/python3.7/site-packages/tetrad/tetrad.py in __init__(self, name, data, workdir, nquartets, nboots, save_invariants, seed, load, *args, **kwargs)
    176         else:
    177             # if self.kwargs["initarr"]:
--> 178             self._init_seqarray()
    179 
    180         # check input files

/opt/miniconda3/envs/ipyrad/lib/python3.7/site-packages/tetrad/tetrad.py in _init_seqarray(self, quiet)
    334         assert ".snps.hdf5" in self.files.data, "data file is not .snps.hdf5"
    335         io5 = h5py.File(self.files.data, 'r')
--> 336         names = [i.decode() for i in io5["snps"].attrs["names"]]
    337         self.samples = names
    338         ntaxa = len(names)

/opt/miniconda3/envs/ipyrad/lib/python3.7/site-packages/tetrad/tetrad.py in <listcomp>(.0)
    334         assert ".snps.hdf5" in self.files.data, "data file is not .snps.hdf5"
    335         io5 = h5py.File(self.files.data, 'r')
--> 336         names = [i.decode() for i in io5["snps"].attrs["names"]]
    337         self.samples = names
    338         ntaxa = len(names)

AttributeError: 'str' object has no attribute 'decode'

I tried prefiltering indels and multiallelic SNPs, but apparently there are no indels in the VCF. The error occurs both with heavily filtered files and with the original output VCF produced by ipyrad (when converted to hdf5). When I use the snps.hdf5 file from the ipyrad output directly, however, I get no error and tetrad runs smoothly:

tet = ipa.tetrad(
    name="octo",
    data="~/ipyrad_assemblies_start/exclude_outfiles/exclude.snps.hdf5",
    nquartets=1e6,
    nboots=16,
)

## no error

Any idea what I am doing wrong? I am running ipyrad v. 0.9.65 and Python v.3.7.10 on a remote machine.

Thanks a lot in advance,

Bernhard

isaacovercast commented 3 years ago

Hello Bernhard. First, thank you for carefully including so much useful information in your issue; it's super helpful.

This is a known bug in the tetrad codebase. I have "fixed" it, but I don't have permissions on the repository to apply the fix. I was going to make a PR, but I didn't get around to it. I posted the diff that fixes this in the issue on the tetrad GitHub:

https://github.com/eaton-lab/tetrad/issues/5#issuecomment-872811206

If you can clone the tetrad repo and apply this diff, that will be the fastest way to get you going. Otherwise, watch that tetrad issue for when the fix is merged.
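The linked diff is not reproduced here. As a rough illustration only, the traceback shows `.decode()` being called on entries of the `names` attribute that are already `str`, so one way to avoid the error is to decode only the entries that are actually `bytes`. The helper name `decode_names` below is hypothetical and is not taken from tetrad or from the posted diff; it is a minimal sketch assuming the attribute may hold either `bytes` or `str` entries, depending on how the file was written.

```python
def decode_names(raw_names):
    """Return sample names as a list of str.

    Hypothetical helper (not part of tetrad): accepts an iterable whose
    entries may be bytes or str and decodes only the bytes entries.
    """
    names = []
    for item in raw_names:
        if isinstance(item, bytes):
            # Entries stored as bytes still need decoding to str.
            names.append(item.decode())
        else:
            # Entries that are already str have no .decode(); keep as text.
            names.append(str(item))
    return names


# Example with mixed input, mimicking the two kinds of files in this issue.
print(decode_names([b"sample_A", "sample_B"]))
```

Under that assumption, line 336 in the traceback would pass `io5["snps"].attrs["names"]` to such a helper instead of calling `i.decode()` on every entry unconditionally.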

Closing this ticket as a dupe of the tetrad one.