dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Popgen error AttributeError: 'str' object has no attribute 'decode' #492

Open mydjc opened 2 years ago

mydjc commented 2 years ago

I ran popgen following cookbook-popgen-sumstats.ipynb, and then this error occurred.

In:

popgen = Popgen(data=data, imap=imap)
popgen.params

Traceback (most recent call last):
  File "/home/mydjc/miniconda3/envs/ipyrad/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3398, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-5ea216490aea>", line 1, in <cell line: 1>
    popgen = Popgen(data=data, imap=imap)
  File "/home/mydjc/miniconda3/envs/ipyrad/lib/python3.10/site-packages/ipyrad/analysis/popgen.py", line 81, in __init__
    self._check_files(data)
  File "/home/mydjc/miniconda3/envs/ipyrad/lib/python3.10/site-packages/ipyrad/analysis/popgen.py", line 160, in _check_files
    self.snps[name.decode("utf-8")] = io5["snps"][idx]
AttributeError: 'str' object has no attribute 'decode'

As you suggested in https://github.com/eaton-lab/tetrad/issues/5#issuecomment-872811206, I modified the similar lines containing ".decode" in popgen.py and locus_extracter.py, because these lines raise the same AttributeError as above.

# original line:

self.snps[name.decode("utf-8")] = io5["snps"][idx]

# modified lines:
try:
    self.snps[name.decode("utf-8")] = io5["snps"][idx]
except AttributeError:
    self.snps[name] = io5["snps"][idx]

But that did not help, because I then got a new error, "ZeroDivisionError: float division by zero", with a traceback pointing to utils.py line 201.

So, is the code for this function not yet complete?

isaacovercast commented 2 years ago

Hello, thank you for pointing this out and for giving such a detailed bug report. The 'decode' error is an old py2.7/py3 compatibility issue, and yes, the fix that you propose in the code snippet is how I would suggest handling it.
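For anyone patching several `.decode` call sites at once, the same idea can be written as a small helper instead of repeating the try/except. This is only a sketch: `to_str` is a hypothetical name and is not part of ipyrad, and it assumes the sample names arrive as either `bytes` or `str` depending on the environment.

```python
def to_str(name, encoding="utf-8"):
    """Return a sample name as str, whether it arrives as bytes or str.

    Hypothetical helper, not part of ipyrad.
    """
    if isinstance(name, bytes):
        # bytes objects (including numpy.bytes_) need decoding
        return name.decode(encoding)
    # already text: leave it unchanged
    return str(name)


# Mixed input of the kind that triggers the AttributeError above
names = [b"T1-1", "T1-2"]
print([to_str(n) for n in names])
```

With such a helper, a line like the one in the traceback could become `self.snps[to_str(name)] = io5["snps"][idx]`.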

The divide by zero error is a different problem that's coming from some other part of the code, so if you could post the traceback for it that would be helpful. In general, the popgen module is not the most complete of the analysis tools, so I would not be surprised if there were still some bugs in it.

mydjc commented 2 years ago

Encountered an Error.
Message: float division by zero
Traceback (most recent call last):
  File "/home/mydjc/miniconda3/envs/ipyrad_py37/lib/python3.7/site-packages/ipyrad/core/Parallel.py", line 314, in wrap_run
    self.tool._run(ipyclient=self.ipyclient, **self.rkwargs)
  File "/home/mydjc/miniconda3/envs/ipyrad_py37/lib/python3.7/site-packages/ipyrad/analysis/popgen.py", line 266, in _run
    prog.update()
  File "/home/mydjc/miniconda3/envs/ipyrad_py37/lib/python3.7/site-packages/ipyrad/analysis/utils.py", line 41, in update
    hashes = '#' * int(self.progress / 5.)
  File "/home/mydjc/miniconda3/envs/ipyrad_py37/lib/python3.7/site-packages/ipyrad/analysis/utils.py", line 33, in progress
    return 100 * (self.finished / float(self.njobs))
ZeroDivisionError: float division by zero

isaacovercast commented 2 years ago

Hm, well it looks like this would only ever happen if the number of loci that is being processed is zero. Can you show me the cell where you create the 'Popgen' instance and also the output from that cell? If you pass in an imap or minmap that is too restrictive it will cause all the loci to be removed, and then the run will crash.
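To illustrate why zero loci leads to this crash, here is a minimal sketch of a progress tracker that guards against `njobs == 0`, plus an early check that fails with a clearer message. Both `SafeProgress` and `require_loci` are hypothetical names written for this example. They mirror the `progress` expression shown in the traceback but are not the ipyrad implementation.

```python
class SafeProgress:
    """Hypothetical stand-in for the progress tracker in the traceback."""

    def __init__(self, njobs):
        self.njobs = njobs
        self.finished = 0

    @property
    def progress(self):
        # With zero jobs there is nothing left to do, so report 100%
        # instead of dividing by zero.
        if not self.njobs:
            return 100.0
        return 100 * (self.finished / float(self.njobs))


def require_loci(nloci):
    """Fail early with a readable message when the filter removed every locus."""
    if nloci == 0:
        raise ValueError(
            "No loci passed the locus filter; check imap and minmap."
        )
    return nloci


prog = SafeProgress(njobs=0)
print(prog.progress)
```

The guard only hides the symptom. The early check is the more useful of the two, because an empty post-filter set means there is nothing to analyze.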

mydjc commented 2 years ago

In:

data = ipyrad.load_json("/run/media/mydjc/WinSto/Sinopodophyllum/test2/test2.json")
imap = {
    "reference": ["reference"],
    "T1": ["T1-1", "T1-2"],
    "T2": ["T2-1", "T2-2"],
    "Z": ["Z1-1"],
}
popgen = Popgen(data=data, imap=imap)
popgen.run(ipyclient=ipyclient)

out:

Parallel connection | mydjc-imac201: 11 cores
[locus filter] full data: 130690
[locus filter] post filter: 0

Encountered an Error.
Message: float division by zero
Traceback (most recent call last):
  File "/home/mydjc/miniconda3/envs/ipyrad_py37/lib/python3.7/site-packages/ipyrad/core/Parallel.py", line 314, in wrap_run
    self.tool._run(ipyclient=self.ipyclient, **self.rkwargs)
  File "/home/mydjc/miniconda3/envs/ipyrad_py37/lib/python3.7/site-packages/ipyrad/analysis/popgen.py", line 266, in _run
    prog.update()
  File "/home/mydjc/miniconda3/envs/ipyrad_py37/lib/python3.7/site-packages/ipyrad/analysis/utils.py", line 41, in update
    hashes = '#' * int(self.progress / 5.)
  File "/home/mydjc/miniconda3/envs/ipyrad_py37/lib/python3.7/site-packages/ipyrad/analysis/utils.py", line 33, in progress
    return 100 * (self.finished / float(self.njobs))
ZeroDivisionError: float division by zero

Contents of outfiles/test2_stats.txt:

## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                           total_filters applied_order retained_loci
total_prefiltered_loci                 0             0        305133
filtered_by_rm_duplicates              0             0        305133
filtered_by_max_indels                 0             0        305133
filtered_by_max_SNPs                6902          6902        298231
filtered_by_max_shared_het         59537         56271        241960
filtered_by_min_sample            111270        111270        130690
total_filtered_loci               177709        174443        130690

## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

           sample_coverage
reference           130690
T1-1                 79881
T1-2                 74038
T2-1                 88748
T2-2                 82434
Z1-1                 36703

## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci

   locus_coverage  sum_coverage
1               0             0
2           58102         58102
3           49021        107123
4           19298        126421
5            4269        130690
6               0        130690

## The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps
## The "reference" sample is included if present unless 'exclude_reference=True'

      var  sum_var    pis  sum_pis
0   30836        0  98691        0
1   11479    11479   9903     9903
2    8407    28293   6397    22697
3    7359    50370   4481    36140
4    6545    76550   3313    49392
5    6140   107250   2413    61457
6    5715   141540   1753    71975
7    5422   179494   1231    80592
8    5017   219630    803    87016
9    4647   261453    564    92092
10   4226   303713    435    96442
11   4025   347988    308    99830
12   3729   392736    158   101726
13   3309   435753    103   103065
14   2967   477291     60   103905
15   2692   517671     40   104505
16   2483   557399     15   104745
17   2137   593728      8   104881
18   1886   627676      6   104989
19   1720   660356      4   105065
20   1475   689856      0   105065
21   1323   717639      1   105086
22   1174   743467      1   105108
23    956   765455      2   105154
24    876   786479      0   105154
25    742   805029      0   105154
26    656   822085      0   105154
27    567   837394      0   105154
28    463   850358      0   105154
29    369   861059      0   105154
30    313   870449      0   105154
31    210   876959      0   105154
32    190   883039      0   105154
33    153   888088      0   105154
34    123   892270      0   105154
35    106   895980      0   105154
36     72   898572      0   105154
37     52   900496      0   105154
38     51   902434      0   105154
39     16   903058      0   105154
40     19   903818      0   105154
41     18   904556      0   105154
42      8   904892      0   105154
43      6   905150      0   105154
44      5   905370      0   105154
45      2   905460      0   105154
46      2   905552      0   105154
47      0   905552      0   105154
48      0   905552      0   105154
49      2   905650      0   105154

## Final Sample stats summary
      state  reads_raw  reads_passed_filter  refseq_mapped_reads  refseq_unmapped_reads  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
T1-1      7   13682541             13645443              6079794                7565649          361040            221379    0.055388   0.026210         140761             79881
T1-2      7   13961420             13932379              6418573                7513806          347608            207330    0.053587   0.026096         129944             74038
T2-1      7   34419819             34362364             16972741               17389623          416323            329392    0.056436   0.026109         178700             88748
T2-2      7   14733657             14713674              7198375                7515299          365665            232474    0.055524   0.025562         140738             82434
Z1-1      7   22123712             22083562              6085916               15997646          285921            143015    0.070033   0.029408          72794             36703

## Alignment matrix statistics:
snps matrix size: (6, 905650), 48.82% missing sites.
sequence matrix size: (6, 27793782), 44.03% missing sites.

isaacovercast commented 2 years ago

Yes, well you can see here this is exactly what's happening:

[locus filter] full data: 130690
[locus filter] post filter: 0

You have 0 loci that are shared among all samples (as shown in your stats file):

6               0        130690

If you pass in a population map file and no 'minmap' then it defaults to 4 samples per population (a somewhat permissive lower bound for calculating popgen sumstats).

More importantly, the popgen analysis tool calculates population summary statistics, and if the 'populations' you are assigning have only one or two individuals, you're not really going to get meaningful results.
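As a rough pre-flight check before calling `Popgen`, one could compare the size of each population in the imap with the minimum it is required to have. The sketch below is an assumption-laden illustration: `undersized_populations` is a hypothetical helper, and the default of 4 is taken from the comment above rather than from the ipyrad source.

```python
def undersized_populations(imap, minmap=None, default_min=4):
    """Return populations that have fewer samples than their required minimum.

    Hypothetical helper; default_min=4 follows the default described
    in this thread and is an assumption, not a verified ipyrad value.
    """
    minmap = minmap or {}
    short = {}
    for pop, samples in imap.items():
        required = minmap.get(pop, default_min)
        if len(samples) < required:
            # record (samples available, samples required)
            short[pop] = (len(samples), required)
    return short


# The imap posted earlier in this thread
imap = {
    "reference": ["reference"],
    "T1": ["T1-1", "T1-2"],
    "T2": ["T2-1", "T2-2"],
    "Z": ["Z1-1"],
}
print(undersized_populations(imap))
```

With this imap every population falls short of 4 samples, which is consistent with the post-filter count of 0 loci. Passing a `minmap` with lower values would make the check pass, but the caveat about one or two individuals per population still applies.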