Closed: alexkrohn closed this issue 2 years ago.
More info on the code I'm running. I'm not sure I could replicate this without my original dataset, mostly because this same code works on other datasets.
import ipyrad.analysis as ipa
import toyplot
import pandas as pd
# Load the data
data = "analysis-vcf2hdf5/structure-subset.snps.hdf5"
# Group into populations.
# Import from csv, then convert to dictionary
df = pd.read_csv("inds-and-pops-subset.txt", sep=',')
imap = df.groupby('pop.for.pca.structure')['ind'].apply(list).to_dict()
# Require that x% of the samples in each pop have data
minmap = {i: 0.001 for i in imap}
# init analysis object with input data and (optional) parameter options. Mincov is the proportion of SNPs shared across all samples
struct = ipa.structure(
    name="mate-subset-60inds",
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.80,
)
#Samples: 60
#Sites before filtering: 227971
#Filtered (bi-allel): 21737
#Filtered (mincov): 227271
#Filtered (minmap): 220599
#Filtered (combined): 227337
#Sites after filtering: 634
#Sites containing missing values: 631 (99.53%)
#Missing values in SNP matrix: 6240 (16.40%)
struct.mainparams.burnin = 10000
struct.mainparams.numreps = 25000
struct.run(nreps=20, kpop=[2,3,4,5], force = True, auto=True)
#Parallel connection | Alexs-MacBook-Air.local: 4 cores
#[####################] 100% 0:00:15 | running 80 structure jobs
etable = struct.get_evanno_table([2,3,4,5])
etable
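As a side note for anyone reproducing this, the imap/minmap construction above can be sanity-checked on a toy table; a minimal sketch (the DataFrame stands in for inds-and-pops-subset.txt, and the sample and population names are made up):

```python
import pandas as pd

# Toy stand-in for inds-and-pops-subset.txt
df = pd.DataFrame({
    "ind": ["s1", "s2", "s3", "s4"],
    "pop.for.pca.structure": ["north", "north", "south", "south"],
})

# imap: {population: [list of sample names in that population]}
imap = df.groupby("pop.for.pca.structure")["ind"].apply(list).to_dict()

# minmap: minimum fraction of samples per population required to have data
minmap = {pop: 0.001 for pop in imap}

print(imap)    # {'north': ['s1', 's2'], 'south': ['s3', 's4']}
print(minmap)  # {'north': 0.001, 'south': 0.001}
```

Printing imap should show each population mapped to its list of sample names, which is the shape the structure() constructor expects.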
Which leads to the first error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-8-4d8b0eaddda6> in <module>
----> 1 etable = struct.get_evanno_table([2,3,4,5])
2 etable
//anaconda/envs/py36/lib/python3.6/site-packages/ipyrad/analysis/structure.py in get_evanno_table(self, kvalues, max_var_multiple, quiet)
569 raise ValueError('max_variance_multiplier must be >1')
570
--> 571 table = _get_evanno_table(self, kvalues, max_var_multiple, quiet)
572 return table
573
//anaconda/envs/py36/lib/python3.6/site-packages/ipyrad/analysis/structure.py in _get_evanno_table(self, kpops, max_var_multiple, quiet)
930 for kpop in kpops:
931 ## concat results for k=x
--> 932 reps, excluded = _concat_reps(self, kpop, max_var_multiple, quiet)
933
934 ## report if some results were excluded
//anaconda/envs/py36/lib/python3.6/site-packages/ipyrad/analysis/structure.py in _concat_reps(self, kpop, max_var_multiple, quiet, **kwargs)
890 min_var_across_reps = np.min([i.var_lnlik for i in reps])
891 else:
--> 892 min_var_across_reps = reps[0].var_lnlik
893
894 ## iterate over reps
IndexError: list index out of range
Hi Alex, this will only happen if you have structure installed but it's not set as executable. ipyrad.analysis.structure tests for the presence of the structure binary and raises an error that helpfully suggests how to install it if it doesn't exist. The structure module tests with os.path.exists, so if the binary exists it passes this check. Later, when you call run(), if the binary exists but is not executable, the call will fail. I added a check for executability of the binaries (9a59482) that explains how to fix the mode using chmod. This will go out in the next conda package, but for now you'll have to check and fix the file permissions on the structure binary by hand. Good luck.
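The exists-but-not-executable distinction described here can be reproduced in a few lines of Python; a sketch using a throwaway file (names are illustrative), showing why an os.path.exists check passes while the execute bit is missing, and the os.chmod equivalent of the chmod fix:

```python
import os
import stat
import tempfile

# Throwaway stand-in for the structure binary: it exists on disk,
# but NamedTemporaryFile creates it without any execute bits.
with tempfile.NamedTemporaryFile("w", suffix="-structure", delete=False) as fh:
    fh.write("#!/bin/sh\necho hello\n")
    binary = fh.name

exists = os.path.exists(binary)        # passes an existence-only check
runnable = os.access(binary, os.X_OK)  # False: the execute bit is missing

# Equivalent of `chmod +x` on the command line:
mode = os.stat(binary).st_mode
os.chmod(binary, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
fixed = os.access(binary, os.X_OK)     # True once the mode is fixed

print(exists, runnable, fixed)  # True False True
os.unlink(binary)
```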
Interesting. These same exact commands worked on this machine previously, but not with this dataset. (The dataset later worked on another machine.) I'll keep playing around with it to see if I can figure out why the permissions were wrong during this instance, but worked fine in others.
Does it work now? Does it work on the exact same computer with a different dataset? The dataset used shouldn't change the behavior in this way if my hypothesis is correct. Are you certain that your jupyter notebook is running in the expected conda env (which has ipyrad and structure installed)? The same command can work differently on the same computer if you are running jupyter inside different conda environments.
It still does not work. The same code works on the same computer with a different dataset, both initiated from the same conda environment with ipyrad 0.9.58 running. I figured it was something to do with the VCF file (maybe because it was generated from vcftools), but I tried the same dataset on another machine just running structure (not through ipyrad or jupyter notebook), and that worked. I'm stumped.
The vcf file is imported and converted to hdf5 very early in the process of initializing the structure object, and yours seems to be importing fine. Any problems with the VCF format would arise at this stage, so I'm very dubious it's a problem with the vcf file. What is the name of the conda environment you're running in? Can you show me conda env list? Can you do this:
conda activate <your_ipyrad_environment>
which structure
ls -l /path/to/your/structure/binary
Of course you'll need to put in the name of your conda env, and the actual path rather than the fake path I used as a stand-in.
Can you do this:
ps -ef | grep notebook
Sure. I installed ipyrad in a Python 3.6 environment on my Mac, running OS 10.14.6.
And with the notebook actually running:
Mm. How about this: which ipcluster
Well, it certainly all looks like it should work.... The next thing to try is running the structure command by hand outside the notebook. The command we build inside the structure module looks like this:
cmd = [
    STRUCTURE,
    "-m", mname,
    "-e", ename,
    "-K", str(kpop),
    "-D", str(seed),
    "-N", str(ntaxa),
    "-L", str(nsites),
    "-i", sname,
    "-o", outname,
]
So if you can build this command by hand on the command line and run it (inside the py36 environment), that should tell us the problem. The -m, -e, and -i parameters should point to the temporary files created inside your structure analysis directory, like this:
structure -m structure-analysis/tmp-watdo-3-1.mainparams.txt -e structure-analysis/tmp-watdo-3-1.extraparams.txt -K 3 -d 1234 -i structure-analysis/tmp-watdo-3-1.strfile.txt
The rest of the parameters should be straightforward. Run this and let me know what the output is.
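If it's easier to iterate from Python than from the shell, the same invocation can be sketched with subprocess so that stdout/stderr are captured; the paths below are the placeholder ones from the example command above, not real files:

```python
import subprocess

# Placeholder paths copied from the example command; substitute the real
# temp files from your structure analysis directory.
cmd = [
    "structure",
    "-m", "structure-analysis/tmp-watdo-3-1.mainparams.txt",
    "-e", "structure-analysis/tmp-watdo-3-1.extraparams.txt",
    "-K", "3",
    "-D", "1234",
    "-N", "60",    # number of individuals
    "-L", "634",   # number of loci
    "-i", "structure-analysis/tmp-watdo-3-1.strfile.txt",
    "-o", "structure-analysis/tmp-watdo-3-1.out",
]
print(" ".join(cmd))

# Uncomment to actually run it (requires the structure binary on PATH):
# proc = subprocess.run(cmd, capture_output=True, text=True)
# print(proc.returncode)
# print(proc.stderr)
```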
You can also email me your vcf file and your "inds-and-pops-subset.txt" file.
I emailed the vcf and population assignment file.
Structure fails because there is no NUMIND or NUMLOCI line in the mainparams file.
I should have been more specific about completing the rest of the command-line arguments, my apologies. The NUMIND and NUMLOCI error is because you didn't include the -N and -L command-line arguments (these can be written to the file or provided on the command line, which is how ipyrad does it).
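For reference, the file-based form of those two settings in a mainparams file looks roughly like this, as I recall the Structure documentation (values taken from this thread):

```
#define NUMINDS    60     // number of individuals (-N on the command line)
#define NUMLOCI    634    // number of loci (-L on the command line)
```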
The vcf file you sent works for me:
though I wonder about this vcf file, because its number of SNPs is quite different from what is reported above. Did you send me a subset of the vcf file you originally posted about?
I suspect this is a 'mac' thing, since I'm running on linux. Can you try adding the -N and -L arguments to the structure command as you had it above, to see if it gets any further?
The other thing to try is updating to the most recent version of ipyrad (0.9.81); there's a non-zero probability that the problem you're reporting has already been fixed, if you're working with an older version.
I sent you the full vcf via email, sorry for the wrong file.
I used this command to run Structure:
structure -m tmp-mate-subset-60inds-3-1.mainparams.txt -e tmp-mate-subset-60inds-3-1.extraparams.txt -K 3 -d 1234 -i tmp-mate-subset-60inds-3-1.strfile.txt -N 60 -L 634
634 is the total number of sites remaining after filtering, not the total in the HDF5. Structure gives this error, which I'm not sure how to interpret. Why would it think there should be 1482 sites?
Would updating ipyrad just involve running conda update ipyrad while in the environment that has ipyrad installed? Or do I have to reinstall ipyrad completely using conda?
The new file also works for me:
conda install -c bioconda ipyrad --force-reinstall
If it isn't an ipyrad version issue then it's certainly a Mac issue; we don't see many Mac users, so bugs tend to hang around longer.
Interesting. I updated ipyrad and am still having the same issue, so I guess it must be a Mac issue. Luckily other datasets still work with this code, so if I find myself in this situation again, maybe I'll have better luck figuring out what causes this behavior. Thanks a lot for trying to work through it with me.
It's not a mac thing. It must be something goofy with your environment. I did a clean conda install on a mac using the data you sent and it works fine:
cd ~
rm -rf miniconda3
bash Miniconda*
conda create -n ipyrad python=3.8
conda activate ipyrad
conda install -c conda-forge -c bioconda ipyrad
conda install structure clumpp -c ipyrad
import ipyrad.analysis as ipa
import toyplot
import pandas as pd
# Load the data
#data = "subset-and-08maxmissing.recode.vcf"
data = "subset-only-for-ipyrad.recode.vcf"
# Group into populations.
# Import from csv, then convert to dictionary
df = pd.read_csv("inds-and-pops-subset.txt", sep=',')
imap = df.groupby('pop.for.pca.structure')['ind'].apply(list).to_dict()
# Require that x% of the samples in each pop have data
minmap = {i: 0.001 for i in imap}
# init analysis object with input data and (optional) parameter options. Mincov is the proportion of SNPs shared across all samples
struct = ipa.structure(
    name="mate-subset-60inds",
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.80,
)
Converting vcf to HDF5 using default ld_block_size: 20000
Typical RADSeq data generated by ipyrad/stacks will ignore this value.
You can use the ld_block_size parameter of the structure() constructor to change
this value.
Indexing VCF to HDF5 database file
VCF: 227971 SNPs; 78927 scaffolds
[ ] 0% 0:00:00 | converting VCF to HDF5
This appears to be a denovo assembly, ld_block_size arg is being ignored.
[######## ] 43% 0:00:34 | converting VCF to HDF5
This appears to be a denovo assembly, ld_block_size arg is being ignored.
[################# ] 87% 0:01:04 | converting VCF to HDF5
This appears to be a denovo assembly, ld_block_size arg is being ignored.
[####################] 100% 0:01:12 | converting VCF to HDF5
HDF5: 227971 SNPs; 78928 linkage group
SNP database written to ./analysis-vcf2hdf5/subset-only-for-ipyrad.recode.snps.hdf5
Samples: 60
Sites before filtering: 227971
Filtered (indels): 0
Filtered (bi-allel): 21737
Filtered (mincov): 227271
Filtered (minmap): 220599
Filtered (subsample invariant): 35192
Filtered (minor allele frequency): 0
Filtered (combined): 229829
Sites after filtering: 509
Sites containing missing values: 508 (99.80%)
Missing values in SNP matrix: 5048 (16.53%)
SNPs (total): 509
SNPs (unlinked): 259
struct.mainparams.burnin = 10000
struct.mainparams.numreps = 25000
struct.run(nreps=20, kpop=[2,3,4,5], force = True, auto=True)
[###### ] 30% 0:03:06 | running 80 structure jobs
Good luck!
Very weird indeed! I'll keep digging. Thanks again for going through it in such depth with me.
Encountered the same issue, due to ipyclient = ipp.Client(cluster_id="ipyrad") running in a different conda environment with a different Python version: ipyclient was running on Python 3.10 and structure was running on Python 3.7.
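One way to check for that kind of mismatch from inside the notebook is to compare interpreters; a hedged sketch (the cluster_id matches the one above, and the helper name is made up):

```python
import sys

def engine_interpreters(cluster_id="ipyrad"):
    """Return the set of Python interpreters the ipcluster engines report,
    or None if no cluster with that id is reachable. (Helper name made up.)"""
    try:
        import ipyparallel as ipp
        client = ipp.Client(cluster_id=cluster_id)
        return set(client[:].apply_sync(lambda: __import__("sys").executable))
    except Exception:
        return None

engines = engine_interpreters()
print("notebook:", sys.executable)
print("engines: ", engines)
# If the set contains anything other than sys.executable, the cluster was
# started from a different conda env, and its engines will import a
# different ipyrad (or none at all).
```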
I'm running Structure from ipyrad analysis on jupyter (v 4.6.1) notebook (v 5.7.10) on Mac OSX (10.14.6). I've gotten Structure to work multiple times in the past on the same machine, but for some reason this dataset is failing.
Basically, instead of taking ~2 hours to run,
struct.run(nreps=20, kpop=[2,3,4,5], force = True, auto=True)
takes ~10 seconds to run, generates a bunch of temp strfiles, mainparams, and extraparams files, then ends with no explanation. It seems obvious the run did not complete, but I'm not sure why. Here's a link to the HDF5 and the ipynb that I'm running.
When I go to extract the evanno table, it fails with an indexing error (understandably). Other than that and the quick run time, I wouldn't have any idea that the run failed. I've tried running different values of k, restarting the notebook, etc. Other Structure runs on different datasets continue to work.
If it's helpful, the vcf used to generate this HDF5 was made by removing a bunch of individuals from the original ipyrad output vcf, using vcftools --remove and a file of individuals to remove. Any idea what's going on? It might be a useful feature to give an error report when the .run() function fails.