Structure fails without explanation

alexkrohn commented 2 years ago

I'm running Structure from ipyrad analysis on jupyter (v 4.6.1) notebook (v 5.7.10) on Mac OSX (10.14.6). I've gotten Structure to work multiple times in the past on the same machine, but for some reason this dataset is failing.

Basically, instead of taking ~2 hours to run, struct.run(nreps=20, kpop=[2,3,4,5], force = True, auto=True) takes ~10 seconds to run, generates a bunch of temp strfiles, mainparams and extraparams, then ends. No explanation given. It seems obvious the run did not complete, but I'm not sure why.

Here's a link to the HDF5 and the ipynb that I'm running.

When I go to extract the evanno table, it fails with an indexing error (understandably). Other than that and the quick run time, I wouldn't have any idea that the run failed. I've tried running different values of k, restarting the notebook, etc. Other Structure runs on different datasets continue to work.

If it's helpful, the vcf used to generate this HDF5 was made by removing a bunch of individuals from the original ipyrad output vcf using vcftools --remove and a file of individuals to remove.

Any idea what's going on? It might be a useful feature to give an error report when the .run() function fails.

alexkrohn commented 2 years ago

More info on the code that I'm running. I'm not sure if I could replicate this without my original dataset, mostly becuase this same code works on other datasets.

import ipyrad.analysis as ipa
import toyplot
import pandas as pd

# Load the data
data = "analysis-vcf2hdf5/structure-subset.snps.hdf5"

# Group into populations. 

# Import from csv, then convert to dictionary
df = pd.read_csv("inds-and-pops-subset.txt", sep=',')

imap = df.groupby('pop.for.pca.structure')['ind'].apply(list).to_dict()

# Require that x% of the samples in each pop have data
minmap = {i: 0.001 for i in imap}

# init analysis object with input data and (optional) parameter options. Mincov is the proportion of SNPs shared across all samples
struct = ipa.structure(
    name="mate-subset-60inds",
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.80,
)

#Samples: 60
#Sites before filtering: 227971
#Filtered (bi-allel): 21737
#Filtered (mincov): 227271
#Filtered (minmap): 220599
#Filtered (combined): 227337
#Sites after filtering: 634
#Sites containing missing values: 631 (99.53%)
#Missing values in SNP matrix: 6240 (16.40%) 

struct.mainparams.burnin = 10000
struct.mainparams.numreps = 25000

struct.run(nreps=20, kpop=[2,3,4,5], force = True, auto=True)

#Parallel connection | Alexs-MacBook-Air.local: 4 cores
#[####################] 100% 0:00:15 | running 80 structure jobs

etable = struct.get_evanno_table([2,3,4,5])
etable

Which leads to the first error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-8-4d8b0eaddda6> in <module>
----> 1 etable = struct.get_evanno_table([2,3,4,5])
      2 etable

//anaconda/envs/py36/lib/python3.6/site-packages/ipyrad/analysis/structure.py in get_evanno_table(self, kvalues, max_var_multiple, quiet)
    569                 raise ValueError('max_variance_multiplier must be >1')
    570 
--> 571         table = _get_evanno_table(self, kvalues, max_var_multiple, quiet)
    572         return table
    573 

//anaconda/envs/py36/lib/python3.6/site-packages/ipyrad/analysis/structure.py in _get_evanno_table(self, kpops, max_var_multiple, quiet)
    930     for kpop in kpops:
    931         ## concat results for k=x
--> 932         reps, excluded = _concat_reps(self, kpop, max_var_multiple, quiet)
    933 
    934         ## report if some results were excluded

//anaconda/envs/py36/lib/python3.6/site-packages/ipyrad/analysis/structure.py in _concat_reps(self, kpop, max_var_multiple, quiet, **kwargs)
    890             min_var_across_reps = np.min([i.var_lnlik for i in reps])
    891         else:
--> 892             min_var_across_reps = reps[0].var_lnlik
    893 
    894         ## iterate over reps

IndexError: list index out of range

isaacovercast commented 2 years ago

Hi Alex, this will only happen if you have structure installed but it's not set as executable. ipyrad.analysis.structure tests for the presence of the structure binary and raises an error which helpfully suggests how to install this if it doesn't exist. The structure module tests with os.path.exists, so if the binary exists then it passes this check. Later, when you call run(), if the binary exists, but is not set to be executable then the call will fail. I added a check for executability of the binaries (9a59482) that explains how to fix the mode using chmod. This will go out in the next conda package, but for now you'll have to check and fix the file permissions on the structure binary by hand. Good luck.

alexkrohn commented 2 years ago

Interesting. These same exact commands worked on this machine previously, but not with this dataset. (The dataset later worked on another machine.) I'll keep playing around with it to see if I can figure out why the permissions were wrong during this instance, but worked fine in others.

isaacovercast commented 2 years ago

Does it work now? Does it work on the exact same computer with a different dataset? The dataset used shouldn't change the behavior in this way if my hypothesis is correct. Are you certain that your jupyter notebook is running in the expected conda env (which has ipyrad and structure installed)? The same command can work differently on the same computer if you are running jupyter inside different conda environments.

alexkrohn commented 2 years ago

It still does not work. The same code works on the same computer with a different dataset, both initiated from the same conda environment with ipyrad 0.9.58 running. I figured it was something to do with the VCF file (maybe because it was generated from vcftools), but I tried the same dataset on another machine just running structure (not through ipyrad or jupyter notebook), and that worked. I'm stumped.

On Mon, Oct 4, 2021 at 9:32 AM Isaac Overcast @.***> wrote:

Does it work now? Does it work on the exact same computer with a different dataset? The dataset used shouldn't change the behavior in this way if my hypothesis is correct. Are you certain that your jupyter notebook is running in the expected conda env (which has ipyrad and structure installed)? The same command can work differently on the same computer if you are running jupyter inside different conda environments.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dereneaton/ipyrad/issues/464#issuecomment-933490483, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAEIHA5I5UDNKDRLGLWN6DUFGUG5ANCNFSM5ESBD3YQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

isaacovercast commented 2 years ago

The vcf file is imported and converted to hdf5 very early in the process of initializing the structure object, and yours seems to be importing fine. Any problems with the VCF format would arise at this stage, so I'm very dubious its a problem with the vcf file. What is the name of the conda environment you're running in? Can you show me conda env list? Can you do this:

conda activate <your_ipyrad_environment>
which structure
ls -l /path/to/your/structure/binary

Of course you'll need to put the name of your conda env there, and also the actual path, not the fake path I put as a standin.

isaacovercast commented 2 years ago

Can you do this:

ps -ef | grep notebook

alexkrohn commented 2 years ago

Sure. I installed ipyrad in a Python 3.6 environment on my Mac, running OS 10.14.6 Screen Shot 2021-10-04 at 9 58 30 AM

alexkrohn commented 2 years ago

And with the notebook actually runnin Screen Shot 2021-10-04 at 10 01 03 AM g

isaacovercast commented 2 years ago

Mm. How about this: which ipcluster

alexkrohn commented 2 years ago

Screen Shot 2021-10-04 at 10 09 34 AM

isaacovercast commented 2 years ago

Well, it certainly all looks like it should work.... The next thing to try is running the structure command by hand outside the notebook. The command we build inside the structure module looks like this:

    cmd = [
        STRUCTURE,
        "-m", mname,
        "-e", ename,
        "-K", str(kpop),
        "-D", str(seed),
        "-N", str(ntaxa),
        "-L", str(nsites),
        "-i", sname,
        "-o", outname,
    ]

So if you can build this command by hand on the command line and run it (inside the py36 environment) then this should tell us the problem. The -m, -e, and -i parameters should indicate the temporary files created inside your structure analysis directory, like this:

structure -m structure-analysis/tmp-watdo-3-1.mainparams.txt -e structure-analysis/tmp-watdo-3-1.extraparams.txt -K 3 -d 1234 -i   structure-analysis/tmp-watdo-3-1.strfile.txt

The rest of the parameters should be straightforward. Run this and let me know what the output is.

isaacovercast commented 2 years ago

You can also email me your vcf file and your "inds-and-pops-subset.txt" file.

alexkrohn commented 2 years ago

I emailed the vcf and population assignment file.

Structure fails because there is no NUMIND or NUMLOCI line in the mainparams file.

tmp-mate-subset-60inds-3-1.mainparams.txt Screen Shot 2021-10-04 at 10 33 22 AM

isaacovercast commented 2 years ago

I should have been more specific about completing the rest of the command line arguments, my apologies. The NUMIND and NUMLOCI error are because you didn't include the -N and -L command line arguments (these can be written to the file or provided on the command line, which is how ipyrad does it).

isaacovercast commented 2 years ago

The vcf file you sent works for me:

though I wonder about the vcf file because it is quite different in terms of number of snps than what is reported above. Did you send me a subset of the vcf file you originally posted about?

isaacovercast commented 2 years ago

I suspect this is a 'mac' thing, since i'm running on linux. Can you try adding the -N and -L arguments to the structure command as you had it above to see if it gets any further?

The other thing to try is updating to the most recent version of ipyrad (0.9.81) there's a non-zero probability that this problem you are reporting has already been fixed, if you're working with an older version.

alexkrohn commented 2 years ago

I sent you the full vcf via email, sorry for the wrong file.

I used this command to run Structure:

structure -m tmp-mate-subset-60inds-3-1.mainparams.txt -e tmp-mate-subset-60inds-3-1.extraparams.txt -K 3 -d 1234 -i tmp-mate-subset-60inds-3-1.strfile.txt -N 60 -L 634

634 is the total number of sites remaining after filtering, not the total in the HDF5. Structure gives this error, which I'm not sure how to interpret. Why would it think there should be 1482 sites?

Screen Shot 2021-10-04 at 11 13 34 AM

Would updating ipyrad just involve running conda update ipyrad while in the environment that has ipyrad installed? Or do I have to reinstall ipyrad completely using conda?

isaacovercast commented 2 years ago

The new file also works for me:

conda install -c bioconda ipyrad --force-reinstall

If it isn't an ipyrad version issue then its certainly a mac issue, we don't see many mac users, so bugs tend to hang around longer.

alexkrohn commented 2 years ago

Interesting. I updated ipyrad and am still having the same issue. I guess it must be a Mac issue. Luckily other datasets still work with this code, so if I find myself in this situation again, maybe I'll have better luck figuring out what causes this behavior. Thanks a lot for trying to work though it with me.

isaacovercast commented 2 years ago

It's not a mac thing. It must be something goofy with your environment. I did a clean conda install on a mac using the data you sent and it works fine:

cd ~
rm -rf miniconda3
bash Miniconda*
conda create -n ipyrad python=3.8
conda activate ipyrad
conda install -c conda-forge -c bioconda ipyrad
conda install structure clumpp -c ipyrad

import ipyrad.analysis as ipa
import toyplot
import pandas as pd

# Load the data
#data = "subset-and-08maxmissing.recode.vcf"
data = "subset-only-for-ipyrad.recode.vcf"

# Group into populations. 

# Import from csv, then convert to dictionary
df = pd.read_csv("inds-and-pops-subset.txt", sep=',')

imap = df.groupby('pop.for.pca.structure')['ind'].apply(list).to_dict()

# Require that x% of the samples in each pop have data
minmap = {i: 0.001 for i in imap}

# init analysis object with input data and (optional) parameter options. Mincov is the proportion of SNPs shared across all samples
struct = ipa.structure(
    name="mate-subset-60inds",
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.80,
)

Converting vcf to HDF5 using default ld_block_size: 20000
Typical RADSeq data generated by ipyrad/stacks will ignore this value.
You can use the ld_block_size parameter of the structure() constructor to change
this value.

Indexing VCF to HDF5 database file
VCF: 227971 SNPs; 78927 scaffolds
[                    ]   0% 0:00:00 | converting VCF to HDF5 
This appears to be a denovo assembly, ld_block_size arg is being ignored.
[########            ]  43% 0:00:34 | converting VCF to HDF5 
This appears to be a denovo assembly, ld_block_size arg is being ignored.
[#################   ]  87% 0:01:04 | converting VCF to HDF5 
This appears to be a denovo assembly, ld_block_size arg is being ignored.
[####################] 100% 0:01:12 | converting VCF to HDF5 
HDF5: 227971 SNPs; 78928 linkage group
SNP database written to ./analysis-vcf2hdf5/subset-only-for-ipyrad.recode.snps.hdf5
Samples: 60
Sites before filtering: 227971
Filtered (indels): 0
Filtered (bi-allel): 21737
Filtered (mincov): 227271
Filtered (minmap): 220599
Filtered (subsample invariant): 35192
Filtered (minor allele frequency): 0
Filtered (combined): 229829
Sites after filtering: 509
Sites containing missing values: 508 (99.80%)
Missing values in SNP matrix: 5048 (16.53%)
SNPs (total): 509
SNPs (unlinked): 259

struct.mainparams.burnin = 10000
struct.mainparams.numreps = 25000

struct.run(nreps=20, kpop=[2,3,4,5], force = True, auto=True)

[######              ]  30% 0:03:06 | running 80 structure jobs

Good luck!

alexkrohn commented 2 years ago

Very weird indeed! I'll keep digging. Thanks again for going through it in such depth with me.

On Mon, Oct 4, 2021 at 12:03 PM Isaac Overcast @.***> wrote:

It's not a mac thing. It must be something goofy with your environment. I did a clean conda install on a mac using the data you sent and it works fine:

cd ~ rm -rf miniconda3 bash Miniconda* conda create -n ipyrad python=3.8 conda activate ipyrad conda install -c conda-forge -c bioconda ipyrad conda install structure clumpp -c ipyrad

import ipyrad.analysis as ipa import toyplot import pandas as pd

Load the data

data = "subset-and-08maxmissing.recode.vcf"

data = "subset-only-for-ipyrad.recode.vcf"

Group into populations.

Import from csv, then convert to dictionary

df = pd.read_csv("inds-and-pops-subset.txt", sep=',')

imap = df.groupby('pop.for.pca.structure')['ind'].apply(list).to_dict()

Require that x% of the samples in each pop have data

minmap = {i: 0.001 for i in imap}

init analysis object with input data and (optional) parameter options. Mincov is the proportion of SNPs shared across all samples

struct = ipa.structure( name="mate-subset-60inds", data=data, imap=imap, minmap=minmap, mincov=0.80, )

Converting vcf to HDF5 using default ld_block_size: 20000 Typical RADSeq data generated by ipyrad/stacks will ignore this value. You can use the ld_block_size parameter of the structure() constructor to change this value.

Indexing VCF to HDF5 database file VCF: 227971 SNPs; 78927 scaffolds [ ] 0% 0:00:00 | converting VCF to HDF5 This appears to be a denovo assembly, ld_block_size arg is being ignored. [######## ] 43% 0:00:34 | converting VCF to HDF5 This appears to be a denovo assembly, ld_block_size arg is being ignored. [################# ] 87% 0:01:04 | converting VCF to HDF5 This appears to be a denovo assembly, ld_block_size arg is being ignored. [####################] 100% 0:01:12 | converting VCF to HDF5 HDF5: 227971 SNPs; 78928 linkage group SNP database written to ./analysis-vcf2hdf5/subset-only-for-ipyrad.recode.snps.hdf5 Samples: 60 Sites before filtering: 227971 Filtered (indels): 0 Filtered (bi-allel): 21737 Filtered (mincov): 227271 Filtered (minmap): 220599 Filtered (subsample invariant): 35192 Filtered (minor allele frequency): 0 Filtered (combined): 229829 Sites after filtering: 509 Sites containing missing values: 508 (99.80%) Missing values in SNP matrix: 5048 (16.53%) SNPs (total): 509 SNPs (unlinked): 259

struct.mainparams.burnin = 10000 struct.mainparams.numreps = 25000

struct.run(nreps=20, kpop=[2,3,4,5], force = True, auto=True)

[###### ] 30% 0:03:06 | running 80 structure jobs

Good luck!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dereneaton/ipyrad/issues/464#issuecomment-933629612, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAEIHHIMKVJH6BBJYZRYLDUFHF65ANCNFSM5ESBD3YQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

kindofausername commented 4 months ago

Encountered the same issue due to ipyclient = ipp.Client(cluster_id="ipyrad") running in a different conda environment with a different Python version.

ipyclient was running on python 3.10 and structure was running on python 3.7.

dereneaton / ipyrad

Structure fails without explanation #464

Load the data

data = "subset-and-08maxmissing.recode.vcf"

Group into populations.

Import from csv, then convert to dictionary

Require that x% of the samples in each pop have data

init analysis object with input data and (optional) parameter options. Mincov is the proportion of SNPs shared across all samples