dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Step 6 KeyError with pops file if sample name not in assembly #375

Closed isaacovercast closed 4 years ago

isaacovercast commented 4 years ago

This is somewhat tricky. When is the best time to validate the contents of the pop assign file? There doesn't seem to be an obviously good time to do it. It would also be good to validate not only that all the sample names in the file map to samples in the assembly, but also that all the samples in the assembly are assigned to populations. Otherwise they'll get dropped.

If you have a sample name in your pops file that doesn't exist in your assembly it looks something like this:

  Encountered an Error.
  Message: 'CdC-Lp477_rep'

  Parallel connection closed.
Traceback (most recent call last):
  File "/home/jruela/.conda/envs/env-python37/lib/python3.7/site-packages/ipyrad/core/Parallel.py", line 313, in wrap_run
    self.tool._run(ipyclient=self.ipyclient, **self.rkwargs)
  File "/home/jruela/.conda/envs/env-python37/lib/python3.7/site-packages/ipyrad/core/assembly.py", line 686, in _run
    stepdict[step](self, force, ipyclient).run()
  File "/home/jruela/.conda/envs/env-python37/lib/python3.7/site-packages/ipyrad/assemble/clustmap_across.py", line 44, in __init__
    self.assign_groups()
  File "/home/jruela/.conda/envs/env-python37/lib/python3.7/site-packages/ipyrad/assemble/clustmap_across.py", line 139, in assign_groups
    self.cgroups[idx] = [self.data.samples[x] for x in val[1]]
  File "/home/jruela/.conda/envs/env-python37/lib/python3.7/site-packages/ipyrad/assemble/clustmap_across.py", line 139, in <listcomp>
    self.cgroups[idx] = [self.data.samples[x] for x in val[1]]
KeyError: 'CdC-Lp477_rep'

isaacovercast commented 4 years ago

Ok, the ipyrad/core/assembly.py _link_populations() function now checks the samples in the pop_assign_file against the samples in the assembly and complains if they are different:

ipyrad.assemble.utils.IPyradError: 
    The sample names in the assembly disagree with sample names in the
    pop_assign_file. Sample names in the pop_assign_file must exactly match
    sample names in the assembly, and you must specify a population for each
    sample in the assembly.

    Names in the pop_assign_file that do not appear in the assembly:
        ['3M_0']

    Samples in the assembly that are not specified in the pop_assign_file:
        ['1A_0']
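A check of this kind can be sketched with two set differences. This is only an illustration, not the actual `_link_populations()` code: `parse_pops` and `compare_names` are hypothetical helpers, and the pops file is assumed to hold whitespace-separated `sample population` pairs, with `#` lines (assumed to carry per-population minimums) ignored.

```python
def parse_pops(text):
    """Map sample name -> population from pops-file text (assumed format)."""
    assignments = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' lines (assumed to hold per-population minimums).
        if not line or line.startswith("#"):
            continue
        sample, pop = line.split()[:2]
        assignments[sample] = pop
    return assignments


def compare_names(pops_samples, assembly_samples):
    """Return (names only in the pops file, names only in the assembly)."""
    pops_samples, assembly_samples = set(pops_samples), set(assembly_samples)
    return (sorted(pops_samples - assembly_samples),
            sorted(assembly_samples - pops_samples))


# Toy data reusing the two names from the error message above.
pops_text = """
1B_0 pop1
3M_0 pop2
# pop1:1 pop2:1
"""
assembly_samples = ["1A_0", "1B_0"]

extra, missing = compare_names(parse_pops(pops_text), assembly_samples)
print("Not in the assembly:", extra)
print("Not in the pop_assign_file:", missing)
```

Running a comparison like this on your own files before step 6 shows both kinds of mismatch at once.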

The consequence is that if you branch and remove samples, you now need to create a new pops file for each branch. I think this is okay because it forces you to be explicit, whereas before this change samples could be silently dropped from the final output if they were mislabeled in the pop_assign_file, which is bad. We can't simultaneously allow sample names to be missing from the pop_assign_file (which would permit branching with the same pops file) and guarantee that samples aren't dropped because of a misspelled sample name. Better to err on the side of caution and have an informative error message.
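One way to produce a per-branch pops file is to filter the original down to the samples kept in the branch. The sketch below is a hypothetical helper, not part of ipyrad; it assumes whitespace-separated `sample population` lines, uses made-up sample names, and copies any `#` lines through unchanged, so those should be reviewed by hand for the new branch.

```python
def subset_pops(text, keep):
    """Return pops-file text restricted to the sample names in `keep`."""
    keep = set(keep)
    kept_lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # '#' lines are copied unchanged; review them by hand for the branch.
        if stripped.startswith("#") or stripped.split()[0] in keep:
            kept_lines.append(stripped)
    return "\n".join(kept_lines) + "\n"


# Made-up example: the branch keeps two of the three original samples.
full = "1A_0 pop1\n1B_0 pop1\n2E_0 pop2\n# pop1:2 pop2:1\n"
branch_samples = ["1A_0", "2E_0"]
branch_pops = subset_pops(full, branch_samples)
print(branch_pops, end="")
```

The result can be written to a new file and set as the pop_assign_file in the branch's params file.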

magdalenengeve commented 4 years ago

Hi, I am trying to use ipyrad for the first time and I am running into some issues. I have multiple libraries, originally sequenced on 4 lanes, but I concatenated the lanes together so I have just 1 file for each library, about 27 libraries in total. I followed the procedure here: https://github.com/dereneaton/ipyrad/blob/master/docs/tutorial-combining-data.rst

I got all my libraries demultiplexed independently and then successfully merged them so that I can run steps 2, 3, 4, 5, 6, and 7 on all individuals at the same time (about 3100 individuals in total). However, when I try to run the steps after demultiplexing and merging, it fails with the following error:

  loading Assembly: all
  from saved path: /lustre/mngeve/rapturedata/CATDATA/ipyrad/all.json
ipyrad.assemble.utils.IPyradError:
    The sample names in the assembly disagree with sample names in the
    pop_assign_file. Sample names in the pop_assign_file must exactly match
    sample names in the assembly, and you must specify a population for each
    sample in the assembly.

    Names in the pop_assign_file that do not appear in the assembly:
        []

    Samples in the assembly that are not specified in the pop_assign_file:
        [all my sample names]

So basically it is saying that none of my samples are in the pop assignment file, which is totally wrong. I used the same file to make my barcode file as well as my pop assignment files for each library, and I am certain that sample names match exactly in my pop assignment file and in my barcode file. So I cannot understand this error at all. Please, could you help me out? What am I doing wrong?

isaacovercast commented 4 years ago

Can you post your files? or like copy/paste the first 5-10 lines of each file?

isaacovercast commented 4 years ago

Also, can you post the exact command you ran and your params file?

magdalenengeve commented 4 years ago

Oh, sorry, I figured everything out. I thought the merging step would create a new pop assignment file; instead, the new merged params file pointed to the pop assignment file of one single library (x) of my 36 libraries. The merged.json file also had that same pop assignment file of library x as its pop assignment file. I don't know if this is what happens for everyone. Anyway, I (1) made a merged pop assignment file, (2) edited my merged.json and params_merged files, and (3) repeated the run, and it didn't return any error. I hope this helps someone. I wish the -m flag would merge the pop assignment files of all libraries into a new one for the new params file and .json file it creates, to make life easier. Thanks!
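For anyone in the same situation, building the merged pop assignment file by hand might look like the sketch below. It is not something the -m flag does, and `merge_pops` and `format_pops` are hypothetical helpers. It assumes each per-library file holds whitespace-separated `sample population` lines; `#` lines are dropped, so any per-population minimums would have to be added back manually. The sample and population names are made up.

```python
def merge_pops(texts):
    """Merge several pops-file texts into one sample -> population mapping."""
    merged = {}
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            # '#' lines are dropped here; re-add any minimums by hand afterwards.
            if not line or line.startswith("#"):
                continue
            sample, pop = line.split()[:2]
            # Refuse to guess if two libraries disagree about a sample.
            if merged.get(sample, pop) != pop:
                raise ValueError(
                    f"{sample} assigned to both {merged[sample]} and {pop}")
            merged[sample] = pop
    return merged


def format_pops(assignments):
    """Render the mapping back into 'sample population' lines."""
    return "".join(f"{s} {p}\n" for s, p in assignments.items())


# Made-up contents of two per-library pops files.
lib1 = "a1 north\na2 north\n# north:1\n"
lib2 = "b1 south\n# south:1\n"
merged = merge_pops([lib1, lib2])
print(format_pops(merged), end="")
```

The merged text can then be saved to a new file and referenced from the merged params file, as in steps (1) to (3) above.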