Roth-Lab / pyclone-vi

Fast method for inferring cancer clonal population structure from SNV data.
GNU General Public License v3.0

Odd errors in larger data sets? #12

Open vortexing opened 3 years ago

vortexing commented 3 years ago

We've been trying out pyclone-vi on our data and we're seeing odd behavior: it works just fine when we put in 10-20 variants per sample, but once we put in the full list of 300-400 mutations, it fails. We're continuing to troubleshoot to see whether it's somehow our HPC or software install environment, but on the off chance this looks familiar to you, I thought I'd post the error.

The input is data from one sample at a time, in the right format, but our datasets have no tumour content column or error rate column. When the script runs, stdout contains only "Tumour content column not found. Setting values to 1.0.", so we know the input is being read in up to that point. Then we see this (again, only when we do not truncate our input data set to a small number of variants):

Traceback (most recent call last):
  File "/opt/conda/bin/pyclone-vi", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pyclone_vi/cli.py", line 113, in fit
    pyclone_vi.run.fit(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pyclone_vi/run.py", line 29, in fit
    log_p_data, mutations, samples = load_data(in_file, density, num_grid_points, precision=precision)
  File "/opt/conda/lib/python3.8/site-packages/pyclone_vi/data.py", line 11, in load_data
    data, mutations, samples = load_pyclone_data(file_name)
  File "/opt/conda/lib/python3.8/site-packages/pyclone_vi/data.py", line 78, in load_pyclone_data
    cn, mu, log_pi = cn_priors[(
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/generic.py", line 1668, in __hash__
    raise TypeError(
TypeError: 'Series' objects are mutable, thus they cannot be hashed

Any gems? Could we have some sort of file parsing issue for a particular variant name (are there certain characters we can't use in a variant ID)? I feel like this is something silly but can't put my finger on it.

vortexing commented 3 years ago

Oh. My. Gosh. Just FYI, your code breaks if there is a duplicate mutation_id in a sample's dataset. It doesn't fix or skip the duplicate; it just fails with the TypeError above. SUPER minor, but for ease of use, perhaps add a quick filter for uniqueness OR a mention in the docs. ;) I KNEW it felt like something stupid... and it was...
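
For anyone hitting the same traceback, here is a minimal sketch of a pre-flight check that looks for repeated mutation_id values within a sample before the file is handed to pyclone-vi. It is not part of pyclone-vi. It assumes the input is a table with mutation_id and sample_id columns; the helper names, the toy rows, and the choice to keep the first occurrence of each duplicate are all illustrative assumptions. The error itself is consistent with a lookup on a non-unique label returning a pandas Series instead of a single value, but that is an inference from the traceback and not something checked against the pyclone-vi source.

```python
import pandas as pd

# Assumed key columns: one row per mutation per sample.
KEYS = ["mutation_id", "sample_id"]


def find_duplicate_mutations(df, keys=KEYS):
    """Return every row whose (mutation_id, sample_id) pair occurs more than once."""
    return df[df.duplicated(subset=keys, keep=False)]


def drop_duplicate_mutations(df, keys=KEYS):
    """Keep the first row for each (mutation_id, sample_id) pair (an arbitrary choice)."""
    return df.drop_duplicates(subset=keys, keep="first").reset_index(drop=True)


# Toy data standing in for a real input file (hypothetical values).
df = pd.DataFrame(
    {
        "mutation_id": ["chr1:100", "chr1:100", "chr2:200"],
        "sample_id": ["S1", "S1", "S1"],
        "ref_counts": [30, 31, 50],
        "alt_counts": [10, 9, 20],
    }
)

dupes = find_duplicate_mutations(df)
if not dupes.empty:
    print("Duplicate mutation_id rows found:")
    print(dupes)

clean = drop_duplicate_mutations(df)
```

With a real file, the same two helpers could be applied to a frame loaded with pd.read_csv(path, sep="\t") and the result written back out with to_csv(..., sep="\t", index=False). Whether to keep the first duplicate, merge counts, or drop both rows depends on why the duplicates exist, so inspect the output of find_duplicate_mutations before picking one.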