getzlab / deTiN

DeTiN is designed to measure tumor-in-normal contamination and improve somatic variant detection sensitivity when using a contaminated matched control.
BSD 3-Clause "New" or "Revised" License

Missing required input fields #15

Closed erleholgersen closed 6 years ago

erleholgersen commented 6 years ago

Hi again Amaro,

Thanks for all your help so far! I've now successfully run deTiN, but I ran into a few errors with missing input fields (not mentioned on the Wiki) that I figured I'd report here.

First, I got an error from the mutation statistics file:

Error reading call stats skipping first two rows and trying again
Traceback (most recent call last):
  File "/scratch/DBC/BCRBIOIN/SHARED/software/deTiN/20180816/deTiN/deTiN.py", line 588, in <module>
    main()
  File "/scratch/DBC/BCRBIOIN/SHARED/software/deTiN/20180816/deTiN/deTiN.py", line 518, in main
    di.read_and_preprocess_data()
  File "/scratch/DBC/BCRBIOIN/SHARED/software/deTiN/20180816/deTiN/deTiN.py", line 216, in read_and_preprocess_data
    self.read_and_preprocess_SSNVs()
  File "/scratch/DBC/BCRBIOIN/SHARED/software/deTiN/20180816/deTiN/deTiN.py", line 196, in read_and_preprocess_SSNVs
    self.read_call_stats_file()
  File "/scratch/DBC/BCRBIOIN/SHARED/software/deTiN/20180816/deTiN/deTiN.py", line 111, in read_call_stats_file
    comment='#', skiprows=2, usecols=fields, dtype=fields_type)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1749, in __init__
    _validate_usecols_names(usecols, self.orig_names)
  File "/home/breakthr/eholgersen/.local/lib/python2.7/site-packages/pandas-0.23.4-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1134, in _validate_usecols_names
    "columns expected but not found: {missing}".format(missing=missing)
ValueError: Usecols do not match columns, columns expected but not found: ['alt_allele', 't_ref_sum', 'n_alt_count', 'tumor_name', 'normal_name', 'n_ref_count', 'judgement', 't_alt_sum', 't_alt_count', 'position', 'contig', 'ref_allele', 't_ref_count', 'failure_reasons']

Adding dummy columns t_ref_sum and t_alt_sum fixed this issue. I used MuTect2 rather than MuTect to call variants, and thus had to assemble my own input file rather than using a pre-made call_stats file.
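For anyone assembling a call stats file by hand, the workaround above can be sketched roughly as follows. This is a minimal illustration, not deTiN code: the helper name `add_placeholder_columns`, the fill value of 0, and the example rows are all assumptions. Only the column names `t_ref_sum` and `t_alt_sum` come from the error message.

```python
import io
import pandas as pd

# Hypothetical hand-assembled call stats table (values are made up).
raw = io.StringIO(
    "contig\tposition\tt_alt_count\tt_ref_count\n"
    "1\t12345\t8\t30\n"
    "2\t67890\t5\t41\n"
)
call_stats = pd.read_csv(raw, sep="\t")


def add_placeholder_columns(df, names=("t_ref_sum", "t_alt_sum"), fill=0):
    """Return a copy of df with each missing column in names added.

    The fill value is an assumption; the thread only says "dummy columns".
    """
    out = df.copy()
    for name in names:
        if name not in out.columns:
            out[name] = fill
    return out


patched = add_placeholder_columns(call_stats)
print(list(patched.columns))
```

Columns that already exist are left untouched, so the helper is safe to run on a file that has only one of the two fields.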

The other error I got was from the aSCNA segmentation file:

changing header of seg file from Start to Start.bp
changing header of seg file from End to End.bp
missing required header n_probes and could not replace with any one of alternates

I fixed this by adding a column n_probes to the input, set equal to Num_SNPs from the Allelic CNV output (I wasn't sure if I should use Num_SNPs or Num_Targets?)

Thanks again!

amarotaylor commented 6 years ago

Hey Erle,

Sorry for the errors. I will add your headers and fix the wiki to reflect the code. For clarification, Num_Targets is the correct equivalent to n_probes. I just pushed a fix that removes the requirement for the t_ref_sum and t_alt_sum columns and automatically fixes the header of the seg file.
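On a deTiN version that predates this fix, the n_probes column could be added by hand along these lines. This is a sketch under assumptions: the `Chromosome` column and the example rows are hypothetical, while `Start`, `End`, `Num_Targets`, and `n_probes` are the names mentioned in this thread.

```python
import io
import pandas as pd

# Hypothetical segmentation table; only Start, End and Num_Targets
# are column names taken from the thread, the rows are made up.
raw = io.StringIO(
    "Chromosome\tStart\tEnd\tNum_Targets\n"
    "1\t1000\t5000\t12\n"
    "1\t6000\t9000\t7\n"
)
seg = pd.read_csv(raw, sep="\t")

# Copy Num_Targets into the header deTiN asked for, keeping the original column.
seg["n_probes"] = seg["Num_Targets"]
print(seg)
```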

Thanks for pointing these out!

Best,
Amaro

Diogopell commented 5 years ago

Hi, I was getting stuck on the same problem. According to the "Description of inputs" page on the wiki, those columns shouldn't be required, and I got the impression that they aren't used in the code at all.

I edited one line in the "read_call_stats_file" function in deTiN/deTiN.py, as below:

    def read_call_stats_file(self):
        # In 'fields' I removed several field requirements;
        # they weren't on the wiki's "Description of inputs" page and apparently weren't actually being used.
        fields = ['contig', 'position',
                  't_alt_count', 't_ref_count', 'n_alt_count', 'n_ref_count', 'failure_reasons', 'judgement']
        # ... continues as normal

It appears to fix those problems for good.

amarotaylor commented 5 years ago

Hi, it seems that, compared with the most recent version, you just removed the alt/ref alleles and the tumor and normal sample names? You're right, these aren't required by deTiN, so that should work just fine. I just include them because they are useful later on.

    fields = ['contig', 'position', 'ref_allele', 'alt_allele', 'tumor_name', 'normal_name',
              't_alt_count', 't_ref_count', 'n_alt_count', 'n_ref_count', 'failure_reasons', 'judgement']
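One way to keep the useful-but-optional columns without failing when they are absent is to pass a callable to `usecols` in `pandas.read_csv`. The sketch below is not how deTiN does it: the function name `read_call_stats` and the example rows are hypothetical, and the required/optional split only follows what is said in this thread.

```python
import io
import pandas as pd

# Split taken from the discussion above: the first list is required,
# the second is only kept because it is useful later on.
REQUIRED = ['contig', 'position', 't_alt_count', 't_ref_count',
            'n_alt_count', 'n_ref_count', 'failure_reasons', 'judgement']
OPTIONAL = ['ref_allele', 'alt_allele', 'tumor_name', 'normal_name']


def read_call_stats(handle):
    """Read a tab-separated call stats table, tolerating missing optional columns."""
    wanted = set(REQUIRED) | set(OPTIONAL)
    # A callable usecols keeps only the wanted columns that are present,
    # so absent optional columns do not raise the "Usecols do not match" error.
    df = pd.read_csv(handle, sep='\t', comment='#',
                     usecols=lambda name: name in wanted)
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError('call stats file is missing required columns: %s' % missing)
    return df


# Hypothetical input: all required columns, no optional ones, one extra column.
example = (
    "contig\tposition\tt_alt_count\tt_ref_count\tn_alt_count\tn_ref_count"
    "\tfailure_reasons\tjudgement\textra\n"
    "1\t100\t8\t30\t0\t25\t\tKEEP\tx\n"
)
df = read_call_stats(io.StringIO(example))
print(list(df.columns))
```

The explicit check after reading still reports required columns that are missing, so a malformed file fails with a readable message.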