getzlab / SignatureAnalyzer

Updated SignatureAnalyzer-GPU with mutational spectra & RNA expression compatibility.
MIT License

Forcing "chr" prefix on chromosomes causes reference-MAF mismatches #14

Closed edawson closed 4 years ago

edawson commented 4 years ago

Hi,

With both the latest commit and the latest pip release, I keep getting a KeyError:

signatureanalyzer -t maf -n 4 --hg_build /aztlan/refs/Homo_sapiens_assembly19.2bit --cosmic cosmic3_ID rebc356_thca48.16MAR2020.maf
Unable to init server: Could not connect: Connection refused
Unable to init server: Could not connect: Connection refused

(signatureanalyzer:24793): Gdk-CRITICAL **: 10:34:16.643: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
/home/eric/.local/lib/python3.6/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.neighbors.base module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.
  warnings.warn(message, FutureWarning)
---------------------------------------------------------
---------- S I G N A T U R E  A N A L Y Z E R  ----------
---------------------------------------------------------
   * Using Homo_sapiens_assembly19 build
   * Using cosmic3_ID signatures
   * Loading spectra from rebc356_thca48.16MAR2020.maf
sys:1: DtypeWarning: Columns (78,79,83,87,88,92,93,94,100,102,103,108,110,114,115,116,118,120,135,136,137,138,139,140,141,142,143,144,149,160,161,162,163,168,177) have mixed types.Specify dtype option on import or set low_memory=False.
      * Mapping contexts: 0 / 23933Traceback (most recent call last):
  File "/home/eric/.local/bin/signatureanalyzer", line 11, in <module>
    load_entry_point('signatureanalyzer', 'console_scripts', 'signatureanalyzer')()
  File "/home/eric/getzlab-SignatureAnalyzer/signatureanalyzer/__main__.py", line 181, in main
    **vars(args)
  File "/home/eric/getzlab-SignatureAnalyzer/signatureanalyzer/signatureanalyzer.py", line 87, in run_maf
    cosmic=cosmic
  File "/home/eric/getzlab-SignatureAnalyzer/signatureanalyzer/spectra.py", line 194, in get_spectra_from_maf
    _context = hg[chromosome][pos - 1 + del_len:pos - 1 + del_len * 6].upper()
KeyError: 'chr1'

(Please excuse the X server errors. I don't know why it's complaining about missing X forwarding, but it doesn't seem to affect anything.)

I traced it down to this line: https://github.com/broadinstitute/getzlab-SignatureAnalyzer/blob/bdab04164fa27b6bd9fbf3b344e8b3772b7fe2ef/signatureanalyzer/spectra.py#L98

It looks like the "chr" prefix gets prepended to the chromosome names from the MAF but not to those in the reference 2bit file, so the lookup fails. I was able to run successfully after removing this line and an identical one later in the file.

I'm happy to make a PR to remove these lines, but I assume they were put there for a purpose. Maybe because they need to be coerced to strings and numpy is trying to interpret them as ints?

Generally I try to avoid enforcing prefixes with references/VCFs/MAFs because it's easy to get mismatches like this. Every center has their own way of resolving them, and in enforcing one way or the other I have usually created more problems than I started with.

I assume this would also lead to the MAF or spectra output having the "chr" prefix, or are they chomped off before final output?

jcha40 commented 4 years ago

Hello,

get_spectra_from_maf should now check the hg build file for the "chr" prefix and only attach it if necessary.
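
For readers hitting the same mismatch, here is a minimal sketch of the general idea: look at which chromosome names the reference actually contains and adapt the MAF value to that convention instead of forcing a prefix. This is not the code in spectra.py; the helper name match_reference_name and the example name sets are hypothetical, and the reference names are assumed to be available as a plain set of strings.

```python
def match_reference_name(chromosome, reference_names):
    """Return the spelling of `chromosome` that the reference uses, or None.

    `reference_names` is assumed to be a set of sequence names taken from
    the reference (for example, the keys of an opened 2bit file).
    """
    # MAF chromosome columns are often parsed as ints, so coerce to str first.
    name = str(chromosome)
    if name in reference_names:
        return name
    # Otherwise try the opposite "chr" prefix convention.
    alternate = name[3:] if name.startswith("chr") else "chr" + name
    if alternate in reference_names:
        return alternate
    return None


# Hypothetical name sets illustrating the two common conventions.
unprefixed_reference = {"1", "2", "X", "Y", "MT"}
prefixed_reference = {"chr1", "chr2", "chrX", "chrY", "chrM"}

print(match_reference_name(1, unprefixed_reference))
print(match_reference_name("chr1", unprefixed_reference))
print(match_reference_name("X", prefixed_reference))
```

Returning None rather than raising lets the caller decide whether to skip or report variants on contigs the reference lacks. Note that this sketch only toggles the "chr" prefix; it does not reconcile other naming differences such as "MT" versus "chrM".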