adoebley / Griffin

A flexible framework for nucleosome profiling of cell-free DNA

griffin_GC_counts.py fails on non ATCGN nucleotides in the reference #3

Closed: willhooper closed this issue 2 years ago

willhooper commented 2 years ago

Hi,

When running griffin_GC_counts.py on a GRCh38-aligned BAM, I encountered the following error:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/nfs/sw/snakemake/snakemake-5.4.5/python/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/nfs/sw/snakemake/snakemake-5.4.5/python/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/gpfs/commons/home/whooper/software/Griffin/scripts/griffin_GC_counts.py", line 198, in collect_reads
    fragment_seq = fragment_seq.astype(int)
ValueError: invalid literal for int() with base 10: 'M'
"""

Looking at the code, it seems that the script only expects A, T, C, G, and N, and doesn't handle the other IUPAC ambiguity codes (such as M) that appear in the reference: https://www.bioinformatics.org/sms/iupac.html
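For illustration, here is a minimal sketch of one way to tolerate any IUPAC code when counting GC content. It is not the code from griffin_GC_counts.py and not the fix that was later released; the names `GC_CODE`, `AMBIGUOUS`, `encode_gc`, and `gc_summary` are hypothetical. The assumption is that every base other than A, T, C, or G (including N, M, R, and so on) can be treated as ambiguous and left out of the count.

```python
# Hypothetical helpers, not taken from griffin_GC_counts.py.
# Assumption: anything other than A/T/C/G is ambiguous and is excluded.
GC_CODE = {"A": 0, "T": 0, "C": 1, "G": 1}
AMBIGUOUS = -1


def encode_gc(seq):
    """Map each base to 1 (G/C), 0 (A/T), or -1 (N or any other IUPAC code)."""
    return [GC_CODE.get(base, AMBIGUOUS) for base in seq.upper()]


def gc_summary(seq):
    """Return (gc_count, unambiguous_base_count) for a reference fragment."""
    valid = [code for code in encode_gc(seq) if code != AMBIGUOUS]
    return sum(valid), len(valid)


if __name__ == "__main__":
    # 'M', 'R', and 'N' are skipped instead of raising a ValueError.
    print(gc_summary("ACGTMRN"))  # (2, 4)
```

Because unknown characters fall through to a default value, no symbol is left behind for a later integer conversion to fail on. Whether a fragment containing ambiguous bases should be counted with those bases skipped, or discarded entirely, is a separate design choice that this sketch does not settle.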

Thanks, Will

adoebley commented 2 years ago

Hi Will,

Thanks for letting me know about this issue. I encountered the same problem recently and implemented a fix in the updated version that I'm working on. I hope to have the GitHub repository updated to the newer version within the next month or so.

Best, Anna-Lisa

adoebley commented 2 years ago

Hi Will,

The GitHub repository is now updated and this issue should be fixed. Let me know if you continue to encounter problems.

-Anna-Lisa