VUmcCGP / wisecondor

WISECONDOR (WIthin-SamplE COpy Number aberration DetectOR): Detect fetal trisomies and smaller CNV's in a maternal plasma sample using whole-genome data.
Other
44 stars 65 forks source link

Missing Penultimate Bin in countgc.py #14

Closed siongkong closed 9 years ago

siongkong commented 9 years ago

I believe that I have found a bug in countgc.py. Using two mock FASTA files, mock1.fa and mock2.fa, I can demonstrate that the GC count in the penultimate bin of each chromosome is missing. I have first noticed this problem in the STDOUT of countgc.py where the last and penultimate bin numbers are identical. Could you confirm that this is indeed a bug? In the log below you can see that I have created two mock FASTA files 'mock1.fa', 'mock2.fa' and running them through countgc.py creating output pickle files 'mock1.gcbin.pickle', 'mock2.gcbin.pickle'. In both mock cases one can notice that the nucleotides GGGA with GC count of 3.0 are not included.

$ printf ">chr1\nAATT\nGGAA\nAAAG\nGGGG\nGGGA\nA\n" > mock1.fa
$ printf ">chr1\nAATT\nGGAA\nAAAG\nGGGG\nGGGA\n" > mock2.fa

$ cat mock1.fa
>chr1
AATT
GGAA
AAAG
GGGG
GGGA
A

$ cat mock2.fa
>chr1
AATT
GGAA
AAAG
GGGG
GGGA

$ python countgc.py mock1.fa mock1.gcbin.pickle -binsize=4
$ python countgc.py mock2.fa mock2.gcbin.pickle -binsize=4

Working on:     1
        Chr: 1  Bin: 0  -       GCC: 0.0        NC: 0.0 Total: 4
        Chr: 1  Bin: 1  -       GCC: 2.0        NC: 0.0 Total: 8
        Chr: 1  Bin: 2  -       GCC: 1.0        NC: 0.0 Total: 12
        Chr: 1  Bin: 3  -       GCC: 4.0        NC: 0.0 Total: 16
        Chr: 1  Bin: 4  -       GCC: 3.0        NC: 0.0 Total: 20
        Chr: 1  Bin: 4  -       GCC: 0.0        NC: 0.0 Total: 21

$ python countgc.py mock2.fa mock2.gcbin.pickle -binsize=4

Working on:     1
        Chr: 1  Bin: 0  -       GCC: 0.0        NC: 0.0 Total: 4
        Chr: 1  Bin: 1  -       GCC: 2.0        NC: 0.0 Total: 8
        Chr: 1  Bin: 2  -       GCC: 1.0        NC: 0.0 Total: 12
        Chr: 1  Bin: 3  -       GCC: 4.0        NC: 0.0 Total: 16
        Chr: 1  Bin: 4  -       GCC: 3.0        NC: 0.0 Total: 20
        Chr: 1  Bin: 4  -       GCC: 0.0        NC: 0.0 Total: 20
>>> import pickle
>>> with open('mock1.gcbin.pickle') as fh:
...     m1 = pickle.load(fh)
...
>>> with open('mock2.gcbin.pickle') as fh:
...     m2 = pickle.load(fh)
...
>>> m1['1']
[0.0, 2.0, 1.0, 4.0, 0.0]
>>> m2['1']
[0.0, 2.0, 1.0, 4.0, 0.0]
>>>
siongkong commented 9 years ago

I have written a patch replacing line 66-106 of countgc.py in order to fix this bug:

def increment_count_dicts(binSize, totalCount, chrom, char, gcCounts, nCounts):
    """ Increment read counts in corresponding bin for a necleotide """
    # Set bin index
    binIndex = int(math.ceil((totalCount) / float(binSize)) - 1.0)
    # Initialize new bin in nCounts and gcCounts if (totalCount-1)%binSize == 0
    if (totalCount-1)%binSize == 0:
        print '\tChr: %s\tInitializing Bin: %s\tTotal: %s' % (chrom, binIndex, totalCount)
        nCounts[chrom].append(0.0)
        gcCounts[chrom].append(0.0)
    # Increment per bin GC count
    if char in 'GgCc': # Beware the softmasked fasta files...
        gcCounts[chrom][binIndex] += 1
    # Increment per bin N count
    elif char in 'Nn':
        nCounts[chrom][binIndex] += 1

f = open(filename, 'r')
for nextLine in f:
    if nextLine[0] == '>':
        splitLine = nextLine.split()
        chrom   = splitLine[0][1:]
        if chrom[:3] == 'chr':
            chrom = chrom[3:]
        print '\nWorking on:\t%s\tPrevious Total Count: %s' % (chrom, totalCount)
        totalCount = 0
    elif chrom in gcCounts.keys():
        rstripLine = nextLine.rstrip()
        for char in rstripLine:
            totalCount += 1
            increment_count_dicts(binSize, totalCount, chrom, char, gcCounts, nCounts)
f.close()

print '\nFinal Total Count: %s' % (totalCount)

Here is an example of running this patched countgc.py, called countgc.fixed.py here:

$ python countgc.fixed.py mock1.fa mock1.gcbin.fixed.pickle -binsize=4

Working on: 1   Previous Total Count: 0
    Chr: 1  Initializing Bin: 0 Total: 1
    Chr: 1  Initializing Bin: 1 Total: 5
    Chr: 1  Initializing Bin: 2 Total: 9
    Chr: 1  Initializing Bin: 3 Total: 13
    Chr: 1  Initializing Bin: 4 Total: 17
    Chr: 1  Initializing Bin: 5 Total: 21

Final Total Count: 21
(wisecondor)[siongkong@skong-c65-01 20150217_github]$ python countgc.fixed.py mock2.fa mock2.gcbin.fixed.pickle -binsize=4

Working on: 1   Previous Total Count: 0
    Chr: 1  Initializing Bin: 0 Total: 1
    Chr: 1  Initializing Bin: 1 Total: 5
    Chr: 1  Initializing Bin: 2 Total: 9
    Chr: 1  Initializing Bin: 3 Total: 13
    Chr: 1  Initializing Bin: 4 Total: 17

Final Total Count: 20

Here is how the output pickle files look like:

>>> import pickle
>>> with open('mock1.gcbin.fixed.pickle') as fh:
...     m1 = pickle.load(fh)
... 
>>> with open('mock2.gcbin.fixed.pickle') as fh:
...     m2 = pickle.load(fh)
... 
>>> 
>>> m1['1']
[0.0, 2.0, 1.0, 4.0, 3.0, 0.0]
>>> m2['1']
[0.0, 2.0, 1.0, 4.0, 3.0]
>>> 

As can be seen the correct number of bins (6 for mock1.fa and 5 for mock2.fa) as created. Also the last and penultimate bins are properly counted.

The downside of this patch is that the logging information is not as rich as before. However one can easily write a few lines of code to print out the GCC and NC information after the count dicts have been fully populated.

siongkong commented 9 years ago

Note that the patch mentioned in the comment above requires python package 'math' to be imported.

rstraver commented 9 years ago

Just to let you and other people know, I'm currently very busy with other work but this surely will be fixed in a future release.

I'm sorry for the inconvience!