Closed siongkong closed 9 years ago
I have written a patch replacing line 66-106 of countgc.py in order to fix this bug:
def increment_count_dicts(binSize, totalCount, chrom, char, gcCounts, nCounts):
""" Increment read counts in corresponding bin for a necleotide """
# Set bin index
binIndex = int(math.ceil((totalCount) / float(binSize)) - 1.0)
# Initialize new bin in nCounts and gcCounts if (totalCount-1)%binSize == 0
if (totalCount-1)%binSize == 0:
print '\tChr: %s\tInitializing Bin: %s\tTotal: %s' % (chrom, binIndex, totalCount)
nCounts[chrom].append(0.0)
gcCounts[chrom].append(0.0)
# Increment per bin GC count
if char in 'GgCc': # Beware the softmasked fasta files...
gcCounts[chrom][binIndex] += 1
# Increment per bin N count
elif char in 'Nn':
nCounts[chrom][binIndex] += 1
f = open(filename, 'r')
for nextLine in f:
if nextLine[0] == '>':
splitLine = nextLine.split()
chrom = splitLine[0][1:]
if chrom[:3] == 'chr':
chrom = chrom[3:]
print '\nWorking on:\t%s\tPrevious Total Count: %s' % (chrom, totalCount)
totalCount = 0
elif chrom in gcCounts.keys():
rstripLine = nextLine.rstrip()
for char in rstripLine:
totalCount += 1
increment_count_dicts(binSize, totalCount, chrom, char, gcCounts, nCounts)
f.close()
print '\nFinal Total Count: %s' % (totalCount)
Here is an example of running this patched countgc.py, called countgc.fixed.py here:
$ python countgc.fixed.py mock1.fa mock1.gcbin.fixed.pickle -binsize=4
Working on: 1 Previous Total Count: 0
Chr: 1 Initializing Bin: 0 Total: 1
Chr: 1 Initializing Bin: 1 Total: 5
Chr: 1 Initializing Bin: 2 Total: 9
Chr: 1 Initializing Bin: 3 Total: 13
Chr: 1 Initializing Bin: 4 Total: 17
Chr: 1 Initializing Bin: 5 Total: 21
Final Total Count: 21
(wisecondor)[siongkong@skong-c65-01 20150217_github]$ python countgc.fixed.py mock2.fa mock2.gcbin.fixed.pickle -binsize=4
Working on: 1 Previous Total Count: 0
Chr: 1 Initializing Bin: 0 Total: 1
Chr: 1 Initializing Bin: 1 Total: 5
Chr: 1 Initializing Bin: 2 Total: 9
Chr: 1 Initializing Bin: 3 Total: 13
Chr: 1 Initializing Bin: 4 Total: 17
Final Total Count: 20
Here is how the output pickle files look like:
>>> import pickle
>>> with open('mock1.gcbin.fixed.pickle') as fh:
... m1 = pickle.load(fh)
...
>>> with open('mock2.gcbin.fixed.pickle') as fh:
... m2 = pickle.load(fh)
...
>>>
>>> m1['1']
[0.0, 2.0, 1.0, 4.0, 3.0, 0.0]
>>> m2['1']
[0.0, 2.0, 1.0, 4.0, 3.0]
>>>
As can be seen the correct number of bins (6 for mock1.fa and 5 for mock2.fa) as created. Also the last and penultimate bins are properly counted.
The downside of this patch is that the logging information is not as rich as before. However one can easily write a few lines of code to print out the GCC and NC information after the count dicts have been fully populated.
Note that the patch mentioned in the comment above requires python package 'math' to be imported.
Just to let you and other people know, I'm currently very busy with other work but this surely will be fixed in a future release.
I'm sorry for the inconvience!
I believe that I have found a bug in countgc.py. Using two mock FASTA files, mock1.fa and mock2.fa, I can demonstrate that the GC count in the penultimate bin of each chromosome is missing. I have first noticed this problem in the STDOUT of countgc.py where the last and penultimate bin numbers are identical. Could you confirm that this is indeed a bug? In the log below you can see that I have created two mock FASTA files 'mock1.fa', 'mock2.fa' and running them through countgc.py creating output pickle files 'mock1.gcbin.pickle', 'mock2.gcbin.pickle'. In both mock cases one can notice that the nucleotides GGGA with GC count of 3.0 are not included.