Closed siongkong closed 9 years ago
Thanks for finding this and reporting it. I'm committing a fix for these issues right now.
I fixed the first issue by adding a . to the end of the default value, this circumvents the integer interpretation in argparse:
-parser.add_argument('-binsize', type=float, default=1000000,
+parser.add_argument('-binsize', type=float, default=1000000.,
For the second issue, I added the suggested line at L88:
(if (start + totalCount) % binSize != 0:)
Let me know if this update did indeed fix these bugs (and not introduce new funny behaviour again)
I have found two issues with countgc.py, and this is after the commit in https://github.com/rstraver/wisecondor/issues/14.
The First Issue: Non-integer Type Default Bin Size
Default binsize value of 1000000 is of integer type, not float type. argparse (at least the version 1.2.1 that I am testing) does not convert the int type of a default value into float type. This causes the last bin of a chromosome to be assigned with the wrong bin number computed in L74. argparse however does convert binsize given by a user in the form of an input argument in integer type into float type. This issue can be avoided by a user if the same default value is explicitly given as an input argument when calling countgc.py.
Theoretically this issue affects all but the last entry in a FASTA file, and it also only affects last bin of affected entries when the last bin does not cover the same number of bases as the binsize. In actual use case this means that all relevant chromosomes of interest are affected since there are more entries in a FASTA file after the chromosomes of interest and also there is a one in a million chance for the last bin of a chromosome to have exactly one million bases.
This issue can be demonstrated by the following mock test, which shows that when the default binsize is used the program crashed. When the same default binsize is given as an input argument, and converted by argparse into float type, the program works properly:
The Second Issue: Repeated Calling of finishBin Function
finishBin called in L88 when reading the header of a new chromosome should only be called under the same condition (
if (start + totalCount) % binSize != 0:
) as finishBin is called in L108 after the end of the reference FASTA file. Otherwise when the last bin from the previous chromosome is already full then the GC content of this bin will be set as zero.In a mock case below I have created two mock chromosomes and each chromosome has the same sequence. For each chromosome there should be 0,2,1,4,3 GC counts in five bins of binsize 4 constructed. However the first chromosome reported only 0,2,1,4,0 GC counts since finishBin was called on the last bin of chr1 twice leading it to be blanked in the second finishBin call.
Resolution
If a user explicitly specify
-binsize=1000000
as an input argument when running countgc.py on an actual hg19 FASTA file, say GRCh37_UCSC.fa (the one that I am using), then the first issue is avoided and the second issue only has one in a million chance of happening since there has to be exactly one million bases covered by the last bin in a chromosome for the second issue to manifest.