VUmcCGP / wisecondor

WISECONDOR (WIthin-SamplE COpy Number aberration DetectOR): Detect fetal trisomies and smaller CNV's in a maternal plasma sample using whole-genome data.
Other
44 stars 65 forks source link

Unexpected binning in countgc.py #15

Closed siongkong closed 9 years ago

siongkong commented 9 years ago

I have found two issues with countgc.py, and this is after the commit in https://github.com/rstraver/wisecondor/issues/14.

The First Issue: Non-integer Type Default Bin Size

Default binsize value of 1000000 is of integer type, not float type. argparse (at least the version 1.2.1 that I am testing) does not convert the int type of a default value into float type. This causes the last bin of a chromosome to be assigned with the wrong bin number computed in L74. argparse however does convert binsize given by a user in the form of an input argument in integer type into float type. This issue can be avoided by a user if the same default value is explicitly given as an input argument when calling countgc.py.

Theoretically this issue affects all but the last entry in a FASTA file, and it also only affects last bin of affected entries when the last bin does not cover the same number of bases as the binsize. In actual use case this means that all relevant chromosomes of interest are affected since there are more entries in a FASTA file after the chromosomes of interest and also there is a one in a million chance for the last bin of a chromosome to have exactly one million bases.

This issue can be demonstrated by the following mock test, which shows that when the default binsize is used the program crashed. When the same default binsize is given as an input argument, and converted by argparse into float type, the program works properly:

$ printf ">chr1\nAAAAAAAAAAATTTTTTTGGGGGGGGGCCCCCCCCC\n" > mock1.fa
>chr1
AAAAAAAAAAATTTTTTTGGGGGGGGGCCCCCCCCC

# Not specifying binsize, using the default value and type
$ python ../wisecondor/countgc.py mock1.fa mock1.gcbin.pickle

Working on: 1
Traceback (most recent call last):
  File "../wisecondor/countgc.py", line 108, in <module>
    finishBin()
  File "../wisecondor/countgc.py", line 78, in finishBin
    gcCounts[key][binNumber] = gcCount
IndexError: list assignment index out of range

# Specifying exactly the same default binsize but argparse will convert it into float type
$ python ../wisecondor/countgc.py mock1.fa mock1.gcbin.pickle -binsize=1000000

Working on: 1
    Chr: 1  Bin: 0  -   GCC: 18.0   NC: 0.0 Total: 36

The Second Issue: Repeated Calling of finishBin Function

finishBin called in L88 when reading the header of a new chromosome should only be called under the same condition (if (start + totalCount) % binSize != 0:) as finishBin is called in L108 after the end of the reference FASTA file. Otherwise when the last bin from the previous chromosome is already full then the GC content of this bin will be set as zero.

In a mock case below I have created two mock chromosomes and each chromosome has the same sequence. For each chromosome there should be 0,2,1,4,3 GC counts in five bins of binsize 4 constructed. However the first chromosome reported only 0,2,1,4,0 GC counts since finishBin was called on the last bin of chr1 twice leading it to be blanked in the second finishBin call.

printf ">chr1\nAATT\nGGAA\nAAAG\nGGGG\nGGGA\n>chr2\nAATT\nGGAA\nAAAG\nGGGG\nGGGA\n" > mock2.fa

cat mock2.fa
>chr1
AATT
GGAA
AAAG
GGGG
GGGA
>chr2
AATT
GGAA
AAAG
GGGG
GGGA

python ../wisecondor/countgc.py mock2.fa mock2.gcbin.pickle -binsize=4
Working on: 1
    Chr: 1  Bin: 0  -   GCC: 0.0    NC: 0.0 Total: 4
    Chr: 1  Bin: 1  -   GCC: 2.0    NC: 0.0 Total: 8
    Chr: 1  Bin: 2  -   GCC: 1.0    NC: 0.0 Total: 12
    Chr: 1  Bin: 3  -   GCC: 4.0    NC: 0.0 Total: 16
    Chr: 1  Bin: 4  -   GCC: 3.0    NC: 0.0 Total: 20
    Chr: 1  Bin: 4  -   GCC: 0.0    NC: 0.0 Total: 20

Working on: 2
    Chr: 2  Bin: 0  -   GCC: 0.0    NC: 0.0 Total: 4
    Chr: 2  Bin: 1  -   GCC: 2.0    NC: 0.0 Total: 8
    Chr: 2  Bin: 2  -   GCC: 1.0    NC: 0.0 Total: 12
    Chr: 2  Bin: 3  -   GCC: 4.0    NC: 0.0 Total: 16
    Chr: 2  Bin: 4  -   GCC: 3.0    NC: 0.0 Total: 20
>>> import pickle
>>> with open('./mock2.gcbin.pickle') as fh:
...     data = pickle.load(fh)
... 
>>> data['1']
[0.0, 2.0, 1.0, 4.0, 0.0]
>>> data['2']
[0.0, 2.0, 1.0, 4.0, 3.0]

Resolution

If a user explicitly specify -binsize=1000000 as an input argument when running countgc.py on an actual hg19 FASTA file, say GRCh37_UCSC.fa (the one that I am using), then the first issue is avoided and the second issue only has one in a million chance of happening since there has to be exactly one million bases covered by the last bin in a chromosome for the second issue to manifest.

rstraver commented 9 years ago

Thanks for finding this and reporting it. I'm committing a fix for these issues right now.

I fixed the first issue by adding a . to the end of the default value, this circumvents the integer interpretation in argparse:

-parser.add_argument('-binsize', type=float, default=1000000,
+parser.add_argument('-binsize', type=float, default=1000000.,

For the second issue, I added the suggested line at L88:

 (if (start + totalCount) % binSize != 0:)

Let me know if this update did indeed fix these bugs (and not introduce new funny behaviour again)