Open MaelLefeuvre opened 1 year ago
Hi Mael! Thank you so much for your bug discoveries! I have been unable to focus on these for a few months for personal reasons, but I have kept an eye on this and would gladly have a Zoom call with you to discuss them, and your fork. If you're up for it, please drop me an email to the address in TKGWV2's Readme contact details, and we can set something up! Cheers!
Hello Daniel,
I'm sorry to announce I discovered quite the evasive bug/quirk in
bam2plink.R
(which might be related to issue #2 ). But I'm hoping I found a simple workaround which would easily dovetail within your existing codebase, while preserving users from unknowingly getting corrupted results.Bug description
The issue seems to arises whenever a lexicographically sorted reference genome is used to align the input samples (as in what is found within Ensembl' GRCh37).
A. When such is the case, the command located in bam2plink.R:84, i.e.:
...will, as expected, print pileup lines in lexicographical order (since
samtools
will always follow the chromosome order that is found in the samfile's header).B. It then follows that, when subsequently parsing the generated pileup file using pileup2ped.py will also produce a lexicographically sorted
.ped
fileHowever, the subsequent command, located in bam2plink.R:92, i.e.:
will on the other hand always output a set of numerically sorted [.ped|.map] files, regardless of how this input was previously generated. (Since
plink-1.9
will always numerically encode chromosomes and reorder them using numeric sort when creating binary filesets).This creates a situation where:
rsids
, genomic coordinate, and reference allele information found in<sample>.map
is numerically sorted (e.g.:chr 1 2 3 4 5 6 7 8 9 ...
)<sample>.ped
is lexicographically sorted (e.g.:chr 1 10 11 12 ... 20 21 22 3 4 5 ...
)Or, in other terms, this results in a scrambled output, where
rsids
and genome coordinates do not match the genotypes anymore.Consequences
Once
pileup2tkrelated.R
takes over, the program will merge sample genotypes with allele frequencies according to row index. Since the genotype order within thesample1____sample2.tped
file is actually scrambled, the effects of this bug are two-folds:.frq
file with the genotype information, and will in most cases falsely consider these positions as triallelic during stepQC4
. Thus causing a tremendous loss of information.Reproducing the bug using a minimal reproductible example
I've constructed a dummy dataset along with a set of instructions, which can be found here: tkgwv2-mre.
Applying the procedure within this example-repo results in the following output:
However, since
dummy-numeric-*.sam
anddummy-lexicographic-*.sam
samples are in fact exact copies (appart from being ordered differently), the expected output should instead be:bam2plink.R
pileup2ped.py
. e.g.:dummy-numeric-1 dummy-numeric-1 0 0 0 1 C C G G T T
dummy-lexicographic-1 dummy-lexicographic-1 0 0 0 1 C C T T G G
Here we can notice that, as explained above, the allele order of the
.ped
files follows the one found within the input.sam
|.bam
file, while the order of the companion.map
always follows a numeric order.Fixing the bug:
pileup2ped.py
Note the version sort argument (
-V
) ensures the pileup gets properly sorted, even in the presence ofchr
prefixes within the chromosome tags.Adding a simple sanity check to ensure correct sort order.
Looking deeper, another place where a similar problem may arise is located in plink2tkrelated.R:161-166: The following code block...
Inherently assumes the input
freqFile.frq
is sorted in the same way ascomm<sampleA>____<sampleB>.tped
. While this event is unlikely to happen (givenplink-1.9
' behavior), I would still highly recommend proceeding to a very simple sanity check just after binding these two data-frames. Adding a clause in the likes of...... at plink2tkrelated.R:167, would at least ensure the binded dataset is properly sorted before attempting to compute an HRC, and notify users of the discrepancy instead of outputing corrupted results.
Finally, adding a warning regarding this matter within your README might help guard users from this silent error state.
I dearly hope my descriptions were clear enough and not too convoluted. Hope these suggestions will be of help.
I've included the described modifications in my personal fork of your project (See my previous PR: tkgwv2/pull@5 and the corresponding repository).
I don't know If you're accepting contributions and Pull Requests (since my previous one is still pending), but I'd be glad to submit a separate one, resolving this specific issue, if it would be of help.
Finally, know that I'm happy to discuss the subject in more detail, or to clarify some points if needed. In the meantime, I wish you a very pleasant day.
Cheers!