danimfernandes / tkgwv2

An ancient DNA relatedness pipeline for ultra-low coverage whole genome shotgun data
GNU General Public License v2.0
6 stars 2 forks source link

[bug][bam2plink.R] Using TKGWV2 with samples aligned on a lexicographically sorted reference genome causes silent bug and outputs corrupted results. #6

Open MaelLefeuvre opened 1 year ago

MaelLefeuvre commented 1 year ago

Hello Daniel,

I'm sorry to announce I discovered quite the evasive bug/quirk in bam2plink.R (which might be related to issue #2 ). But I'm hoping I found a simple workaround which would easily dovetail within your existing codebase, while preserving users from unknowingly getting corrupted results.

Bug description

The issue seems to arises whenever a lexicographically sorted reference genome is used to align the input samples (as in what is found within Ensembl' GRCh37).

A. When such is the case, the command located in bam2plink.R:84, i.e.:

commA = paste0("samtools mpileup -Q ",minBQ," -q ",minMQ," -B -f ",referenceGenome," -l ",gwvList," ",i," > ",sid,".pileupsamtools.gwv.txt")

...will, as expected, print pileup lines in lexicographical order (since samtools will always follow the chromosome order that is found in the samfile's header).

B. It then follows that, when subsequently parsing the generated pileup file using pileup2ped.py will also produce a lexicographically sorted .ped file

However, the subsequent command, located in bam2plink.R:92, i.e.:

comm3 = paste0("plink --bfile ",gwvPlink," --extract range ",sid,".plinkRange --recode --out ",sid," --allow-no-sex")

will on the other hand always output a set of numerically sorted [.ped|.map] files, regardless of how this input was previously generated. (Since plink-1.9 will always numerically encode chromosomes and reorder them using numeric sort when creating binary filesets).

This creates a situation where:

  1. the rsids, genomic coordinate, and reference allele information found in <sample>.map is numerically sorted (e.g.: chr 1 2 3 4 5 6 7 8 9 ...)
  2. the genotype information of the sample, located in <sample>.ped is lexicographically sorted (e.g.: chr 1 10 11 12 ... 20 21 22 3 4 5 ...)

Or, in other terms, this results in a scrambled output, where rsids and genome coordinates do not match the genotypes anymore.

Consequences

Once pileup2tkrelated.R takes over, the program will merge sample genotypes with allele frequencies according to row index. Since the genotype order within the sample1____sample2.tped file is actually scrambled, the effects of this bug are two-folds:

  1. The program will first match the reference allele of the .frq file with the genotype information, and will in most cases falsely consider these positions as triallelic during step QC4. Thus causing a tremendous loss of information.
  2. For the remaining SNPs, the program will then apply Queller & Goodnight' algorithm using scrambled allele frequencies. Thus, outputing a biased estimation of relatedness.

Reproducing the bug using a minimal reproductible example

I've constructed a dummy dataset along with a set of instructions, which can be found here: tkgwv2-mre.

Applying the procedure within this example-repo results in the following output:

Sample1                Sample2                Used_SNPs  HRC     counts0 counts4 Relationship
dummy-numeric-1        dummy-numeric-2        3          0.0633  1       2       2nd degree
dummy-lexicographic-1  dummy-lexicographic-2  1          NA      0       0       NA

However, since dummy-numeric-*.sam and dummy-lexicographic-*.sam samples are in fact exact copies (appart from being ordered differently), the expected output should instead be:

Sample1                Sample2                Used_SNPs  HRC     counts0 counts4 Relationship
dummy-numeric-1        dummy-numeric-2        3          0.0633  1       2       2nd degree
dummy-lexicographic-1  dummy-lexicographic-2  3          0.0633  1       2       2nd degree
The bug can be seen by comparing the output of bam2plink.R pileup2ped.py. e.g.: fileset contents
dummy-numeric-1.ped dummy-numeric-1 dummy-numeric-1 0 0 0 1 C C G G T T
dummy-lexicographic-1.ped dummy-lexicographic-1 dummy-lexicographic-1 0 0 0 1 C C T T G G

Here we can notice that, as explained above, the allele order of the .ped files follows the one found within the input .sam|.bam file, while the order of the companion .map always follows a numeric order.

Fixing the bug:

  1. Adding the following lines right below bam2plink.R:85 ensures the output of samtools follows a numeric sort order before feeding to pileup2ped.py
    # ---- Sort pileup in numerical order (using version-sort)
    commB = paste0("sort -Vo ", sid, ".pileupsamtools.gwv.txt -k1 -k2 ", sid, ".pileupsamtools.gwv.txt")
    system(commB, ignore.stderr = TRUE)

    Note the version sort argument (-V) ensures the pileup gets properly sorted, even in the presence of chr prefixes within the chromosome tags.

Adding a simple sanity check to ensure correct sort order.

Looking deeper, another place where a similar problem may arise is located in plink2tkrelated.R:161-166: The following code block...

    if((length(pairLoadNew$V1) == length(alFreqNew$CHR)) == TRUE) {
      pairLoadNew$A1Mi = alFreqNew$A1
      pairLoadNew$A2Ma = alFreqNew$A2
      pairLoadNew$al1freq = alFreqNew$MAF
      pairLoadNew$al2freq = alFreqNew$AFa2
    }

Inherently assumes the input freqFile.frq is sorted in the same way as comm<sampleA>____<sampleB>.tped. While this event is unlikely to happen (given plink-1.9' behavior), I would still highly recommend proceeding to a very simple sanity check just after binding these two data-frames. Adding a clause in the likes of...

if (!all(pairLoadNew$V1 == alFreqNew$CHR)) {
  stop("Invalid chromosome sort order between allele frequencies and output .tped. Ensure your input dataset is numerically sorted across chromosomes and try again.")
}

... at plink2tkrelated.R:167, would at least ensure the binded dataset is properly sorted before attempting to compute an HRC, and notify users of the discrepancy instead of outputing corrupted results.

Finally, adding a warning regarding this matter within your README might help guard users from this silent error state.


I dearly hope my descriptions were clear enough and not too convoluted. Hope these suggestions will be of help.

I've included the described modifications in my personal fork of your project (See my previous PR: tkgwv2/pull@5 and the corresponding repository).

I don't know If you're accepting contributions and Pull Requests (since my previous one is still pending), but I'd be glad to submit a separate one, resolving this specific issue, if it would be of help.

Finally, know that I'm happy to discuss the subject in more detail, or to clarify some points if needed. In the meantime, I wish you a very pleasant day.

Cheers!

danimfernandes commented 1 year ago

Hi Mael! Thank you so much for your bug discoveries! I have been unable to focus on these for a few months for personal reasons, but I have kept an eye on this and would gladly have a Zoom call with you to discuss them, and your fork. If you're up for it, please drop me an email to the address in TKGWV2's Readme contact details, and we can set something up! Cheers!