JustinChu / ntsm

This tools counts the number of specific k-mers within sequence data. The counts can then be compare to other counts to determine to compute the probability that sample are of the same origin to discover incongruent samples or sample swaps.
MIT License
25 stars 1 forks source link

ntsmEval `-a` flag filters pairs in PCA projection mode #5

Closed YourePrettyGood closed 3 months ago

YourePrettyGood commented 3 months ago

Hi Justin, ntsm is a very neat tool! It's nice to see something faster and lighter than even somalier and works across sequencing platforms.

I was recently testing ntsm (commit 3308a45) out on some Illumina HGDP data (from Bergström et al. 2020 Science) and noticed something a bit wonky: When the -a flag is set and PCA projection mode is enabled with -n [path to centering file] -p [path to rotation matrix], ntsmEval outputs only a subset of the pairs that it should. Specifically, it seems as though the output pairs are those with the lowest scores (the parent-offspring pairs in my limited tests).

The -a flag works as intended if -n and -p are omitted, and setting -s to rather large values (e.g. 1000.0) in PCA projection mode doesn't seem to recover any of the missing pairs even though the scores without PCA projection are far less (i.e. < 3).

The smallest test I was able to run to reproduce this problem uses HGDP00490, HGDP00661, and HGDP00662. HGDP00490 and HGDP00662 are parent-offspring, while any pairs with HGDP00661 are unrelated -- this can be confirmed with somalier on the BAMs and KING on the variant calls.

PCA projection mode filtering out pairs:

ntsmEval -a -t 2 -n ${NTSMCEN} -p ${NTSMPCA} -v counts/{HGDP00490,HGDP00661,HGDP00662}*
Reading count files
Projecting samples onto PCA
Detected 20 components for 96287 sites
sample1 sample2 score   same    dist    relate  ibs0    ibs2    homConcord  het1    het2    sharedHet   hom1    hom2    sharedHom   n   cov1    cov2    errorRate1  errorRate2  miss1   miss2   allHom1 allHom2 allHet1allHet2
counts/HGDP00490_counts.txt counts/HGDP00662_counts.txt 1.192728    3.561103    0.498989    29  69519   0.807026    26738   26706   13384   69486   69518   56135   96224   32.937437   35.598170   -0.001015   -0.002111   52  48  69497   69533   26738   26706
Time: 1.10731 s Memory: 30548 kbytes

No PCA projection mode working as intended:

ntsmEval -a -t 2 -v counts/{HGDP00490,HGDP00661,HGDP00662}*
Reading count files
Performing all-to-all score computation.
Specify -p (--pca) to enable faster comparisons.
sample1 sample2 score   same    dist    relate  ibs0    ibs2    homConcord  het1    het2    sharedHet   hom1    hom2    sharedHom   n   cov1    cov2    errorRate1  errorRate2  miss1   miss2   allHom1 allHom2 allHet1allHet2
counts/HGDP00490_counts.txt counts/HGDP00661_counts.txt 2.428160    -1  0.045859    4852    58387   0.554270    26734   28101   10930   69480   68113   47457   96214   32.937437   48.058357   -0.001015   0.001587    52  53  69497   68132   26738   28102
counts/HGDP00490_counts.txt counts/HGDP00662_counts.txt 1.192728    -1  0.498989    29  69519   0.807026    26738   26706   13384   69486   69518   56135   96224   32.937437   35.598170   -0.001015   -0.002111   52  48  69497   69533   26738   26706
counts/HGDP00661_counts.txt counts/HGDP00662_counts.txt 2.554406    -1  0.024455    5083    57968   0.542949    28101   26702   10819   68115   69514   47149   96216   48.058357   35.598170   0.001587    -0.002111   53  48  68132   69533   28102   26706
Time: 0.223604 s Memory: 26368 kbytes

(The counts files were generated with ntsmCount -t 4 -s ${NTSMSITES} [FASTQs] > [sample ID]_counts.txt run via a small Nextflow pipeline.)

In case it helps for reproducing the issue, I've attached a small table with the relevant FASTQ URLs, file sizes, MD5 checksums, and renamed FASTQs by HGDP sample ID. ntsm_HGDP_PCAall_test_samples.txt

Hopefully there's a simple fix for this. I took a glance through the source, but didn't see any obvious culprit.

Best regards, Patrick Reilly

JustinChu commented 3 months ago

Hi Patrick,

Thanks for the report. Actually, I had no idea for a moment why this would be happening initially. I looked into it and remembered that the -a flag wouldn't work using the PCA heuristic by design, but seeing as I even forgot why this was happening I think I should probably clarify this behaviour in the documentation. In addition I should perhaps also make clear that for a small number of samples (i.e. <100), using the PCA heuristic is unnecessary.

In essence, the PCA heuristic does not bother with pairs it deems too distant in PCA space. When -a is specified with the PCA options, it may only return pairs that exist close in PCA space or are so poor quality it doesn't trust the PCA similarity heuristic to be used. The -a command will not result in any extra computation rather it will report results it would have discarded as not being similar given our main similarity metric. Without the PCA heuristic, all-to-all comparisons are made regardless. HGDP00490 and HGDP00662 were sufficiently similar in PCA space to be considered for a more comprehensive evaluation (likely due to being related), even if deemed actually dissimilar in the end. HGDP00490 was not seen as similar via PCA and was not even considered and thus was not in the output.

If you really wanted it for some reason (maybe for the Euclidean distance value in PCA space in 20 PCs, i.e. the dist value) you'd have to set the PCA heuristic to actually not function properly. That is by specifying -a with the thresholds for missing percent of sites allowed set to 0 (using options -1 0 -2 0) meaning it would consider all datasets to have too much missing data and default to full evaluation for all possible pairs but still compute a distance between the datasets in PCA space.

I hope that makes sense and let me know if have any recommendations.

YourePrettyGood commented 3 months ago

Hi Justin, That totally makes sense! In some of the larger tests I had run, the PCA heuristic only output 1st degree comparisons, so that fits with your explanation. And it makes sense when the primary design goal is detecting sample swaps.

It would definitely be worth clarifying this in the documentation. Something that, at least in my opinion, would be helpful would be to distinguish the use of ntsm for sample swap detection versus relatedness inference, since the latter is demonstrated in Figures 9 and 10 of your preprint. Relatedness inference is a super useful feature, even without the sub-quadratic runtime of the PCA mode. Somalier was already a big jump in efficiency compared to needing VCFs for KING, so ntsm makes the process even more efficient, especially considering the reasonable performance with low depth that you showed.

Anyways, thanks for the clarification! That solves my question, so I'll close the issue.

Best regards, Patrick Reilly