Hi Patrick,
Thanks for the report. I initially had no idea why this would be happening. I looked into it and remembered that the `-a` flag doesn't work with the PCA heuristic by design, but seeing as even I had forgotten why, I think I should clarify this behaviour in the documentation. I should perhaps also make clear that for a small number of samples (i.e. <100), using the PCA heuristic is unnecessary.
In essence, the PCA heuristic does not bother with pairs it deems too distant in PCA space. When `-a` is specified along with the PCA options, it can only return pairs that are close in PCA space or of such poor quality that the PCA similarity heuristic cannot be trusted. The `-a` flag does not trigger any extra computation; rather, it reports results that would otherwise have been discarded as dissimilar under the main similarity metric. Without the PCA heuristic, all-to-all comparisons are made regardless. HGDP00490 and HGDP00662 were sufficiently similar in PCA space to be considered for a more comprehensive evaluation (likely due to being related), and would have been reported even if they had ultimately been deemed dissimilar. The pairs involving HGDP00661 were not seen as similar in PCA space, so they were never evaluated and thus did not appear in the output.
If you really wanted those pairs for some reason (maybe for the Euclidean distance value in 20-PC PCA space, i.e. the `dist` value), you'd have to make the PCA heuristic deliberately not function. That is, specify `-a` with the thresholds for the allowed percentage of missing sites set to 0 (using options `-1 0 -2 0`). ntsm would then consider all datasets to have too much missing data and default to full evaluation of all possible pairs, while still computing a distance between the datasets in PCA space.
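For concreteness, a command along these lines should force that fallback; this is a sketch based only on the flags discussed in this thread, with placeholder file names, not a tested invocation:

```
# Sketch (untested): force full all-to-all evaluation while still
# computing PCA-space distances, by setting both missing-site
# thresholds to 0 so every dataset falls back to full evaluation.
# centering.txt, rotation_matrix.txt, and the counts-file glob are
# placeholders.
ntsmEval -a -1 0 -2 0 \
    -n centering.txt -p rotation_matrix.txt \
    *_counts.txt > all_pairs_with_dist.tsv
```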
I hope that makes sense, and let me know if you have any recommendations.
Hi Justin,

That totally makes sense! In some of the larger tests I had run, the PCA heuristic only output first-degree comparisons, so that fits with your explanation. And it makes sense when the primary design goal is detecting sample swaps.
It would definitely be worth clarifying this in the documentation. One thing that would, at least in my opinion, be helpful is distinguishing the use of ntsm for sample-swap detection versus relatedness inference, since the latter is demonstrated in Figures 9 and 10 of your preprint. Relatedness inference is a super useful feature, even without the sub-quadratic runtime of the PCA mode. Somalier was already a big jump in efficiency compared to needing VCFs for KING, so ntsm makes the process even more efficient, especially considering the reasonable performance at low depth that you showed.
Anyways, thanks for the clarification! That solves my question, so I'll close the issue.
Best regards, Patrick Reilly
Hi Justin,

ntsm is a very neat tool! It's nice to see something that is faster and lighter than even somalier and works across sequencing platforms.
I was recently testing ntsm (commit 3308a45) on some Illumina HGDP data (from Bergström et al. 2020 Science) and noticed something a bit wonky: when the `-a` flag is set and PCA projection mode is enabled with `-n [path to centering file] -p [path to rotation matrix]`, `ntsmEval` outputs only a subset of the pairs that it should. Specifically, it seems as though the output pairs are those with the lowest scores (the parent-offspring pairs in my limited tests). The `-a` flag works as intended if `-n` and `-p` are omitted, and setting `-s` to rather large values (e.g. 1000.0) in PCA projection mode doesn't seem to recover any of the missing pairs, even though the scores without PCA projection are far lower (i.e. < 3).

The smallest test I was able to run that reproduces this problem uses HGDP00490, HGDP00661, and HGDP00662. HGDP00490 and HGDP00662 are parent-offspring, while any pairs with HGDP00661 are unrelated -- this can be confirmed with somalier on the BAMs and KING on the variant calls.
PCA projection mode filtering out pairs:
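The invocation was along these lines (a sketch with placeholder paths; the actual output is omitted here):

```
# Sketch: -a plus PCA projection mode; only the lowest-scoring
# (parent-offspring) pair is reported, instead of all three pairs.
ntsmEval -a -n centering.txt -p rotation_matrix.txt \
    HGDP00490_counts.txt HGDP00661_counts.txt HGDP00662_counts.txt
```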
No PCA projection mode working as intended:
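And without the PCA options (again a sketch with placeholder counts-file names):

```
# Sketch: -a alone; all three pairwise comparisons are reported.
ntsmEval -a \
    HGDP00490_counts.txt HGDP00661_counts.txt HGDP00662_counts.txt
```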
(The counts files were generated with `ntsmCount -t 4 -s ${NTSMSITES} [FASTQs] > [sample ID]_counts.txt`, run via a small Nextflow pipeline.)

In case it helps for reproducing the issue, I've attached a small table with the relevant FASTQ URLs, file sizes, MD5 checksums, and renamed FASTQs by HGDP sample ID: ntsm_HGDP_PCAall_test_samples.txt
Hopefully there's a simple fix for this. I took a glance through the source, but didn't see any obvious culprit.
Best regards, Patrick Reilly