Publication: https://doi.org/10.1101/2023.11.01.565041
This tool counts the number of specific k-mers within sequence data. The counts can then be compared to other counts to determine and compute the probability that samples are of the same origin to discover incongruent samples or sample swaps. It is intended to be run before any analysis and can provide additional QC information like sequencing error rate.
By default, this tool will only return pairs of samples with the same origin. This tool can theoretically also be used for relatedness inference using the -a
parameter, however in this case the PCA-based heuristic should not be used (see below).
General:
For generating site fasta files given a VCF file
If cloning directly from the repository run:
./autogen.sh
Compiling should be as easy as:
./configure && make
To install in a specified directory:
./configure --prefix=/PATH && make install
Given a VCF file and a reference genome, you can produce fasta files with k-mers that one can use to create fingerprinting. We have provided a set of human data based on similar criteria found in SNP microarrays.
The VCF file in this stage can be a single sample VCF, it just needs the variants.
Example:
scripts/generateSites name=sites ref=reference.fa vcf=snps.vcf
Creates fasta files referred to as sites_n{min missing sub k-mers}.fa
below (but name can be changed by specific another name
). By default all non C/G <-> A/T conversions are ignored.
Parameters:
w=31 #window size to consider sequences in this region
k=19 #kmer size used in the window region
t=4 #threads for any subprocess or tools
The sites fasta file generated is a set of k-mers for each site. n
for each fasta file generated refers to the number of missing k-mers allowed after filtering out repetitive k-mers (e.g. n=0
would mean you need each site to have all possible k-mers). The max value of n
is w - k
(so with default parameters 12). Small values of n
will help you handle sequencing errors, but you need to have enough sites for our statistical analysis to work. In our testing retaining around ~10^5 sites seemed to work well, though if you end up with slightly less or a lot more ntsm should still function well.
If you do not wish to select your own sites, we currently include data/human_sites_n10.fa
a fasta file with selected 96287 sites adequate for sample swap detection for human samples in the data
folder.
Once fasta files for sites have been created, it is possible to create a PCA rotation matrix for speeding up the analysis. To do so you must supply a multiVCF file from which the PCA will be built. This multi-sample VCF ideally should not contain the same samples as the VCF used in the sample swap detection process. It should be a set of reliable samples on which a PCA and rotational matrix would be based on. We note that the use of a rotational matrix is optional.
Example:
ntsmVCF -p prefix -s sites.fa -r reference.fa multiVCF.vcf
scripts/convertTSVtoPCA.py -p prefix -m prefix_matrix.tsv
Again if you are working with human samples and do not wish to generate your own, we currently include data/human_sites_rotationMat.tsv
and human_sites_center.txt
to use in our PCA-based heurstic. We based our PCA and rotation matrix on 3202 samples from the 1000 Genomes Project.
Using this set of k-mers we can then count all of these k-mers within a fastq file. Files may be gzipped and multiple threads can be used. Each sample needs a separate run of this command and its own count files.
Example:
ntsmCount -t 2 -s sites.fa sample_part1.fq sample_part2.fq > counts.txt
Creates count file using 2 threads. A sliding window using 19-mers is used in this case and the highest count in the window is recorded.
If your files are unsorted and have massive coverage, you may also intentionally run less reads using the -m
parameter:
ntsmCount -t 2 -m 10 -s sites.fa sample_part1.fq sample_part2.fq > counts.txt
This will run the file until the average site coverage reaches 10x, which should be adequate for most sequencing data types. Lower coverage is possible if the read error rate is low enough.
Output Example:
#@TK 119443488624
#@KS 19
#locusID countAT countCG sumAT sumCG distinctAT distinctCG
rs1741692 23 22 68 190 3 9
rs6419870 0 43 0 86 1 2
rs3171927 19 20 91 20 5 1
rs12057128 16 0 31 0 2 5
rs11976368 43 0 43 0 1 10
rs4545798 17 13 34 65 2 5
rs10888802 0 37 0 355 9 10
...
Header lines (#@
) helps in error rate estimation.
Example command:
ntsmEval HG002_rep1_counts.txt HG002_rep2_counts.txt HG003_counts.txt HG004_counts.txt > summary.tsv
or if you wish optionally to speed up the analysis using a PCA rotation matrix:
ntsmEval -a -t 16 -n data/human_sites_center.txt -p data/human_sites_rotationMat.tsv HG002_rep1_counts.txt HG002_rep2_counts.txt HG003_counts.txt HG004_counts.txt > summary.tsv
Output Example (with -a
option):
sample1 sample2 score same dist relate ibs0 ibs2 homConcord het1 het2 sharedHet hom1 hom2 sharedHom n cov1 cov2 errorRate1 errorRate2 miss1 miss2 allHom1 allHom2 allHet1 allHet2
HG002_rep1_counts.txt HG002_rep2_counts.txt 0.07988 1 0.004839 0.996827 0 95971 0.998287 33720 33787 33613 62532 62465 62358 96252 37.416162 45.260554 0.003493 0.004301 35 35 62532 62465 33720 33787
HG003_counts.txt HG004_counts.txt 3.430842 0 7.512569 -0.003973 6649 48672 0.355549 33473 33781 13165 62772 62464 35507 96245 44.931787 44.068285 0.005968 0.004208 34 38 62779 62466 33474 33783
HG002_rep1_counts.txt HG003_counts.txt 1.660803 0 4.732675 0.498775 24 62518 0.731288 33720 33474 16744 62528 62774 45774 96248 37.416162 44.931787 0.003493 0.005968 35 34 62532 62779 33720 33474
HG002_rep1_counts.txt HG004_counts.txt 1.653478 0 2.872071 0.500089 19 62525 0.72982 33720 33783 16901 62525 62462 45624 96245 37.416162 44.068285 0.003493 0.004208 35 38 62532 62466 33720 33783
HG002_rep2_counts.txt HG003_counts.txt 1.760081 0 4.707002 0.499821 24 62521 0.73156 33787 33474 16779 62461 62774 45742 96248 45.260554 44.931787 0.004301 0.005968 35 34 62465 62779 33787 33474
HG002_rep2_counts.txt HG004_counts.txt 1.74858 0 2.78644 0.4996 19 62488 0.729034 33787 33783 16916 62458 62462 45572 96245 45.260554 44.068285 0.004301 0.004208 35 38 62465 62466 33787 33783
...
Note that the -a
parameter will not output all pairwise comparisons if the PCA heuristic is used and will only consider comparing samples within a certain distance in PCA space. Though related samples are more likely to exist closer in PCA space, if using ntsm for relatedness inference you may miss some related pairs if you use this heuristic. You would not likely want to use -a
with the PCA heuristic in most cases and the example above is for more illustrative purposes of the tool's use and output.
Column explanations:
If run on a single counts file the output can look like this:
sample cov errorRate miss hom het PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20
HG002_rep1_counts.txt 37.416162 0.003493 35 62532 33720 -14.254352 -23.285693 -0.373179 -7.315873 -1.187992 -5.494577 -0.434657 -0.334589 -0.832297 -1.160507 0.286102 -0.114464 1.013333 0.252766 -0.204102 0.465836 0.694361 0.099620 0.019345 -1.195279
This provides generic QC information useful without other samples (e.g. error rate) and if inclined a means of plotting the sample relative to others on a PCA plot. The number of columns is dependant on the number of principal components used.