JustinChu / ntsm

This tool counts the number of specific k-mers within sequence data. The counts can then be compared to other counts to compute the probability that samples are of the same origin, in order to discover incongruent samples or sample swaps.
MIT License

Does it work with non-model species? #3

Open DuttaAnik opened 1 month ago

DuttaAnik commented 1 month ago

Hello, Thanks for developing the tool. Does this tool work with non-model species of different ploidy?

JustinChu commented 1 month ago

Unfortunately, the code was designed specifically for diploid genomes. It considers whether a site is homozygous or heterozygous, though it can also handle missing sites. If you fed in sites with only 2 alleles that have roughly equal frequencies (as a hack), it may provide some results, but I cannot guarantee that the results will make sense.

This does have me wondering whether we could create a model that handles genomes of arbitrary ploidy without sacrificing statistical power.

DuttaAnik commented 1 month ago

Thanks for the reply. Although it is a far-fetched idea, it would be really cool to have this option in this tool along with handling multi-allelic sites. To my knowledge, no good tools are available to detect sample swap in non-model organisms.

JustinChu commented 1 month ago

I would be interested to know whether the tool gives back any meaningful results in your case if you run it (with the hack). If I were to guess, it could work given enough sites with high enough variability. In the worst case, I think it will say everything is unrelated, so I don't think it would hurt to try.

DuttaAnik commented 1 month ago

Hi, I have a few questions. First, thanks for fixing the parsing bug. It works now.

So, in this following command: scripts/generateSites name=prefix ref=reference.fa vcf=snps.vcf I should use the multisample VCF file that contains SNPs from all the samples, right?

Then, in this command: ntsmVCF -p prefix -s sites.fa -r reference.fa multiVCF.vcf Should I use the same VCF that I used in the first command? This is a bit confusing. And the sites.fa I assume is created from the first command, right?

Lastly, can I use a list of raw fastq files instead of writing them one by one in the command below? If yes, what should the format of the list file be? I have hundreds of fastq files. ntsmCount -t 2 -s sites.fa sample_part1.fq sample_part2.fq > counts.txt

Thank you.

JustinChu commented 1 month ago

So, in this following command: scripts/generateSites name=prefix ref=reference.fa vcf=snps.vcf I should use the multisample VCF file that contains SNPs from all the samples, right?

Edit*: Actually, the VCF used here doesn't need to be a multisample VCF. It just needs the biallelic variants.
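As a rough illustration of "just the biallelic variants", here is a minimal sketch that keeps only records with a single-base REF and a single-base ALT. The function name `filter_biallelic_snps` and the file names are made up for this example, and restricting the filter to SNPs is an assumption on my part; this is not how ntsm itself selects sites.

```bash
#!/usr/bin/env bash
# Sketch only: filter_biallelic_snps is a hypothetical helper, not part of ntsm.
# It prints the VCF header plus every record whose REF (column 4) and
# ALT (column 5) are each exactly one A/C/G/T base. Multi-allelic records
# (ALT contains a comma) and indels are dropped.
filter_biallelic_snps() {
    local vcf="$1"   # path to an uncompressed VCF
    awk -F'\t' '
        /^#/ { print; next }                              # keep header lines
        $4 ~ /^[ACGTacgt]$/ && $5 ~ /^[ACGTacgt]$/ { print }  # biallelic SNPs only
    ' "$vcf"
}
```

Usage would look like `filter_biallelic_snps all_variants.vcf > snps.vcf`, with the result passed as `vcf=snps.vcf` to scripts/generateSites. If bcftools is installed, `bcftools view -m2 -M2 -v snps` is a common way to do the same filtering, including on compressed files.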

Then, in this command: ntsmVCF -p prefix -s sites.fa -r reference.fa multiVCF.vcf Should I use the same VCF that I used in the first command? This is a bit confusing. And the sites.fa I assume is created from the first command, right?

Edit*: The multi VCF file here must be a multisample VCF with reliable genotyping results from a reliable set of samples to capture the population structure. It can be, but does not have to be, the same as above. Also, ideally the multisample VCF used should not contain any of the samples used in the sample swap detection process downstream. Your assumption about sites.fa is correct. I've changed the readme to clarify where sites.fa comes from. I've also added text to mention that using a rotation matrix is optional.

Lastly, can I use a list of raw fastq files instead of writing them one by one in the command below? If yes, what should the format of the list file be? I have hundreds of fastq files. ntsmCount -t 2 -s sites.fa sample_part1.fq sample_part2.fq > counts.txt

At the moment I don't have support for a file list. However, Unix globs (i.e. wildcards such as *) should work. Also, to be clear, each sample will need its own count file and thus a separate ntsmCount command.
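With hundreds of files, the glob and the one-command-per-sample requirement can be combined in a loop. The sketch below assumes the files are named `<sample>_part<N>.fq` and sit in one directory; that naming scheme, the function name `count_all_samples`, and the `.counts.txt` suffix are assumptions for illustration. The only ntsmCount options used are the ones from the command above.

```bash
#!/usr/bin/env bash
# Sketch only: count_all_samples is a hypothetical wrapper, not part of ntsm.
# Assumes FASTQ files are named <sample>_part<N>.fq in a single directory.
count_all_samples() {
    local fastq_dir="$1"   # directory holding the FASTQ files
    local sites="$2"       # sites.fa produced by scripts/generateSites
    local out_dir="$3"     # directory for the per-sample count files
    local first sample

    mkdir -p "$out_dir"

    # Each *_part1.fq file identifies one sample.
    for first in "$fastq_dir"/*_part1.fq; do
        [ -e "$first" ] || continue   # skip if the glob matched nothing
        sample="$(basename "$first" _part1.fq)"
        # The glob expands to every part of this sample; one count file per sample.
        ntsmCount -t 2 -s "$sites" "$fastq_dir/${sample}"_part*.fq \
            > "$out_dir/${sample}.counts.txt"
    done
}
```

Calling `count_all_samples fastq/ sites.fa counts/` would then run one ntsmCount command per sample and write `counts/<sample>.counts.txt` for each.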