fasterius / VarClust

A Python package for clustering of single nucleotide variants from high-throughput sequencing data.

[Confusion between Sample and Cell VCFs] - More clarification required! #2

Closed kerimsecener closed 4 years ago

kerimsecener commented 4 years ago

Hey,

I have a directory containing 10 VCF files, where each VCF file corresponds to one sample (a collection of cells; 10 samples in total). The documentation on GitHub mentions VCF files for each sample. But according to your paper, as far as I understand, you apply this method to VCFs corresponding to individual cells rather than samples? And the t-SNE clustering shown in the paper is indeed a clustering of cells based on their individual SNP profiles? Is this correct?

If so, how can I generate VCF files for individual cells in my samples?

Thanks!

fasterius commented 4 years ago

You can indeed cluster single cells, which is the general idea of the software. A sample in this case (and in the paper) is a single cell, but it doesn't have to be: the software can also be used for bulk analyses, hence using "sample" rather than only "single cell". The important thing is that every VCF file you want to analyse contains only one VCF-level sample, i.e. one sample column coming after the FORMAT column. This is in contrast to multi-sample VCFs, which contain multiple sample columns.

To demonstrate, look at the example in the VCF format documentation, which is a multi-sample VCF file. It contains three sample columns, named NA00001, NA00002 and NA00003. VarClust would only create an SNV profile for the first sample and ignore the rest (provided that the VCF file was named NA00001.vcf, which is most likely not the case for multi-sample VCFs).
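
To make the single-sample versus multi-sample distinction concrete, here is a minimal sketch in plain Python that lists the sample columns of a VCF header. It is not part of VarClust; the helper name `sample_names` is hypothetical, and the only thing it relies on is the VCF convention that sample columns follow the FORMAT column on the `#CHROM` header line.

```python
# The first eight columns are fixed by the VCF format; FORMAT and the
# sample columns follow them on the "#CHROM" header line.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def sample_names(vcf_lines):
    """Return the sample column names found in a VCF header.

    Hypothetical helper for illustration; not a VarClust function.
    """
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # columns[8] is FORMAT; everything after it is a sample column.
            return columns[9:]
    raise ValueError("no #CHROM header line found")


# Header modelled on the multi-sample example discussed above.
header = "\t".join(FIXED_COLUMNS + ["FORMAT", "NA00001", "NA00002", "NA00003"])
names = sample_names(["##fileformat=VCFv4.2\n", header + "\n"])
print(names)  # ['NA00001', 'NA00002', 'NA00003']
```

A file suitable for this kind of analysis would return a list with exactly one name.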

So, as long as you have a single VCF file for each of your single cells and name them according to the sample name column, you're good to go!
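
If you want to verify a whole directory before running anything, a rough sketch along these lines could help. It is not VarClust code: `check_vcf_directory` is a hypothetical helper, it only looks at uncompressed `*.vcf` files, and it assumes that "named according to the sample name column" means the file name without `.vcf` equals the single sample column name.

```python
from pathlib import Path


def read_sample_columns(path):
    """Return the sample column names from the #CHROM line of a VCF file."""
    with open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 0-7 are fixed, column 8 is FORMAT, the rest are samples.
                return line.rstrip("\n").split("\t")[9:]
    return []


def check_vcf_directory(directory):
    """Map each *.vcf file name to a problem description, or None if it looks fine.

    Hypothetical helper; the file-name rule is an assumption (see above).
    """
    report = {}
    for path in sorted(Path(directory).glob("*.vcf")):
        names = read_sample_columns(path)
        if len(names) != 1:
            report[path.name] = f"expected 1 sample column, found {len(names)}"
        elif names[0] != path.stem:
            report[path.name] = f"file name does not match sample column {names[0]!r}"
        else:
            report[path.name] = None
    return report
```

Calling `check_vcf_directory("path/to/vcfs")` (the path is a placeholder) returns a dictionary in which every value should be `None` when all files are single-sample and consistently named.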

fasterius commented 4 years ago

I have now included more documentation explaining this, which will hopefully make it clearer. Do ask again if it is still unclear!