Closed lijiang825 closed 11 months ago
Hi, I actually have the same issue. Ran the tool, all good, lovely! Really great documentation too, I was able to run it super fast. But now I'm struggling with this output table. What I want to know is for each barcode, which mutations are there. Would be super helpful to get a bit more guidance.
Dear users, Sorry for the lack of documentation for this output. Please find here a more detailed description of each one of the columns:
Column | Description |
---|---|
#CHROM | Chromosome carrying the mutation |
Start | Start genomic coordinate |
End | End genomic coordinate |
REF | Reference allele |
ALT_expected | Alternative allele as described in the input file (--infile ) |
Cell_type_expected | Cell types harbouring the mutation as described in the input file (--infile ) |
Num_cells_expected | Number of expected cells carrying the mutation as described in the input file (--infile ) |
CB | Unique cell barcode analysed |
Cell_type_observed | Cell type attributed to the analysed CB according to the input metadata file (--meta ) |
Base_observed | Allele observed in this CB |
Num_reads | Number of reads carrying the Base_observed allele |
Let's understand this table with an example. Looking at our SComatic example data, we will focus on the variant site chr10-29559501 in the generated file SComatic/example_data/results/SingleCellAlleles/Epithelial_cells.single_cell_genotype.tsv.
#CHROM Start End REF ALT_expected Cell_type_expected Num_cells_expected CB Cell_type_observed Base_observed Num_reads
chr10 29559501 29559501 A T Epithelial_cells 2 AGTCTTTGTGCATCTA Epithelial_cells A 5
chr10 29559501 29559501 A T Epithelial_cells 2 CCCTCCTAGGCTAGGT Epithelial_cells A 1
chr10 29559501 29559501 A T Epithelial_cells 2 GGGTCTGTCTTGAGGT Epithelial_cells T 2
chr10 29559501 29559501 A T Epithelial_cells 2 GTCCTCAAGGCTCATT Epithelial_cells T 2
chr10 29559501 29559501 A T Epithelial_cells 2 GAGTCCGAGGGTGTTG Epithelial_cells A 2
The columns ALT_expected, Cell_type_expected and Num_cells_expected correspond to the values found in the --infile Example.calling.step2.pass.tsv, so they represent the calls at cell type resolution.
In contrast, the columns CB, Cell_type_observed, Base_observed and Num_reads correspond to the allele observations at unique cell resolution when interrogating the bam files.
Each CB can appear in the output file in as many rows as there are different alleles found in that cell, although in most cases we only observe one allele per cell (so one row per unique CB). To find the cells harbouring the called mutation, we have to look for those rows (unique CBs) where ALT_expected == Base_observed and Cell_type_expected == Cell_type_observed. In general terms, CBs that do not meet these conditions can be understood as noise or non-mutated cells.
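As a minimal sketch of the filter described above (not part of SComatic itself), the following Python snippet applies the two conditions to the five example rows. It assumes the real file is tab-separated with the header shown above and that the cell type columns can be compared by exact string equality; `mutated_rows` and `EXAMPLE_TSV` are names made up for this illustration.

```python
import csv
import io

# The five example rows from this thread, rebuilt as tab-separated text.
# Assumption: the real output file is tab-separated with this same header.
EXAMPLE_TSV = "\n".join(
    "\t".join(line.split())
    for line in """
#CHROM Start End REF ALT_expected Cell_type_expected Num_cells_expected CB Cell_type_observed Base_observed Num_reads
chr10 29559501 29559501 A T Epithelial_cells 2 AGTCTTTGTGCATCTA Epithelial_cells A 5
chr10 29559501 29559501 A T Epithelial_cells 2 CCCTCCTAGGCTAGGT Epithelial_cells A 1
chr10 29559501 29559501 A T Epithelial_cells 2 GGGTCTGTCTTGAGGT Epithelial_cells T 2
chr10 29559501 29559501 A T Epithelial_cells 2 GTCCTCAAGGCTCATT Epithelial_cells T 2
chr10 29559501 29559501 A T Epithelial_cells 2 GAGTCCGAGGGTGTTG Epithelial_cells A 2
""".strip().splitlines()
)


def mutated_rows(handle):
    """Yield rows where the observed allele and cell type match the call."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if (row["ALT_expected"] == row["Base_observed"]
                and row["Cell_type_expected"] == row["Cell_type_observed"]):
            yield row


# Hypothetical usage on the in-memory example; for a real file, pass
# open(path, newline="") instead of io.StringIO(...).
mutated_barcodes = [row["CB"] for row in mutated_rows(io.StringIO(EXAMPLE_TSV))]
print(mutated_barcodes)  # ['GGGTCTGTCTTGAGGT', 'GTCCTCAAGGCTCATT']
```

For this site, two barcodes pass the filter, which agrees with the Num_cells_expected value of 2 in the table.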
Thanks, Fran
Hi Fran, thank you for the detailed explanation. That makes a lot of sense now. What would be your advice on how to use this info for plotting how many mutations are found in each individual cell?
You could do this by using R or Python.
The basic strategy would be to keep the rows where ALT_expected == Base_observed and Cell_type_expected == Cell_type_observed, and then count the remaining rows per CB. Basically, the number of mutations per cell. Cheers, Fran
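One possible way to do the counting in Python, using only the standard library, is sketched below. It is an illustration rather than an official SComatic script: `mutations_per_cell` is a hypothetical helper, the input is assumed to be tab-separated with the header shown earlier, and each mutation is identified by its chromosome, coordinates and ALT_expected so that a site is counted once per cell.

```python
import csv
import io
from collections import Counter, defaultdict

# Same five example rows as in this thread, rebuilt as tab-separated text.
EXAMPLE_TSV = "\n".join(
    "\t".join(line.split())
    for line in """
#CHROM Start End REF ALT_expected Cell_type_expected Num_cells_expected CB Cell_type_observed Base_observed Num_reads
chr10 29559501 29559501 A T Epithelial_cells 2 AGTCTTTGTGCATCTA Epithelial_cells A 5
chr10 29559501 29559501 A T Epithelial_cells 2 CCCTCCTAGGCTAGGT Epithelial_cells A 1
chr10 29559501 29559501 A T Epithelial_cells 2 GGGTCTGTCTTGAGGT Epithelial_cells T 2
chr10 29559501 29559501 A T Epithelial_cells 2 GTCCTCAAGGCTCATT Epithelial_cells T 2
chr10 29559501 29559501 A T Epithelial_cells 2 GAGTCCGAGGGTGTTG Epithelial_cells A 2
""".strip().splitlines()
)


def mutations_per_cell(handle):
    """Return {cell barcode: number of distinct mutated sites in that cell}."""
    sites_by_cell = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        # Keep only rows supporting the called mutation in the expected cell type.
        if (row["ALT_expected"] == row["Base_observed"]
                and row["Cell_type_expected"] == row["Cell_type_observed"]):
            site = (row["#CHROM"], row["Start"], row["End"], row["ALT_expected"])
            sites_by_cell[row["CB"]].add(site)
    return {cb: len(sites) for cb, sites in sites_by_cell.items()}


counts = mutations_per_cell(io.StringIO(EXAMPLE_TSV))
# How many cells carry 1, 2, 3, ... mutations (ready to feed into a bar plot).
histogram = Counter(counts.values())
print(counts)     # {'GGGTCTGTCTTGAGGT': 1, 'GTCCTCAAGGCTCATT': 1}
print(histogram)  # Counter({1: 2})
```

Cells without any matching row do not appear in `counts`, so they would need to be added with a count of zero if the plot should include non-mutated cells.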
Thank you so much! I will attempt to do this today :)
Hi,
I was able to install and run SComatic and this is a great tool! I am wondering if you would share a table outlining what each column in the output of the final single cell genotyping step represents? Currently, there are 12 columns, but a subset of them look confusing to me, including "ALT_expected", "Cell_type_expected", "Num_cells_expected", "CB", "Cell_type_observed", "Base_observed", "Num_reads". Thank you very much!
Best, Li