cortes-ciriano-lab / SComatic

A tool for detecting somatic variants in single cell data

Code for columns of the single cell level genotype result #30

Closed lijiang825 closed 11 months ago

lijiang825 commented 1 year ago

Hi,

I was able to install and run SComatic and this is a great tool! I am wondering if you would share a table outlining what each column in the output of the final single cell genotyping step represents? Currently, there are 12 columns, but a subset of them look confusing to me, including "ALT_expected", "Cell_type_expected", "Num_cells_expected", "CB", "Cell_type_observed", "Base_observed", "Num_reads". Thank you very much!

Best, Li

coraolpe commented 12 months ago

Hi, I actually have the same issue. Ran the tool, all good, lovely! Really great documentation too, I was able to run it super fast. But now I'm struggling with this output table. What I want to know is for each barcode, which mutations are there. Would be super helpful to get a bit more guidance.

Francesc-Muyas commented 11 months ago

Dear users, Sorry for the lack of documentation for this output. Please find here a more detailed description of each one of the columns:

Column              Description
#CHROM              Chromosome carrying the mutation
Start               Start genomic coordinate
End                 End genomic coordinate
REF                 Reference allele
ALT_expected        Alternative allele as described in the input file (--infile)
Cell_type_expected  Cell types harbouring the mutation as described in the input file (--infile)
Num_cells_expected  Number of expected cells carrying the mutation as described in the input file (--infile)
CB                  Unique cell barcode analysed
Cell_type_observed  Cell type attributed to the analysed CB according to the input metadata file (--meta)
Base_observed       Allele observed in this CB
Num_reads           Number of reads carrying the Base_observed allele


Let's walk through this table with an example. Using the SComatic example data, we will focus on the variant site chr10-29559501 in the generated file SComatic/example_data/results/SingleCellAlleles/Epithelial_cells.single_cell_genotype.tsv.

#CHROM  Start   End REF ALT_expected    Cell_type_expected  Num_cells_expected  CB  Cell_type_observed  Base_observed   Num_reads
chr10   29559501    29559501    A   T   Epithelial_cells    2   AGTCTTTGTGCATCTA    Epithelial_cells    A   5
chr10   29559501    29559501    A   T   Epithelial_cells    2   CCCTCCTAGGCTAGGT    Epithelial_cells    A   1
chr10   29559501    29559501    A   T   Epithelial_cells    2   GGGTCTGTCTTGAGGT    Epithelial_cells    T   2
chr10   29559501    29559501    A   T   Epithelial_cells    2   GTCCTCAAGGCTCATT    Epithelial_cells    T   2
chr10   29559501    29559501    A   T   Epithelial_cells    2   GAGTCCGAGGGTGTTG    Epithelial_cells    A   2

The columns ALT_expected, Cell_type_expected and Num_cells_expected are taken from the --infile (here Example.calling.step2.pass.tsv), so they represent the calls at cell type resolution.

In contrast, the columns CB, Cell_type_observed, Base_observed and Num_reads report the alleles observed at single-cell resolution when interrogating the BAM files.

Each CB can appear in the output file in as many rows as there are different alleles found in that cell, although in most cases we observe only one allele per cell (so one row per unique CB). To find the cells harbouring the called mutation, look for the rows (unique CBs) where ALT_expected == Base_observed and Cell_type_expected == Cell_type_observed. In general terms, CBs that do not meet these conditions can be understood as noise or non-mutated cells.
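As a rough illustration of the filter described above, here is a minimal Python sketch using only the standard library. It embeds the five example rows shown earlier instead of reading the real file, and the function name mutated_rows is made up for this example. It also assumes Cell_type_expected holds a single cell type, so a plain equality check is enough.

```python
import csv
import io

# The five example rows from above (chr10-29559501), rebuilt as tab-separated text.
HEADER = ["#CHROM", "Start", "End", "REF", "ALT_expected", "Cell_type_expected",
          "Num_cells_expected", "CB", "Cell_type_observed", "Base_observed", "Num_reads"]
SITE = ["chr10", "29559501", "29559501", "A", "T", "Epithelial_cells", "2"]
OBSERVED = [
    ["AGTCTTTGTGCATCTA", "Epithelial_cells", "A", "5"],
    ["CCCTCCTAGGCTAGGT", "Epithelial_cells", "A", "1"],
    ["GGGTCTGTCTTGAGGT", "Epithelial_cells", "T", "2"],
    ["GTCCTCAAGGCTCATT", "Epithelial_cells", "T", "2"],
    ["GAGTCCGAGGGTGTTG", "Epithelial_cells", "A", "2"],
]
EXAMPLE_TSV = "\n".join(
    "\t".join(fields) for fields in [HEADER] + [SITE + obs for obs in OBSERVED]
)


def mutated_rows(handle):
    """Yield rows whose observed allele and cell type match the expected call.

    Hypothetical helper: assumes Cell_type_expected holds a single cell type.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        if (row["Base_observed"] == row["ALT_expected"]
                and row["Cell_type_observed"] == row["Cell_type_expected"]):
            yield row


mutated = list(mutated_rows(io.StringIO(EXAMPLE_TSV)))
mutated_barcodes = [row["CB"] for row in mutated]
print(mutated_barcodes)
```

For this site the two barcodes that pass the filter are the ones carrying T, which agrees with Num_cells_expected = 2. To run it on a real output file, pass an open file handle to mutated_rows instead of the io.StringIO object.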

Thanks, Fran

coraolpe commented 11 months ago

Hi Fran, thank you for the detailed explanation. That makes a lot of sense now. What would be your advice on how to use this info for plotting how many mutations are found in each individual cell?

Francesc-Muyas commented 11 months ago

You could do this by using R or Python.

The basic strategy would be:

  1. Count how many rows per cell meet the conditions ALT_expected == Base_observed and Cell_type_expected == Cell_type_observed. This is the number of mutations per cell.
  2. It is essential to account for the number of callable sites per cell, as it affects the number of mutations that can be detected. To apply this correction, compute the number of callable sites per cell using this functionality. You can then use these values to compute, for example, the mutation load per cell and Mb (# Mutations per cell / # Callable sites per cell).
  3. Plot the resulting values using your preferred software. I would ignore cells with a very low number of callable sites.
  4. Generally, you will see an enrichment of cells at 0. This is due to the low number of callable sites per cell in this type of approach (scRNA-seq) and a low mutation load (depending on the cancer or cell type).
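The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the rows and the callable-site counts are toy values made up for the example, the helper names (toy_row, mutations_per_cell, mutation_load_per_mb) are hypothetical, and the min_sites cutoff of 1000 is an arbitrary placeholder, not a recommended threshold. In practice the callable-site counts would come from the SComatic callable-sites output mentioned in step 2.

```python
from collections import Counter


def toy_row(cb, alt, base, expected="Epithelial_cells", observed="Epithelial_cells"):
    """Build a made-up genotype row holding only the columns used below."""
    return {"CB": cb, "ALT_expected": alt, "Base_observed": base,
            "Cell_type_expected": expected, "Cell_type_observed": observed}


# Toy data, invented for illustration only.
TOY_ROWS = [
    toy_row("CELL_A", "T", "T"),                      # mutated
    toy_row("CELL_A", "G", "G"),                      # mutated (second site)
    toy_row("CELL_B", "T", "A"),                      # reference allele observed
    toy_row("CELL_C", "C", "C"),                      # mutated, but few callable sites
    toy_row("CELL_D", "T", "T", observed="T_cells"),  # cell type mismatch
]
TOY_CALLABLE_SITES = {"CELL_A": 50000, "CELL_B": 20000, "CELL_C": 200, "CELL_D": 30000}


def mutations_per_cell(rows):
    """Step 1: count rows per barcode that match the expected allele and cell type."""
    counts = Counter()
    for row in rows:
        if (row["Base_observed"] == row["ALT_expected"]
                and row["Cell_type_observed"] == row["Cell_type_expected"]):
            counts[row["CB"]] += 1
    return counts


def mutation_load_per_mb(counts, callable_sites, min_sites=1000):
    """Steps 2-3: mutations per Mb of callable sites, skipping poorly covered cells."""
    load = {}
    for cb, n_sites in callable_sites.items():
        if n_sites < min_sites:
            continue  # too few callable sites to give a meaningful estimate
        load[cb] = counts.get(cb, 0) * 1e6 / n_sites
    return load


counts = mutations_per_cell(TOY_ROWS)
load = mutation_load_per_mb(counts, TOY_CALLABLE_SITES)
for cb, value in sorted(load.items()):
    print(cb, counts.get(cb, 0), value)
```

Cells with no matching rows are kept with a load of 0, which reproduces the enrichment at 0 mentioned in step 4. The resulting dictionary can be passed to any plotting library to draw a histogram.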

Cheers, Fran

coraolpe commented 11 months ago

Thank you so much! I will attempt to do this today :)