some issue with the sequencing

riccardobergamin commented 9 months ago

Occasionaly It happens that some mutation that arises late in the evolution has a vaf around 0.5, which is bit unrealistic, since i am using very high coverage (150). For instance, considering

Searching in the forest i find

where the birth time is quite high if we look at the forest

Also we are not imposing the infinite sites assumption, since the same mutation appeared in 2 different cell ids. Should we impose it? Tipically people assume it in literature.

albertocasagrande commented 9 months ago

Can you post the full code example, please?

riccardobergamin commented 9 months ago

 library(rRACES)
  library(ggplot2)
  library(dplyr)
  library(ggpubr)
  library(pbapply)
  library(easypar)

 reference_url <- paste0("https://ftp.ensembl.org/pub/grch37/current/",
                          "fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.",
                          "dna.chromosome.22.fa.gz")

  SBS_url <- paste0("https://cancer.sanger.ac.uk/signatures/documents/2123/",
                    "COSMIC_v3.4_SBS_GRCh37.txt")

  drivers_url <- paste0("https://raw.githubusercontent.com/",
                        "caravagnalab/rRACES/main/inst/extdata/",
                        "driver_mutations_hg19.csv")

  passenger_CNAs_url <- paste0("https://raw.githubusercontent.com/",
                               "caravagnalab/rRACES/main/inst/extdata/",
                               "passenger_CNAs_hg19.csv")

  germline_url <- paste0("https://www.dropbox.com/scl/fi/g9oloxkip18tr1r",
                         "m6wjve/germline_data_demo.tar.gz?rlkey=15jshul",
                         "d3bqgyfcs7fa0bzqeo&dl=1")

  m_engine <- build_mutation_engine(directory = "Test",
                                    reference_src = reference_url,
                                    SBS_src = SBS_url,
                                    drivers_src = drivers_url,
                                    passenger_CNAs_src = passenger_CNAs_url,
                                    germline_src = germline_url)

  m_engine$add_mutant(mutant_name = "A",
                      passenger_rates = list("+" = c(SNV = 8e-7), 
                                             "-" = c(SNV = 8e-7)),
                      driver_SNVs = c(), driver_CNAs = c())

  m_engine$add_exposure(c(SBS1=0.2, SBS5=0.8))

   sim <- new(Simulation,
               seed = sample(x = 1:1000,size = 1),
               save_snapshot = F)

    sim$duplicate_internal_cells <- T

    sim$update_tissue("Liver", 2e3, 2e3)

    # plastic 

    sim$add_mutant(name = "A",
                   epigenetic_rates = c("+-" = 0.005, "-+" = 0.001),
                   growth_rates = c("+" = 2.3,"-" =  1),
                   death_rates = c("+" = 0.0, "-" = 0.0))
    sim$place_cell(starting_cell, 1e3, 1e3)

    sim$run_up_to_size("A+",1e5)

    n_w <- n_h <- 50
    ncells <- 0.8*n_w*n_h
    bbox <- sim$search_sample(c("A" = ncells), n_w, n_h)

    sim$sample_cells("Sampling", bbox$lower_corner, bbox$upper_corner)

    forest <- sim$get_samples_forest()

   phylo_forest <- m_engine$place_mutations(forest,500)

  seq_results <-
    simulate_seq(
      phylo_forest,
      coverage = 150,
      epi_FACS =  T,
      write_SAM = F
    )

albertocasagrande commented 9 months ago

@riccardobergamin, the previous code does not produce the issue mentioned by the first comment. Can you provide a full example that does exactly what you wrote?

albertocasagrande commented 9 months ago

@riccardobergamin and @caravagn, I figured out that the above behavior is due to the unbalanced number of + and - cells.

When seq_results is called with epi_FACS = TRUE, each of the samples collected by sample_cells (in this case, only the sample "Sampling") is split into two sub-samples {sample name}_N and {sample_name}_P. The former contains all the cells in the original sample with epigenetic state -, while the latter those with epigenetic state +. The sequencing is then independently simulated over all the sub-samples with the same coverage (in our case, 150x per sub-sample).

Let us assume that Sampling_N and Sampling_P contain 10 and 990 cells, respectively. If a mutation appears in all of the cells in Sampling_N in a single allele, then the column Sampling_N.VAF of that mutation will be 0.5 even if it does not appear in any of the 990 cells in Sampling_P.

You probably were expecting a VAF taking into account both + and - cells.

Should the coverage refers to the single sub-sample or the whole set of collected sample? Should the number of reads per sub-sample preserve the ratio between the number of cells in the sub-samples? What there are many samples (e.g., "Sampling" and "Sampling 2")?

caravagn commented 9 months ago

Wait a second. Your implementation of +/- is correct.

@riccardobergamin did you realise it was in the + population? I presume yes but, I feel, you still think that VAF .5 is too high as the mutation should be subclonal in the +?

307884983-75593c14-73fc-4a48-8614-97a78249bde8

albertocasagrande commented 9 months ago

@caravagn, I could not analyze the case depicted in the plot because the seed was randomly selected in the interval [1,1000].

However, I fixed the seed to 26, and, for any SNV whose VAF is greater than 0.3 in "Sampling-", I computed the ratio between the number of cells in "Sampling-" that have the SNV and all of them. The minimum among these ratios is 0.5, which seems acceptable considering the random coverage fluctuations.

I did the same for "Sampling+," and I got 0.4388, which, once more, seems to be ok. You can find the full test code below.

@riccardobergamin, if you find a seed for which you find something different, please let me know.

library("rRACES")
library("dplyr")

# tissue evolution

sim <- new(Simulation, seed = 26, save_snapshot = FALSE)

sim$duplicate_internal_cells <- TRUE

sim$update_tissue("Liver", 2e3, 2e3)

sim$add_mutant(name = "A",
               epigenetic_rates = c("+-" = 0.005, "-+" = 0.001),
               growth_rates = c("+" = 2.3,"-" =  1),
               death_rates = c("+" = 0.0, "-" = 0.0))
sim$place_cell("A+", 1e3, 1e3)

sim$run_up_to_size("A+", 1e5)

# sampling tissue

n_w <- n_h <- 50
ncells <- 0.8 * n_w * n_h
bbox <- sim$search_sample(c("A" = ncells), n_w, n_h)

sim$sample_cells("Sampling", bbox$lower_corner, bbox$upper_corner)

forest <- sim$get_samples_forest()

# placing mutations

m_engine <- build_mutation_engine(setup_code="demo")

m_engine$add_mutant(mutant_name = "A",
                    passenger_rates = list("+" = c(SNV = 8e-7),
                                           "-" = c(SNV = 8e-7)),
                    driver_SNVs = c(), driver_CNAs = c())

m_engine$add_exposure(c(SBS1 = 0.2, SBS5 = 0.8))

phylo_forest <- m_engine$place_mutations(forest, 500)

# sequencing simulation

seq_results <-
  simulate_seq(
    phylo_forest,
    coverage = 150,
    epi_FACS =  TRUE,
    write_SAM = FALSE
  )

# testing VAF

for (sub_sample in list(c("-","Sampling_N.VAF"), c("+","Sampling_P.VAF"))) {

  # let us consider pre-neoplastic and passenger-only mutations in "Sampling-"
  # with VAF greater than 0.3
  non_germ_large_VAF <- seq_results %>%
    filter(!grepl("germinal", .data$classes), .data[[sub_sample[2]]] > 0.3)

  # extract sampled cells in "Sampling{sub_sample[1]}"
  sampled_cells <- phylo_forest$get_nodes() %>%
    filter(epistate == sub_sample[1], !is.na(.data$sample))

  # extract SNVs of sampled cells in "Sampling-"
  sampled_cell_SNVs <- unique(phylo_forest$get_sampled_cell_SNVs() %>%
                                filter(cell_id %in% sampled_cells[,1]))

  min_ratio <- NA
  for (row in seq_len(nrow(non_germ_large_VAF))) {
    # build the snv associated with the VAF
    snv <- SNV(chromosome = non_germ_large_VAF[row, "chromosome"],
               pos_in_chr = non_germ_large_VAF[row, "chr_pos"],
               alt = non_germ_large_VAF[row, "alt"],
               ref = non_germ_large_VAF[row, "ref"])

    # find the id of the cells having the snv
    cell_ids <- (sampled_cell_SNVs %>%
                   filter(chromosome == non_germ_large_VAF[row, "chromosome"],
                          chr_pos == non_germ_large_VAF[row, "chr_pos"],
                          ref == non_germ_large_VAF[row, "ref"],
                          alt == non_germ_large_VAF[row, "alt"]))["cell_id"]

    # compute the ratio between the number of cells in the sampled cells in
    # "Sampling-" that have the snv and all of them
    ratio <- nrow(cell_ids) / nrow(sampled_cells)

    # store the minimum among these ratios
    if (is.na(min_ratio) || min_ratio > ratio) {
      min_ratio <- ratio
    }
  }

  print(paste0("The minimum ratio for \"", sub_sample[2], "\" is ", min_ratio))

}

caravagn commented 9 months ago

It's impossible to debug non-reproducible code..

riccardobergamin commented 9 months ago

@albertocasagrande, try with this example:

library(rRACES) 
library(ggplot2)
library(dplyr)

reference_url <- paste0("https://ftp.ensembl.org/pub/grch37/current/",
                        "fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.",
                        "dna.chromosome.22.fa.gz")

SBS_url <- paste0("https://cancer.sanger.ac.uk/signatures/documents/2123/",
                  "COSMIC_v3.4_SBS_GRCh37.txt")

drivers_url <- paste0("https://raw.githubusercontent.com/",
                      "caravagnalab/rRACES/main/inst/extdata/",
                      "driver_mutations_hg19.csv")

passenger_CNAs_url <- paste0("https://raw.githubusercontent.com/",
                             "caravagnalab/rRACES/main/inst/extdata/",
                             "passenger_CNAs_hg19.csv")

germline_url <- paste0("https://www.dropbox.com/scl/fi/g9oloxkip18tr1r",
                       "m6wjve/germline_data_demo.tar.gz?rlkey=15jshul",
                       "d3bqgyfcs7fa0bzqeo&dl=1")

m_engine <- build_mutation_engine(directory = "Test",
                                  reference_src = reference_url,
                                  SBS_src = SBS_url,
                                  drivers_src = drivers_url,
                                  passenger_CNAs_src = passenger_CNAs_url,
                                  germline_src = germline_url)

m_engine

m_engine$add_mutant(mutant_name = "A",
                    passenger_rates = c(SNV = 5e-8),
                    driver_SNVs = c(), driver_CNAs = c())

m_engine$add_exposure(c(SBS1=0.2, SBS5=0.8))

sim <- new(Simulation, "homogeneous_test",
           seed = 1,
           save_snapshot = F)

sim$duplicate_internal_cells <- T

sim$update_tissue("Liver", 2e3, 2e3)

sim$add_mutant(name = "A",
               growth_rates = 2,
               death_rates = 0)

sim$place_cell("A", 1000, 1000)

sim$run_up_to_size("A",1e4)

bbox = tibble(lower_corner = c(1000,1000),upper_corner = c(1050,1050))

sim$sample_cells("Sampling", bbox$lower_corner, bbox$upper_corner)

forest <- sim$get_samples_forest()

plot_forest(forest)

phylo_forest <- m_engine$place_mutations(forest,500)

sampled_snvs = phylo_forest$get_sampled_cell_SNVs() %>% as_tibble() %>% 
  dplyr::select(-cell_id) %>% unique() %>% 
  mutate(id = paste0(
    "chr",
    chromosome,
    ":",
    chr_pos,
    ":",
    ref,
    ":",
    alt
  ))

seq_results <-
  simulate_seq(
    phylo_forest,
    coverage = 100,
    epi_FACS =  F,
    write_SAM = F
  )

i found Sampling.VAF > 1

print(seq_results)

seq_results = seq_results %>% 
  mutate(id = paste0(
    "chr",
    chromosome,
    ":",
    chr_pos,
    ":",
    ref,
    ":",
    alt
  ))

seq_results = full_join(seq_results %>% as_tibble(), sampled_snvs %>% dplyr::select(cause,id),
                        by = "id") %>% filter(!is.na(cause),!is.na(chromosome))

seq_results  %>% filter(Sampling.VAF > 0.4,cause != "Pre-neoplastic") %>% as_tibble()

phylo_forest$get_first_occurrences(SNV("22",19868218,"G","A"))

phylo_forest$get_nodes() %>% filter(cell_id == 18803)

plot_forest(forest)

![Uploading Screenshot 2024-03-04 alle 17.52.25.png…]()

albertocasagrande commented 8 months ago

Running the above code, I got

Error: chr22(19868218)[A>G] is a germinal mutation.

As far as the VAF>1 issue is concerned, I don't get any of them

> seq_results  %>% filter(Sampling.VAF > 1)
# A tibble: 0 × 11
# ℹ 11 variables: chromosome <chr>, chr_pos <int>, ref <chr>, alt <chr>, causes <chr>, classes <chr>, Sampling.occurrences <int>, Sampling.coverage <int>, Sampling.VAF <dbl>, id <chr>, cause <chr>

That issue was supposed to be solved in 4a1b47c9d689f4d107000ddfc5cfa2560ea86982 a week ago. Have you updated rRACES to the last github version?

riccardobergamin commented 8 months ago

@albertocasagrande , i have a new case of strange sequencing. I found germline at VAF = 1:

library(rRACES) 
library(ggplot2)
library(dplyr)

reference_url <- paste0("https://ftp.ensembl.org/pub/grch37/current/",
                        "fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.",
                        "dna.chromosome.22.fa.gz")

SBS_url <- paste0("https://cancer.sanger.ac.uk/signatures/documents/2123/",
                  "COSMIC_v3.4_SBS_GRCh37.txt")

drivers_url <- paste0("https://raw.githubusercontent.com/",
                      "caravagnalab/rRACES/main/inst/extdata/",
                      "driver_mutations_hg19.csv")

passenger_CNAs_url <- paste0("https://raw.githubusercontent.com/",
                             "caravagnalab/rRACES/main/inst/extdata/",
                             "passenger_CNAs_hg19.csv")

germline_url <- paste0("https://www.dropbox.com/scl/fi/g9oloxkip18tr1r",
                       "m6wjve/germline_data_demo.tar.gz?rlkey=15jshul",
                       "d3bqgyfcs7fa0bzqeo&dl=1")

# build a mutation engine and place all the files in the directory "Test"
m_engine <- build_mutation_engine(directory = "Test",
                                  reference_src = reference_url,
                                  SBS_src = SBS_url,
                                  drivers_src = drivers_url,
                                  passenger_CNAs_src = passenger_CNAs_url,
                                  germline_src = germline_url)

m_engine

m_engine$add_mutant(mutant_name = "A",
                    passenger_rates = c(SNV = 5e-8),
                    driver_SNVs = c(), driver_CNAs = c())

m_engine$add_exposure(c(SBS1=0.2, SBS5=0.8))

sim <- new(Simulation, "homogeneous_test",
           seed = 1,
           save_snapshot = F)

sim$duplicate_internal_cells <- T

sim$update_tissue("Liver", 2e3, 2e3)

sim$add_mutant(name = "A",
               growth_rates = 2,
               death_rates = 0)

sim$place_cell("A", 1000, 1000)

sim$run_up_to_size("A",1e4)

bbox = tibble(lower_corner = c(1000,1000),upper_corner = c(1050,1050))

sim$sample_cells("Sampling", bbox$lower_corner, bbox$upper_corner)

forest <- sim$get_samples_forest()

plot_forest(forest)

phylo_forest <- m_engine$place_mutations(forest,500)

seq_results <-
  simulate_seq(
    phylo_forest,
    coverage = 100,
    epi_FACS =  F,
    write_SAM = F
  )

ggplot(seq_results) + geom_histogram(aes(x = Sampling.VAF))
seq_results %>% filter(Sampling.VAF > 0.8)

albertocasagrande commented 8 months ago

All the mutations having VAF=1 are germline and appear in both alleles. Please update to the last GitHub version (it contains the method PhylgeneticForest$get_germline_SNVs()) and try to execute the following code.

high_vaf_mut <- seq_results %>% filter(Sampling.VAF == 1)

high_vaf_mut %>% filter(classes != "germinal")

germline_snvs <- phylo_forest$get_germline_SNVs() %>%
                   count(chromosome, chr_pos, ref, alt)

for (row in seq_len(nrow(high_vaf_mut))) {
  in_cells <- germline_snvs %>%
                filter(.data$chromosome == high_vaf_mut[row, "chromosome"],
                       .data$chr_pos == high_vaf_mut[row, "chr_pos"],
                       .data$ref == high_vaf_mut[row, "ref"],
                       .data$alt == high_vaf_mut[row, "alt"])
  snv <- SNV(high_vaf_mut[row, "chromosome"], high_vaf_mut[row, "chr_pos"],
                       ref=high_vaf_mut[row, "ref"], alt=high_vaf_mut[row, "alt"])

  if (in_cells[1,"n"]<2) {
    print(paste0("The mutation num.", row," appears in one allele"))
  }
}

albertocasagrande commented 8 months ago

@riccardobergamin, I don't see any reason to keep this issue, so I am closing it.

Please feel free to open a new issue if any other suspect behavior arises.

caravagnalab / rRACES

some issue with the sequencing #81