Closed riccardobergamin closed 8 months ago
Can you post the full code example, please?
library(rRACES)
library(ggplot2)
library(dplyr)
library(ggpubr)
library(pbapply)
library(easypar)
reference_url <- paste0("https://ftp.ensembl.org/pub/grch37/current/",
"fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.",
"dna.chromosome.22.fa.gz")
SBS_url <- paste0("https://cancer.sanger.ac.uk/signatures/documents/2123/",
"COSMIC_v3.4_SBS_GRCh37.txt")
drivers_url <- paste0("https://raw.githubusercontent.com/",
"caravagnalab/rRACES/main/inst/extdata/",
"driver_mutations_hg19.csv")
passenger_CNAs_url <- paste0("https://raw.githubusercontent.com/",
"caravagnalab/rRACES/main/inst/extdata/",
"passenger_CNAs_hg19.csv")
germline_url <- paste0("https://www.dropbox.com/scl/fi/g9oloxkip18tr1r",
"m6wjve/germline_data_demo.tar.gz?rlkey=15jshul",
"d3bqgyfcs7fa0bzqeo&dl=1")
m_engine <- build_mutation_engine(directory = "Test",
reference_src = reference_url,
SBS_src = SBS_url,
drivers_src = drivers_url,
passenger_CNAs_src = passenger_CNAs_url,
germline_src = germline_url)
m_engine$add_mutant(mutant_name = "A",
passenger_rates = list("+" = c(SNV = 8e-7),
"-" = c(SNV = 8e-7)),
driver_SNVs = c(), driver_CNAs = c())
m_engine$add_exposure(c(SBS1=0.2, SBS5=0.8))
sim <- new(Simulation,
seed = sample(x = 1:1000,size = 1),
save_snapshot = F)
sim$duplicate_internal_cells <- T
sim$update_tissue("Liver", 2e3, 2e3)
# plastic
sim$add_mutant(name = "A",
epigenetic_rates = c("+-" = 0.005, "-+" = 0.001),
growth_rates = c("+" = 2.3,"-" = 1),
death_rates = c("+" = 0.0, "-" = 0.0))
sim$place_cell("A+", 1e3, 1e3)
sim$run_up_to_size("A+",1e5)
n_w <- n_h <- 50
ncells <- 0.8*n_w*n_h
bbox <- sim$search_sample(c("A" = ncells), n_w, n_h)
sim$sample_cells("Sampling", bbox$lower_corner, bbox$upper_corner)
forest <- sim$get_samples_forest()
phylo_forest <- m_engine$place_mutations(forest,500)
seq_results <-
simulate_seq(
phylo_forest,
coverage = 150,
epi_FACS = T,
write_SAM = F
)
@riccardobergamin, the previous code does not reproduce the issue mentioned in the first comment. Can you provide a full example that does exactly what you described?
@riccardobergamin and @caravagn, I figured out that the above behavior is due to the unbalanced number of + and - cells.
When simulate_seq is called with epi_FACS = TRUE, each of the samples collected by sample_cells (in this case, only the sample "Sampling") is split into two sub-samples, {sample_name}_N and {sample_name}_P. The former contains all the cells in the original sample with epigenetic state -, while the latter contains those with epigenetic state +. The sequencing is then simulated independently over each sub-sample with the same coverage (in our case, 150x per sub-sample).
Let us assume that Sampling_N and Sampling_P contain 10 and 990 cells, respectively. If a mutation appears in a single allele of every cell in Sampling_N, then the column Sampling_N.VAF of that mutation will be 0.5 even if it does not appear in any of the 990 cells in Sampling_P.
You were probably expecting a VAF that takes into account both + and - cells.
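To make the arithmetic of the 10 vs. 990 cells example concrete, here is a minimal sketch in plain R. It assumes diploid cells with no copy-number changes and ignores sequencing noise; the helper expected_vaf is hypothetical and not part of rRACES.

```r
# Hypothetical helper (not part of rRACES): expected VAF of a mutation that is
# present on `mutated_alleles` alleles in each of `n_mutated` cells, in a
# sample of `n_cells` cells, each having `ploidy` alleles.
expected_vaf <- function(n_mutated, n_cells, mutated_alleles = 1, ploidy = 2) {
  (n_mutated * mutated_alleles) / (n_cells * ploidy)
}

n_N <- 10    # cells with epigenetic state "-" (Sampling_N)
n_P <- 990   # cells with epigenetic state "+" (Sampling_P)

# the mutation is heterozygous in all the "-" cells and absent from the "+" ones
vaf_N <- expected_vaf(n_mutated = n_N, n_cells = n_N)
vaf_P <- expected_vaf(n_mutated = 0, n_cells = n_P)
vaf_pooled <- expected_vaf(n_mutated = n_N, n_cells = n_N + n_P)

print(c(Sampling_N = vaf_N, Sampling_P = vaf_P, pooled = vaf_pooled))
```

Under these assumptions, the expected VAF is 0.5 in the - sub-sample and 0.005 in the pooled sample, which is the gap between what epi_FACS = TRUE reports and what one would expect from bulk sequencing.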
Should the coverage refer to the single sub-sample or to the whole set of collected samples? Should the number of reads per sub-sample preserve the ratio between the numbers of cells in the sub-samples? What if there are many samples (e.g., "Sampling" and "Sampling 2")?
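One of the options raised in these questions, splitting a total coverage among sub-samples in proportion to their cell counts, can be sketched as follows. This only illustrates the arithmetic and is not how simulate_seq behaves in the thread; the function split_coverage and the cell counts are hypothetical.

```r
# Hypothetical helper: split a total coverage among sub-samples so that each
# one receives a share proportional to its number of cells.
split_coverage <- function(total_coverage, cell_counts) {
  total_coverage * cell_counts / sum(cell_counts)
}

# made-up cell counts, taken from the 10 vs. 990 example
cell_counts <- c(Sampling_N = 10, Sampling_P = 990)

per_sub_sample <- split_coverage(150, cell_counts)
print(per_sub_sample)
```

With such an unbalanced sample, the - sub-sample would receive only 1.5x of the 150x total, so a proportional split trades comparability with bulk sequencing against very noisy VAF estimates in small sub-samples.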
Wait a second. Your implementation of +/- is correct.
@riccardobergamin, did you realise it was in the + population? I presume so, but I feel you still think that a VAF of 0.5 is too high, as the mutation should be subclonal in the + population?
@caravagn, I could not analyze the case depicted in the plot because the seed was randomly selected in the interval [1,1000].
However, I fixed the seed to 26, and, for any SNV whose VAF is greater than 0.3 in "Sampling-", I computed the ratio between the number of cells in "Sampling-" that have the SNV and all of them. The minimum among these ratios is 0.5, which seems acceptable considering the random coverage fluctuations.
I did the same for "Sampling+" and got 0.4388, which, once more, seems to be OK. You can find the full test code below.
@riccardobergamin, if you find a seed for which you find something different, please let me know.
library("rRACES")
library("dplyr")
# tissue evolution
sim <- new(Simulation, seed = 26, save_snapshot = FALSE)
sim$duplicate_internal_cells <- TRUE
sim$update_tissue("Liver", 2e3, 2e3)
sim$add_mutant(name = "A",
epigenetic_rates = c("+-" = 0.005, "-+" = 0.001),
growth_rates = c("+" = 2.3,"-" = 1),
death_rates = c("+" = 0.0, "-" = 0.0))
sim$place_cell("A+", 1e3, 1e3)
sim$run_up_to_size("A+", 1e5)
# sampling tissue
n_w <- n_h <- 50
ncells <- 0.8 * n_w * n_h
bbox <- sim$search_sample(c("A" = ncells), n_w, n_h)
sim$sample_cells("Sampling", bbox$lower_corner, bbox$upper_corner)
forest <- sim$get_samples_forest()
# placing mutations
m_engine <- build_mutation_engine(setup_code="demo")
m_engine$add_mutant(mutant_name = "A",
passenger_rates = list("+" = c(SNV = 8e-7),
"-" = c(SNV = 8e-7)),
driver_SNVs = c(), driver_CNAs = c())
m_engine$add_exposure(c(SBS1 = 0.2, SBS5 = 0.8))
phylo_forest <- m_engine$place_mutations(forest, 500)
# sequencing simulation
seq_results <-
simulate_seq(
phylo_forest,
coverage = 150,
epi_FACS = TRUE,
write_SAM = FALSE
)
# testing VAF
for (sub_sample in list(c("-","Sampling_N.VAF"), c("+","Sampling_P.VAF"))) {
# let us consider pre-neoplastic and passenger-only mutations in
# "Sampling{sub_sample[1]}" with VAF greater than 0.3
non_germ_large_VAF <- seq_results %>%
filter(!grepl("germinal", .data$classes), .data[[sub_sample[2]]] > 0.3)
# extract sampled cells in "Sampling{sub_sample[1]}"
sampled_cells <- phylo_forest$get_nodes() %>%
filter(epistate == sub_sample[1], !is.na(.data$sample))
# extract SNVs of sampled cells in "Sampling{sub_sample[1]}"
sampled_cell_SNVs <- unique(phylo_forest$get_sampled_cell_SNVs() %>%
filter(cell_id %in% sampled_cells[,1]))
min_ratio <- NA
for (row in seq_len(nrow(non_germ_large_VAF))) {
# build the snv associated with the VAF
snv <- SNV(chromosome = non_germ_large_VAF[row, "chromosome"],
pos_in_chr = non_germ_large_VAF[row, "chr_pos"],
alt = non_germ_large_VAF[row, "alt"],
ref = non_germ_large_VAF[row, "ref"])
# find the id of the cells having the snv
cell_ids <- (sampled_cell_SNVs %>%
filter(chromosome == non_germ_large_VAF[row, "chromosome"],
chr_pos == non_germ_large_VAF[row, "chr_pos"],
ref == non_germ_large_VAF[row, "ref"],
alt == non_germ_large_VAF[row, "alt"]))["cell_id"]
# compute the ratio between the number of sampled cells in
# "Sampling{sub_sample[1]}" that have the snv and all of them
ratio <- nrow(cell_ids) / nrow(sampled_cells)
# store the minimum among these ratios
if (is.na(min_ratio) || min_ratio > ratio) {
min_ratio <- ratio
}
}
print(paste0("The minimum ratio for \"", sub_sample[2], "\" is ", min_ratio))
}
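The "random coverage fluctuations" argument behind this test can be quantified with a rough binomial sketch. It assumes diploid cells, a heterozygous SNV (so the expected VAF is half the fraction of cells carrying it), and exactly 150 reads at the locus; real coverage also varies from locus to locus, so the result is only a ballpark figure. The helper prob_vaf_above is hypothetical and not part of rRACES.

```r
# Hypothetical helper: probability that the observed VAF exceeds `threshold`
# when a fraction `cell_ratio` of the cells carries a heterozygous SNV and the
# locus is covered by exactly `coverage` reads.
prob_vaf_above <- function(cell_ratio, threshold = 0.3, coverage = 150) {
  expected_vaf <- cell_ratio / 2
  # largest number of alternate reads whose VAF does not exceed the threshold
  # (the small epsilon guards against floating-point rounding)
  max_reads_below <- floor(threshold * coverage + 1e-9)
  # P(X > max_reads_below) for X ~ Binomial(coverage, expected_vaf)
  pbinom(max_reads_below, size = coverage, prob = expected_vaf,
         lower.tail = FALSE)
}

# minimum ratios reported in the thread for "Sampling-" and "Sampling+"
p_minus <- prob_vaf_above(0.5)
p_plus <- prob_vaf_above(0.4388)
print(c(p_minus = p_minus, p_plus = p_plus))
```

Since many SNVs are tested at once, even a small per-SNV tail probability makes it plausible that a few of them exceed the 0.3 threshold by chance.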
It's impossible to debug non-reproducible code.
@albertocasagrande, try with this example:
library(rRACES)
library(ggplot2)
library(dplyr)
reference_url <- paste0("https://ftp.ensembl.org/pub/grch37/current/",
"fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.",
"dna.chromosome.22.fa.gz")
SBS_url <- paste0("https://cancer.sanger.ac.uk/signatures/documents/2123/",
"COSMIC_v3.4_SBS_GRCh37.txt")
drivers_url <- paste0("https://raw.githubusercontent.com/",
"caravagnalab/rRACES/main/inst/extdata/",
"driver_mutations_hg19.csv")
passenger_CNAs_url <- paste0("https://raw.githubusercontent.com/",
"caravagnalab/rRACES/main/inst/extdata/",
"passenger_CNAs_hg19.csv")
germline_url <- paste0("https://www.dropbox.com/scl/fi/g9oloxkip18tr1r",
"m6wjve/germline_data_demo.tar.gz?rlkey=15jshul",
"d3bqgyfcs7fa0bzqeo&dl=1")
m_engine <- build_mutation_engine(directory = "Test",
reference_src = reference_url,
SBS_src = SBS_url,
drivers_src = drivers_url,
passenger_CNAs_src = passenger_CNAs_url,
germline_src = germline_url)
m_engine
m_engine$add_mutant(mutant_name = "A",
passenger_rates = c(SNV = 5e-8),
driver_SNVs = c(), driver_CNAs = c())
m_engine$add_exposure(c(SBS1=0.2, SBS5=0.8))
sim <- new(Simulation, "homogeneous_test",
seed = 1,
save_snapshot = F)
sim$duplicate_internal_cells <- T
sim$update_tissue("Liver", 2e3, 2e3)
sim$add_mutant(name = "A",
growth_rates = 2,
death_rates = 0)
sim$place_cell("A", 1000, 1000)
sim$run_up_to_size("A",1e4)
bbox = tibble(lower_corner = c(1000,1000),upper_corner = c(1050,1050))
sim$sample_cells("Sampling", bbox$lower_corner, bbox$upper_corner)
forest <- sim$get_samples_forest()
plot_forest(forest)
phylo_forest <- m_engine$place_mutations(forest,500)
sampled_snvs = phylo_forest$get_sampled_cell_SNVs() %>% as_tibble() %>%
dplyr::select(-cell_id) %>% unique() %>%
mutate(id = paste0(
"chr",
chromosome,
":",
chr_pos,
":",
ref,
":",
alt
))
seq_results <-
simulate_seq(
phylo_forest,
coverage = 100,
epi_FACS = F,
write_SAM = F
)
I found Sampling.VAF > 1:
print(seq_results)
seq_results = seq_results %>%
mutate(id = paste0(
"chr",
chromosome,
":",
chr_pos,
":",
ref,
":",
alt
))
seq_results = full_join(seq_results %>% as_tibble(), sampled_snvs %>% dplyr::select(cause,id),
by = "id") %>% filter(!is.na(cause),!is.na(chromosome))
seq_results %>% filter(Sampling.VAF > 0.4,cause != "Pre-neoplastic") %>% as_tibble()
phylo_forest$get_first_occurrences(SNV("22",19868218,"G","A"))
phylo_forest$get_nodes() %>% filter(cell_id == 18803)
plot_forest(forest)
Running the above code, I got
Error: chr22(19868218)[A>G] is a germinal mutation.
As far as the VAF > 1 issue is concerned, I don't get any such mutations:
> seq_results %>% filter(Sampling.VAF > 1)
# A tibble: 0 × 11
# ℹ 11 variables: chromosome <chr>, chr_pos <int>, ref <chr>, alt <chr>, causes <chr>, classes <chr>, Sampling.occurrences <int>, Sampling.coverage <int>, Sampling.VAF <dbl>, id <chr>, cause <chr>
That issue was supposed to be solved in 4a1b47c9d689f4d107000ddfc5cfa2560ea86982 a week ago. Have you updated rRACES to the latest GitHub version?
@albertocasagrande, I have a new case of strange sequencing. I found germline mutations at VAF = 1:
library(rRACES)
library(ggplot2)
library(dplyr)
reference_url <- paste0("https://ftp.ensembl.org/pub/grch37/current/",
"fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.",
"dna.chromosome.22.fa.gz")
SBS_url <- paste0("https://cancer.sanger.ac.uk/signatures/documents/2123/",
"COSMIC_v3.4_SBS_GRCh37.txt")
drivers_url <- paste0("https://raw.githubusercontent.com/",
"caravagnalab/rRACES/main/inst/extdata/",
"driver_mutations_hg19.csv")
passenger_CNAs_url <- paste0("https://raw.githubusercontent.com/",
"caravagnalab/rRACES/main/inst/extdata/",
"passenger_CNAs_hg19.csv")
germline_url <- paste0("https://www.dropbox.com/scl/fi/g9oloxkip18tr1r",
"m6wjve/germline_data_demo.tar.gz?rlkey=15jshul",
"d3bqgyfcs7fa0bzqeo&dl=1")
# build a mutation engine and place all the files in the directory "Test"
m_engine <- build_mutation_engine(directory = "Test",
reference_src = reference_url,
SBS_src = SBS_url,
drivers_src = drivers_url,
passenger_CNAs_src = passenger_CNAs_url,
germline_src = germline_url)
m_engine
m_engine$add_mutant(mutant_name = "A",
passenger_rates = c(SNV = 5e-8),
driver_SNVs = c(), driver_CNAs = c())
m_engine$add_exposure(c(SBS1=0.2, SBS5=0.8))
sim <- new(Simulation, "homogeneous_test",
seed = 1,
save_snapshot = F)
sim$duplicate_internal_cells <- T
sim$update_tissue("Liver", 2e3, 2e3)
sim$add_mutant(name = "A",
growth_rates = 2,
death_rates = 0)
sim$place_cell("A", 1000, 1000)
sim$run_up_to_size("A",1e4)
bbox = tibble(lower_corner = c(1000,1000),upper_corner = c(1050,1050))
sim$sample_cells("Sampling", bbox$lower_corner, bbox$upper_corner)
forest <- sim$get_samples_forest()
plot_forest(forest)
phylo_forest <- m_engine$place_mutations(forest,500)
seq_results <-
simulate_seq(
phylo_forest,
coverage = 100,
epi_FACS = F,
write_SAM = F
)
ggplot(seq_results) + geom_histogram(aes(x = Sampling.VAF))
seq_results %>% filter(Sampling.VAF > 0.8)
All the mutations having VAF=1 are germline and appear in both alleles. Please update to the latest GitHub version (it contains the method PhylogeneticForest$get_germline_SNVs()) and try to execute the following code.
high_vaf_mut <- seq_results %>% filter(Sampling.VAF == 1)
high_vaf_mut %>% filter(classes != "germinal")
germline_snvs <- phylo_forest$get_germline_SNVs() %>%
count(chromosome, chr_pos, ref, alt)
for (row in seq_len(nrow(high_vaf_mut))) {
in_cells <- germline_snvs %>%
filter(.data$chromosome == high_vaf_mut[row, "chromosome"],
.data$chr_pos == high_vaf_mut[row, "chr_pos"],
.data$ref == high_vaf_mut[row, "ref"],
.data$alt == high_vaf_mut[row, "alt"])
snv <- SNV(high_vaf_mut[row, "chromosome"], high_vaf_mut[row, "chr_pos"],
ref=high_vaf_mut[row, "ref"], alt=high_vaf_mut[row, "alt"])
if (in_cells[1,"n"]<2) {
print(paste0("The mutation num.", row," appears in one allele"))
}
}
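The same check can be written without an explicit loop. The sketch below uses base R on toy tables so that it runs without rRACES; all values are made up, and the layout (one row per allele carrying the SNV) is an assumption inferred from the counting logic in the loop, not a documented format of get_germline_SNVs().

```r
# Toy stand-ins for the real tables (all values are made up for illustration).
# Each row of `germline_snvs` is assumed to be one allele carrying the SNV.
germline_snvs <- data.frame(
  chromosome = c("22", "22", "22"),
  chr_pos    = c(100L, 100L, 200L),
  ref        = c("A", "A", "C"),
  alt        = c("G", "G", "T"),
  stringsAsFactors = FALSE
)
high_vaf_mut <- data.frame(
  chromosome = c("22", "22"),
  chr_pos    = c(100L, 200L),
  ref        = c("A", "C"),
  alt        = c("G", "T"),
  stringsAsFactors = FALSE
)

key <- c("chromosome", "chr_pos", "ref", "alt")

# count the alleles carrying each germline SNV
germline_snvs$n <- 1L
allele_counts <- aggregate(n ~ chromosome + chr_pos + ref + alt,
                           data = germline_snvs, FUN = sum)

# attach the counts to the high-VAF mutations; SNVs that are missing from the
# germline table get NA
checked <- merge(high_vaf_mut, allele_counts, by = key, all.x = TRUE)

# mutations that are not carried by at least two alleles
single_allele <- checked[is.na(checked$n) | checked$n < 2, ]
print(single_allele)
```

In this toy example only the SNV at position 200 is reported, because it is carried by a single allele; on real data, an empty result would support the claim that every VAF=1 mutation is homozygous germline.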
@riccardobergamin, I don't see any reason to keep this issue, so I am closing it.
Please feel free to open a new issue if any other suspicious behavior arises.
Occasionally, a mutation that arises late in the evolution has a VAF around 0.5, which is a bit unrealistic, since I am using very high coverage (150). For instance, considering
Searching in the forest, I find
where the birth time is quite high if we look at the forest.
Also, we are not imposing the infinite sites assumption, since the same mutation appeared in 2 different cell ids. Should we impose it? Typically, people assume it in the literature.
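Whether a simulated dataset violates the infinite sites assumption can be checked by counting how many independent first occurrences each mutation has. The sketch below uses a toy table with made-up values; in the thread this information comes from phylo_forest$get_first_occurrences(), whose return format is not shown here, so the data frame layout is an assumption.

```r
# Toy table of first occurrences (made-up values): one row per cell in which
# a mutation arose independently.
first_occurrences <- data.frame(
  mutation = c("chr22:100:A:G", "chr22:100:A:G", "chr22:200:C:T"),
  cell_id  = c(12L, 57L, 33L),
  stringsAsFactors = FALSE
)

# number of independent origins per mutation
n_origins <- table(first_occurrences$mutation)

# mutations that arose more than once violate the infinite sites assumption
recurrent <- names(n_origins)[n_origins > 1]
print(recurrent)
```

An empty result would mean that every mutation arose exactly once, which is what the infinite sites assumption requires.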