pblaney commented 4 years ago

Hi Akdes,

I have been working my way through the pipeline and have run into an issue that I have been trying to work through the past few days with no luck. I have no issue with all the preceding commands up to the runCaSpER() call:

casperObj <- CreateCasperObject(raw.data = geneExprMatrix,
                                annotation = annotationTable,
                                cytoband = hg38Cytoband,
                                loh = bafLoh,
                                loh.name.mapping = bafLohNameMapping,
                                control.sample.ids = controlName,
                                sequencing.type = "bulk",
                                method = "iterative",
                                cnv.scale = 3,
                                loh.scale = 3,
                                genomeVersion = "hg38",
                                log.transformed = FALSE
                                )

casperCnvCalls <- runCaSpER(object = casperObj,
                            removeCentromere = TRUE,
                            cytoband = casperObj@cytoband,
                            method = "iterative"
                            )

Which yields the following error: Error in if (object@segments$size[i] > pair_arm_sizes * (1/3)) object@segments$event_scale[i] <- "large_scale" : missing value where TRUE/FALSE needed

Based on the error, one or both of the values is NA but as I've worked back through the PerformSegmentationWithHMM() function I cannot find a reason why.

To ensure it wasn't a dependency issue, I ran the meningioma demo and had no issue. I then went through each portion of my data to make sure the classes lined up with the examples, especially the cytoband information, but still resulted in the same error.

Any suggestions? Patrick

akdess commented 4 years ago

Hi @pblaney , can you share your R object with me akdess [at] gmail. com ? I will be more helpful if I can see the data. Thanks!

akdess commented 4 years ago

Hi @pblaney,

object@loh.name.mapping must have the following column names: sample.name and loh.name

and hg38 cytoband should be generating using the following commands; http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz cytoband <- read.delim("cytoBand.txt", header=F) cytoband$V4 <-substring(cytoband$V4, 1, 1) start <- do.call(rbind, lapply(split(cytoband$V2, paste0(cytoband$V1, cytoband$V4)), min)) end <- do.call(rbind, lapply(split(cytoband$V3, paste0(cytoband$V1, cytoband$V4)), max)) cytoband <- data.frame(V1=gsub("p", "", gsub("q", "", rownames(start))), V2=start, V3=end, V4=rownames(start), stringsAsFactors=F) cytoband$V4[grep("q", cytoband$V4)] <- "q" cytoband$V4[grep("p", cytoband$V4)] <- "p" cytoband <- cytoband [ as.vector(sapply(c(1:22, "X"), function(x) which(cytoband$V1 %in% x))), ] rownames(cytoband) <- NULL

dstueckm commented 4 years ago

Hello,

I'm still experiencing issues when trying to run runCaSpER(), and looking at this issue I think my problem may be with the cytoband input into CreateCasperObject(). I can't find any documentation as to what the expected input for cytoband is, and your code for the creation of a cytoband (above, and in the Vignette) throws an error for me. Could you provide a cytoband object or state the format of what the cytoband object should be?

Thank you

akdess commented 4 years ago

I am sorry for the error. I have updated the tutorial.

I have included the data object in https://github.com/akdess/CaSpER/blob/master/data/hg38_cytoband.rda

And I used below code to generate cytoband data:

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz cytoband <- read.delim("cytoBand.txt", header=F) cytoband <- data.frame(V1=gsub("chr", "", cytoband[,1]), V2=cytoband[,2], V3=cytoband[,3], V4=substring(cytoband$V4, 1, 1), stringsAsFactors=F) start <- do.call(rbind, lapply(split(cytoband$V2, paste0(cytoband$V1, cytoband$V4)), min)) end <- do.call(rbind, lapply(split(cytoband$V3, paste0(cytoband$V1, cytoband$V4)), max)) cytoband <- data.frame(V1=gsub("p", "", gsub("q", "", rownames(start))), V2=start, V3=end, V4=rownames(start), stringsAsFactors=F) cytoband <- cytoband [as.vector(unlist(sapply(c(1:22, "X"), function(x) which(cytoband$V1 %in% x)))), ] cytoband$V4[grep("q", cytoband$V4)] <- "q" cytoband$V4[grep("p", cytoband$V4)] <- "p" rownames(cytoband) <- NULL

dstueckm commented 4 years ago

Thank you for the reply!

Unfortunately the cytoband data was not what was causing my issue, and I'm still stuck on this. I will email over a random sample of my object, and if you could take a look at this I would greatly appreciate it.

mijhamilt commented 4 years ago

Need help! Error with running runCaSpER:

object <- CreateCasperObject(raw.data=m.rpkm, sequencing.type="bulk", loh.name.mapping = loh.name.mapping,cnv.scale=3, loh.scale=3, control.sample.ids = c("n1","n2","n3"),annotation=annotation, method="iterative", loh=loh,cytoband=centromere, genomeVersion = "hg38") Performing Median Filtering... Scale:1... Performing Median Filtering... Scale:2... Performing Median Filtering... Scale:3... There were 24 warnings (use warnings() to see them)

final.objects <- runCaSpER(object, removeCentromere=T, cytoband=centromere, method="iterative") Performing recursive median filtering... Performing HMM segmentation... Error in DataFrame(..., check.names = FALSE) : different row counts implied by arguments In addition: There were 50 or more warnings (use warnings() to see the first 50)

Warnings look like- In runmed(x, filt$n) : 'k' is bigger than 'n'! Changing 'k' to 43

mijhamilt commented 4 years ago

I fixed the above. My annotation file had different dimension from my gene matrix. Now, I have a new error:

final.objects <- runCaSpER(object, removeCentromere=T, cytoband=cytoband, method="iterative") Performing recursive median filtering... Performing HMM segmentation... Processing cnv.scale:1 loh.scale:1... Error in value[[jvseq[[jjj]]]] : subscript out of bounds

akdess commented 4 years ago

Hi @mijhamilt thanks a lot for your input!

The error might be because you should do: names(loh) <- gsub(".baf", "", names(loh))

or because:

the "loh" list structure should only contain the samples in your loh.name.mapping. All the names(loh) should be in loh.name.mapping$loh.name.

Hope it helps. I will update the code and tutorial to make it clear. Thanks a lot!

mijhamilt commented 4 years ago

Thank you for your help! It is working 🙂

Three question:

In the attachment, I set samples n1,n2 and n3 as control samples. Why does n1 show an amplification?
I don't see where I am defining geno.rna. I can't run hits <- findOverlaps(geno.rna, ann.gr)
If I am getting NAs in my plotMUAndCooccurence(results), does that mean I have no MU and cooccurence?

thanks, Michael

From: Akdes Serin Harmanci notifications@github.com Sent: Thursday, April 23, 2020 9:41 PM To: akdess/CaSpER CaSpER@noreply.github.com Cc: Hamilton, Michael mihamilton@health.ucsd.edu; Mention mention@noreply.github.com Subject: Re: [akdess/CaSpER] Troubleshooting Error with runCaSpER (#12)

Hi @mijhamilthttps://github.com/mijhamilt thanks a lot for your input!

The error might be because you should do: names(loh) <- gsub(".baf", "", names(loh))

or because:

the "loh" list structure should only contain the samples in your loh.name.mapping. All the names(loh) should be in loh.name.mapping$loh.name.

Hope it helps. I will update the code and tutorial to make it clear. Thanks a lot!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/akdess/CaSpER/issues/12#issuecomment-618797943, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AN2GK5VCKRISRSDA5HNA4W3ROEKATANCNFSM4KUEAHWQ.

akdess commented 4 years ago

Hi @mijhamilt,

I am sorry but I can not see the attachment.

There was a typo in the tcga_BRCA.R demo code it should be rna. I will correct the demo R script. Thanks for letting me know. In meningioma demo R code I believe it is correct:

segment based summary

library(GenomicRanges) gamma <- 7 all.segments <- do.call(rbind, lapply(final.objects, function(x) x@segments)) segment.summary <- extractSegmentSummary (final.objects) loss <- segment.summary$all.summary.loss gain <- segment.summary$all.summary.gain loh <- segment.summary$all.summary.loh

loss.final <- loss[loss$count>=gamma, ] gain.final <- gain[gain$count>=gamma, ] loh.final <- loh[loh$count>=gamma, ]

gene based summary

all.summary<- rbind(loss.final, gain.final) colnames(all.summary) [2:4] <- c("Chromosome", "Start", "End") rna <- GRanges(seqnames = Rle(gsub("q", "", gsub("p", "", all.summary$Chromosome))), IRanges(all.summary$Start, all.summary$End))
ann.gr <- makeGRangesFromDataFrame(final.objects[[1]]@annotation.filt, keep.extra.columns = TRUE, seqnames.field="Chr") hits <- findOverlaps(rna, ann.gr) genes <- splitByOverlap(ann.gr, rna, "GeneSymbol") genes.ann <- lapply(genes, function(x) x[!(x=="")]) all.genes <- unique(final.objects[[1]]@annotation.filt[,2]) all.samples <- unique(as.character(final.objects[[1]]@segments$ID)) rna.matrix <- gene.matrix(seg=all.summary, all.genes=all.genes, all.samples=all.samples, genes.ann=genes.ann)

I am not sure about the NAs in plotMUAndCooccurence code. If you want you can send me your R data object by email: akdess [at] gmail com

Hope it helps.

mijhamilt commented 4 years ago

Thank you! I caught the the type also 🙂

Quick question......I need to do this pipeline for mouse. Would you have the mm10 generate genome_fasta_pileup_dir files and genome_list files. If not, is there a easy to generate them?

thank you, Michael

From: Akdes Serin Harmanci notifications@github.com Sent: Sunday, April 26, 2020 8:20 PM To: akdess/CaSpER CaSpER@noreply.github.com Cc: Hamilton, Michael mihamilton@health.ucsd.edu; Mention mention@noreply.github.com Subject: Re: [akdess/CaSpER] Troubleshooting Error with runCaSpER (#12)

Hi @mijhamilthttps://github.com/mijhamilt,

I am sorry but I can not see the attachment.

I think there was a typo in the code it should be rna. I will correct the demo R script. Thanks for letting me know.

I am not sure about the NAs in plotMUAndCooccurence code. If you want you can send me your R data object by email: akdess [at] gmail com

Hope it helps.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/akdess/CaSpER/issues/12#issuecomment-619689344, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AN2GK5RN7CZUL56OSWDB2YLROT2YRANCNFSM4KUEAHWQ.

akdess commented 4 years ago

Hi @mijhamilt, I have already created those files it should be in https://www.dropbox.com/s/z7lq8pint83o2dw/mm10.zip?dl=0

For mm10 I should make some changes to Casper code. If you send me an email akdess at gmail com I can send you the modified version of Casper for mm10. We will try to update Casper soon to be compatible with mm10. Thanks a lot!

mehlu22 commented 1 month ago

For my code here:

options(download.file.method = "wininet")

program to run Casper

library(data.table) library(CaSpER) library(edgeR)

sub <- read.csv('./cod_length.csv', row.names = 1) dat <- fread("SL1to2_raw.csv") dat$GeneID <- sub("\..*", "", dat$GeneID) sub <- sub <- sub[sub$ensm %in% dat$GeneID, ]

cytoband <- cytoband_hg38

Generate the LOH annotation

rannotation <- generateAnnotation(id_type = "ensembl_gene_id", genes = dat$GeneID, ishg19 = FALSE, centromere)

Extract gene IDs from the LOH annotation (adjust if necessary based on the structure)

loh_genes <- unique(rannotation$Gene) # Change 'id' if needed based on the structure of rannotation

Find the common genes between the gene length data, gene expression data, and LOH annotation

common_genes <- intersect(intersect(dat$GeneID, sub$ensm), loh_genes)

Filter the gene expression data to retain only the common genes

dat_filtered <- dat[dat$GeneID %in% common_genes, ]

Filter the gene length data (`sub`) to retain only the common genes

sub_filtered <- sub[sub$ensm %in% common_genes, ]

Optionally, filter the LOH annotation to keep only the common genes

rannotation_filtered <- rannotation[rannotation$Gene %in% common_genes, ] # Adjust field name based on rannotation structure

Ensure that the filtered data is aligned

cat("Number of common genes: ", length(common_genes), "\n")

List of paths

paths <- c("./SL1/", "./SL2/") folder_names <- basename(paths) folder_names <- gsub("/$", "", folder_names)

Define a function to read the output, unlist it, and assign name

read_and_name <- function(path) { result <- readBAFExtractOutput(path = path, sequencing.type = "bulk") unlisted_result <- result[[1]] return(unlisted_result) }

Use lapply to apply the function to each path

loh <- lapply(paths, read_and_name) names(loh) <- gsub(".baf", "", names(loh))

control.sample.ids = "PDA75"

loh.name.mapping <- data.frame(loh.name = folder_names, sample.name = folder_names)

dat_matrix <- as.matrix(data.frame(dat_filtered[!duplicated(dat_filtered$GeneID),], row.names = 1, check.names = F))

convert to rpkm

dat_matrix <- rpkm(dat_matrix, gene.length = sub_filtered$Length, log=T)

dat_matrix[is.na(dat_matrix)] <- 0

object <- CreateCasperObject(raw.data=dat_matrix, loh.name.mapping=loh.name.mapping, sequencing.type="bulk", cnv.scale=3, loh.scale=3, matrix.type="normalized", expr.cutoff=1, annotation=rannotation_filtered, method="iterative", loh=loh, filter="median",
control.sample.ids=control.sample.ids, cytoband=centromere, genomeVersion="hg38", log.transformed = T)

Pairwise comparison of scales from BAF and expression signals

final.objects <- runCaSpER(object, removeCentromere=T, cytoband=cytoband, method="iterative")

Harmonization and Summarization of CNV calls from multiple scales and from multiple pairwise comparison of BAF and Expression Signals

Large-Scale CNV Summarization.

We assign a large-scale CNV call to every chromosome arm for each of the N×M

pairwise scale comparisons. Next, for each chromosome arm, we ask whether the

large-scale CNV call is consistent among at least γ of the N×M large-scale CNV

calls. N denotes the index for the highest smoothing scale for expression signal.

M denotes the index for the highest smoothing scale for baf signal. thr represents

minimum percentage, 75% (at least 7 out of 9 scales), of consistent CNV calls

(Out of N×M comparisons of expression scales and BAF scales) while assigning a

final CNV (amp/del/neutral) call to a segment/gene/chromosome arm.

finalChrMat <- extractLargeScaleEvents (final.objects, thr=0.75)

Segment based CNV Summarization.

The segments-based summarization aims at generating a final set of CNV calls

for a final set of segments that are computed by comparison of scales. We first

compare the segments from different expression scales and generate the consistent

set of segments. For each segment in the final set, if there are more than γ

(default=6) consistent CNV calls among N×M CNV calls, we assign the consistent

CNV call to segment. When there is no consistency among the calls, we assign a

neutral CNV state to segment.

gamma <- 6 all.segments <- do.call(rbind, lapply(final.objects, function(x) x@segments)) segment.summary <- extractSegmentSummary (final.objects) loss <- segment.summary$all.summary.loss gain <- segment.summary$all.summary.gain loh <- segment.summary$all.summary.loh loss.final <- loss[loss$count>gamma, ] gain.final <- gain[gain$count>gamma, ] loh.final <- loh[loh$count>gamma, ]

Gene based CNV Summarization.

Similar to the large-scale summarization, we generate a matrix where rows are

the samples (cells) and columns are the genes. The matrix entry of 0 corresponds

to no alteration, 1 corresponds to amplification and -1 corresponds to deletion.

If an alteration is consistent in more than γ scale comparisons

(out of N×M comparisons), we report that alteration event for that sample.

all.summary<- rbind(loss.final, gain.final) colnames(all.summary) [2:4] <- c("Chromosome", "Start", "End") rna <- GRanges(seqnames = Rle(gsub("q", "", gsub("p", "", all.summary$Chromosome))), IRanges(all.summary$Start, all.summary$End))
ann.gr <- makeGRangesFromDataFrame(final.objects[[1]]@annotation.filt, keep.extra.columns = TRUE, seqnames.field="Chr") hits <- findOverlaps(geno.rna, ann.gr) genes <- splitByOverlap(ann.gr, geno.rna, "GeneSymbol") genes.ann <- lapply(genes, function(x) x[!(x=="")]) all.genes <- unique(final.objects[[1]]@annotation.filt[,2]) all.samples <- unique(as.character(final.objects[[1]]@segments$ID)) rna.matrix <- gene.matrix(seg=all.summary, all.genes=all.genes, all.samples=all.samples, genes.ann=genes.ann)

Visualization

plotHeatmap

obj <- final.objects[[9]] plotHeatmap(object=obj, fileName="heatmap.png",cnv.scale= 3, cluster_cols = F, cluster_rows = T, show_rownames = T, only_soi = T)

plotLargeScaleEvent:

plotLargeScaleEvent(object=obj, fileName="large.scale.events.png")

plotGEAndGT:

plotGEAndGT(chrMat=finalChrMat, genoMat=genoMat, fileName="RNASeqAndGT.png")

plotBAFAllSamples

plotBAFAllSamples (loh = obj@loh.median.filtered.data, fileName="LOHAllSamples.png")

plotBAFOneSample:

plotBAFOneSample (object, fileName="LOHPlotsAllScales.pdf")

plotBAFInSeperatePages

plotBAFInSeperatePages (loh=obj@loh.median.filtered.data, folderName="LOHPlots")

plotGEAndBAFOneSample:

plotGEAndBAFOneSample (object=obj, cnv.scale=3, loh.scale=3, sample= "MN-5")

Visualization options specific for single-cell RNA-Seq data

plotSingleCellLargeScaleEventHeatmap(finalChrMat, sampleName="MGH31", chrs=c("5p", "14q"))

plotMUAndCooccurence:

calculate significant mutual exclusive and co-occurent events

results <- extractMUAndCooccurence (finalChrMat, loh, loh.name.mapping)

visualize mutual exclusive and co-occurent events

plotMUAndCooccurence (results)

It is showing this error. Please help:

final.objects <- runCaSpER(object, removeCentromere=T, cytoband=cytoband, method="iterative") Performing recursive median filtering... Performing HMM segmentation... Processing cnv.scale:1 loh.scale:1... Error in value[[jvseq[[jjj]]]] : subscript out of bounds

akdess / CaSpER

Troubleshooting Error with runCaSpER #12

segment based summary

gene based summary

options(download.file.method = "wininet")

program to run Casper

Generate the LOH annotation

Extract gene IDs from the LOH annotation (adjust if necessary based on the structure)

Find the common genes between the gene length data, gene expression data, and LOH annotation

Filter the gene expression data to retain only the common genes

Filter the gene length data (sub) to retain only the common genes

Optionally, filter the LOH annotation to keep only the common genes

Ensure that the filtered data is aligned

List of paths

Define a function to read the output, unlist it, and assign name

Use lapply to apply the function to each path

convert to rpkm

dat_matrix[is.na(dat_matrix)] <- 0

Pairwise comparison of scales from BAF and expression signals

Harmonization and Summarization of CNV calls from multiple scales and from multiple pairwise comparison of BAF and Expression Signals

Large-Scale CNV Summarization.

We assign a large-scale CNV call to every chromosome arm for each of the N×M

pairwise scale comparisons. Next, for each chromosome arm, we ask whether the

large-scale CNV call is consistent among at least γ of the N×M large-scale CNV

calls. N denotes the index for the highest smoothing scale for expression signal.

M denotes the index for the highest smoothing scale for baf signal. thr represents

minimum percentage, 75% (at least 7 out of 9 scales), of consistent CNV calls

(Out of N×M comparisons of expression scales and BAF scales) while assigning a

final CNV (amp/del/neutral) call to a segment/gene/chromosome arm.

Segment based CNV Summarization.

The segments-based summarization aims at generating a final set of CNV calls

for a final set of segments that are computed by comparison of scales. We first

compare the segments from different expression scales and generate the consistent

set of segments. For each segment in the final set, if there are more than γ

(default=6) consistent CNV calls among N×M CNV calls, we assign the consistent

CNV call to segment. When there is no consistency among the calls, we assign a

neutral CNV state to segment.

Gene based CNV Summarization.

Similar to the large-scale summarization, we generate a matrix where rows are

the samples (cells) and columns are the genes. The matrix entry of 0 corresponds

to no alteration, 1 corresponds to amplification and -1 corresponds to deletion.

If an alteration is consistent in more than γ scale comparisons

(out of N×M comparisons), we report that alteration event for that sample.

Visualization

plotHeatmap

plotLargeScaleEvent:

plotGEAndGT:

plotBAFAllSamples

plotBAFOneSample:

plotBAFInSeperatePages

plotGEAndBAFOneSample:

Visualization options specific for single-cell RNA-Seq data

plotMUAndCooccurence:

calculate significant mutual exclusive and co-occurent events

visualize mutual exclusive and co-occurent events

Filter the gene length data (`sub`) to retain only the common genes