bioinf-jku / panelcn.mops

CNV detection tool for targeted NGS panel data
16 stars 9 forks source link

Problem plotting exons from a certain candidate gene #20

Open Yatros opened 3 years ago

Yatros commented 3 years ago

Hello,

First of all, thank you very much for developing panelcn.mops. It is a really helpful tool.

I am having a problem when plotting all exons of a candidate gene from a panel. The problem is basically that if the gene is in the forward strand, the first exon does not show up in the plot and if the gene is in the reverse strand, the last exon of the gene is not plotted. My bed file is sorted by chr, position, so basically the plotBoxplot() command is not plotting the first exon that appears in the bed file.

However, the results table created with the createResultTable() command shows all the exons from each candidate gene.

Here you have the code that I have used to analyze my gene panel so that you can figure out if I have made any errors:

# Load library
library(panelcn.mops)

# Read input files
bed <- "Genes_with_exon_coordinates_sorted_31bp_flanking_sides.bed"
countWindows <- getWindows(bed)

# Sample to analyze
Sample_X_bam <- "BAMs/Sample_X.bam"

Sample_X <- countBamListInGRanges(countWindows = countWindows,
                                  bam.files = Sample_X_bam, read.width = 150) 

# Control samples
Sample_A_bam <- "BAMs/Sample_A.bam"
Sample_B_bam <- "BAMs/Sample_B.bam"
Sample_C_bam <- "BAMs/Sample_C.bam"
Sample_D_bam <- "BAMs/Sample_D.bam"
Sample_E_bam <- "BAMs/Sample_E.bam"

Sample_A <- countBamListInGRanges(countWindows = countWindows,
                                 bam.files = Sample_A_bam, read.width = 150)
Sample_B <- countBamListInGRanges(countWindows = countWindows,
                                 bam.files = Sample_B_bam, read.width = 150)
Sample_C <- countBamListInGRanges(countWindows = countWindows,
                                 bam.files = Sample_C_bam, read.width = 150)
Sample_D <- countBamListInGRanges(countWindows = countWindows,
                                 bam.files = Sample_D_bam, read.width = 150)
Sample_E <- countBamListInGRanges(countWindows = countWindows,
                                 bam.files = Sample_E_bam, read.width = 150)

# [...] I have 109 controls in my dataset

controls <- countBamListInGRanges(countWindows = countWindows,
                                  bam.files = c(Sample_A_bam, Sample_B_bam, Sample_C_bam, Sample_D_bam, 
                                                Sample_E_bam),
                                  read.width = 150)

# The gene panel contains 26 genes, but I am only analyzing 4 of them
selectedGenes <- c("GENE_1", "GENE_2", "GENE_3", "GENE_4")

XandCB <- Sample_X

elementMetadata(XandCB) <- cbind(elementMetadata(XandCB),
                                 elementMetadata(controls))

resultlist <- runPanelcnMops(XandCB, countWindows = countWindows,
                             selectedGenes = selectedGenes)

# Although the countWindows contains 548 obs. of 6 variables, the resultlist identifies only 495 Iranges

(str(resultlist[[1]]))

integerCopyNumber(resultlist[[1]])[1:5]

sampleNames <- colnames(elementMetadata(Sample_X))

resulttable <- createResultTable(resultlist = resultlist, XandCB = XandCB,
                                 countWindows = countWindows,
                                 selectedGenes = selectedGenes,
                                 sampleNames = sampleNames)

# This table shows all exons from the 4 genes
resulttable[[1]]

# GENE_1 - Located in the reverse strand - Does not plot last exon
plotBoxplot(result = resultlist[[1]], sampleName = sampleNames[1],
            countWindows = countWindows,
            selectedGenes = selectedGenes, showGene = 1)

# GENE_2 - Located in the forward strand - Does not plot first exon
plotBoxplot(result = resultlist[[1]], sampleName = sampleNames[1],
            countWindows = countWindows,
            selectedGenes = selectedGenes, showGene = 2)

# GENE_3 - Located in the forward strand - Does not plot first exon
plotBoxplot(result = resultlist[[1]], sampleName = sampleNames[1],
            countWindows = countWindows,
            selectedGenes = selectedGenes, showGene = 3)

# GENE_4 - Located in the reverse strand - Does not plot last exon
plotBoxplot(result = resultlist[[1]], sampleName = sampleNames[1],
            countWindows = countWindows,
            selectedGenes = selectedGenes, showGene = 4)

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] panelcn.mops_1.9.0   cn.mops_1.36.0       GenomicRanges_1.42.0 GenomeInfoDb_1.26.2  IRanges_2.24.0      
[6] S4Vectors_0.28.1     BiocGenerics_0.36.0 

loaded via a namespace (and not attached):
 [1] compiler_4.0.3         BiocManager_1.30.10    XVector_0.30.0         remotes_2.2.0         
 [5] prettyunits_1.1.1      bitops_1.0-6           tools_4.0.3            zlibbioc_1.36.0       
 [9] testthat_3.0.0         digest_0.6.27          pkgbuild_1.1.0         pkgload_1.1.0         
[13] lifecycle_0.2.0        memoise_1.1.0          rlang_0.4.9            cli_2.2.0             
[17] rstudioapi_0.13        curl_4.3               GenomeInfoDbData_1.2.4 withr_2.3.0           
[21] fs_1.5.0               Biostrings_2.58.0      desc_1.2.0             devtools_2.3.2        
[25] rprojroot_2.0.2        glue_1.4.2             Biobase_2.50.0         R6_2.5.0              
[29] processx_3.4.5         fansi_0.4.1            exomeCopy_1.36.0       BiocParallel_1.24.1   
[33] sessioninfo_1.1.1      purrr_0.3.4            magrittr_2.0.1         callr_3.5.1           
[37] usethis_2.0.0          Rsamtools_2.6.0        ps_1.5.0               ellipsis_0.3.1        
[41] assertthat_0.2.1       RCurl_1.98-1.2         crayon_1.3.4     

Graphical example:

Analysis performed with +/- 31 bp flanking region for each exon. In this case Ex12 is missing from the plot:

image

image

I have continued doing tests and I have tried to sort the bed file according to transcriptional order (Ex1, Ex2, Ex3, ...) to see if the function plotBoxplot() would plot the Ex1 for the genes from the reverse strand. The analysis is performed in the right way when you visualize it with the resulttable <- createResultTable() function. This analysis was performed without the +/- 31 bp flanking region for each exon. Therefore the exon sizes and plots are slightly different, but the concept is the same one:

image

The plotBoxplot() function does not plot the Ex1 for the reordered exons, as expected.

Additionally, I have found a potential "bug" in the plotBoxplot() function. It reverts the X-axis labels according to the new bed file (Ex1, Ex2, Ex3, Ex4,....) , but the boxplots stay in the same order, that is in chromosomal order (Ex12, Ex11, Ex10, Ex9, ...), which would generate an erroneous plot:

image

Now the sample would have an Ex7-Ex10 deletion, instead of the real Ex3-Ex6 deletion that is shown in both tables.

In order for the plotBoxplot() function to work properly, the bed file needs to be sorted in chromosomal order. If not, you can get an erroneous plot, even when the analysis and the table show the correct results.

Am I doing something wrong? How can I plot the first exon of each gene?

I have even tried to add the UTR region to the bed file to have a first record for each gene before the first exon, but that does not work either.

Does anybody have had these issues before?

Thank you very much,

Best Regards,

Yatros commented 3 years ago

Hello,

I have tried to perform an additional test with the plotBoxplot() function. In this case, I have tried to plot the results from two different genes (GENE_1 located in the minus strand of chr6 and GENE_2 located in the plus strand of chr1) in the same plot.

plotBoxplot(result = resultlist[[1]], sampleName = sampleNames[1],
            countWindows = countWindows,
            selectedGenes = selectedGenes, showGene = c(1,2))

image

In this case, the plotBoxplot() function plots both genes in chromosomal order, that is GENE_2, GENE_1 and it is able to plot the 12 exons from GENE_1, but it misses to plot the first exon from GENE_2.

I think that the problem may be the way in which the function indexes the exons from each gene.

If I try to specify exonRange = c(1,12)for GENE_1 when plotting it alone, the function returns the following error:

plotBoxplot(result = resultlist[[1]], sampleName = sampleNames[1],
+             countWindows = countWindows, exonRange = c(1,12),
+             selectedGenes = selectedGenes, showGene = 1)

Error in plotData[(exonRange[1] + 1):(exonRange[2] + 1), ] : subscript out of bounds

It looks like the first exon was always "exonRange[1] + 1" instead of "exonRange[1] + 0"

I hope that these examples show the problem that I have and that the developer of the package can explain how to perform the plots in the right way.

A suggestion: since the plotBoxplot() function only works if the bed file is sorted in chromosomal order, it would be nice to have the option of flipping the plot for those genes that are located in the minus strand. This is especially important to keep consistency across plots for publication purposes.

Please let me know if you need any additional data to reproduce the error.

Thanks again,

Yatros commented 3 years ago

Hi again,

I have found a way of "fixing" the plotting problem. I added a "fake" line at the very beginning of my bed file. This additional "fake exon" overlaps partially with the first exon of my first real gene. Here you have the headers of my original bed file and of the "fixed" bed file.

Original bed file: image

"Fixed" bed file: image

It looks as everything was off by one record. Now all genes are plotted in the right way and the plots include all exons for each gene.

However, as I suggested previously it would be great to have the option of flipping the plots so that the genes that are located in the minus strand can be plotted in transcriptional order (i.e. Ex1, Ex2, Ex3, ...), instead of chromosomal order (i.e. Ex12, Ex11, Ex10...).

Current plot of GENE_1 once the bed file has been "fixed":

image

Now that I think about it, maybe the software was expecting a bed file with a header, but that is not what is specified in the panelcn.MOPS documentation regarding the bed file format.

Best,

gpovysil commented 3 years ago

Great catch. For some reason, I never plotted the first gene in my bed file. I know what the problem is and I am working on a solution.

gpovysil commented 3 years ago

I fixed the issue and also changed it so that if you order the countWindows differently the order will change accordingly in the plot (labels and data). So far I only committed it here and not to bioconductor. Can you please test if it solves your problem? Thanks

Yatros commented 3 years ago

Hello,

I have checked the current version of the package and there are certain things that have been fixed, but it doesn't work 100% correctly, yet.

I'll explain you step by step my reasoning process to see if it make sense to you:

My bed file contains 24 genes located in autosomal chromosomes. In total there are 506 exons, one per line.

As expected, the countWindows <- getWindows(bed) command generates a data frame with 506 observations of 6 variables each ('chromosome', 'start', 'end', 'name', 'gene' and 'exon').

However, when I run the resultlist <- runPanelcnMops(XandCB, countWindows = countWindows, selectedGenes = selectedGenes), the runPanelcnMops() function only recognizes 495 regions of interest (ROIs). See following picture:

image

I don't know the reason for these ROIs not being recognized. I guess that they may have a low read count in the original BAM file, since they affect most of the times the first or last exon of the gene. There is only one case, where it affects exons 1-4 from one gene (GENE_8). This problem already existed in the previous version of the software, too. I have always been able to include only 495 ROIs in all my analyses. The createResultTable() function is able to show the results for the 495 ROIs that were identified by the runPanelcnMops() function whenever I perform any analysis.

But every time I try to plot a gene that has a different number of exons in the bedfile/countWindows data frame compared to the number of ROIs identified by the runPanelcnMops() function, the plotBoxplot() function returns the following error:

> plotBoxplot(result = resultlist[[1]], sampleName = sampleNames[1],
+             countWindows = countWindows,
+             selectedGenes = selectedGenes, showGene = 2)
Error in plotData[geneWindowsPaste, ] : subscript out of bounds

Additionally, the first gene of the bed file cannot be plotted even though the number of records in the bedfile and the number of ROIs identified by the runPanelcnMops() function is exactly the same (n = 6; see picture above). The error looks exactly the same:

> plotBoxplot(result = resultlist[[1]], sampleName = sampleNames[1],
+             countWindows = countWindows,
+             selectedGenes = selectedGenes, showGene = 1)
Error in plotData[geneWindowsPaste, ] : subscript out of bounds

Only those genes that match the number of exons between the countWindow data frame and the resultlist object are plotted. In this case, they are plotted in the right way, that is including all exons from each gene.

In the previous version of the package, the plotBoxplot() function plotted all genes in both cases:

  1. If the number of records in countWindow and in the resultlist was different.
  2. If the number of records in both objects was exactly the same.

In both cases, all plots were always lacking one exon, though.

I haven't had time to check if the order of the countWindows works properly or not and I want to fix the current problem before digging further into other potential errors.

Let me know if you have any doubts or need further clarification of what I am doing.

I can send you my real bed file if you want to have a look at it, too.

Thanks again for your answer,

Best,

gpovysil commented 3 years ago

Sorry for the delay, only have time for this on weekends.

  1. The reason exons are missing from the results is probably that the median RC across samples was too low (medianRC < minMedianRC by default 30). These exons should be listed in the message that is output by runPanelcnMops ("Cannot use exon xxx"). You can change the threshold if you want to keep them, but the results will be less reliable.
  2. Unfortunately my fix for the order did not check first if the exons are in the results and therefore broke the plotting function. That should be fixed in the new version.

Thank you for your patience. It would be great if you could try one more time.