Introns in peak detection of RNA samples

levuBGU commented 11 months ago

Hi,

I utilized exomePeak2 for peak detection in RNA samples and observed that many methylated regions corresponded to introns. This finding surprised me since I did not specify the mode argument, and according to my understanding, the default is set to "exon."

The workflow involved taking the exomePeak2 bed file output, converting it into a bigWig file, uploading it to the UCSC genome browser custom track, and subsequently discovering the presence of introns.

For this analysis, I used paired-end BAM files as input and an NCBI GTF file (ncbiRefSeq.gtf for Rattus norvegicus). The specified arguments in the exomePeak2 function were as follows:

exomePeak2(bam_ip=control_ip_files,
           bam_input=control_input_files,
           gff=ncbiRefSeq.gtf,
           save_dir="my/experiment/path",
           experiment_name="control_input_ip_diff",
           genome="rn7",
           txdb=makeTxDbFromGFF(ncbiRefSeq.gtf, format="gtf"))

I am also adding the text from my log file when executing the R code:

start
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: ‘MatrixGenerics’

The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars

Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: ‘Biobase’

The following object is masked from ‘package:MatrixGenerics’:

    rowMedians

The following objects are masked from ‘package:matrixStats’:

    anyMissing, rowMedians

Warning messages:
1: package ‘exomePeak2’ was built under R version 4.2.2 
2: package ‘SummarizedExperiment’ was built under R version 4.2.2 
3: package ‘MatrixGenerics’ was built under R version 4.2.1 
4: package ‘BiocGenerics’ was built under R version 4.2.1 
5: package ‘IRanges’ was built under R version 4.2.1 
6: package ‘GenomeInfoDb’ was built under R version 4.2.3 
7: package ‘Biobase’ was built under R version 4.2.1 
Loading required package: AnnotationDbi
Warning messages:
1: package ‘GenomicFeatures’ was built under R version 4.2.2 
2: package ‘AnnotationDbi’ was built under R version 4.2.2 
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... OK
Estimate sample sepecific size factors from the background ... OK
Calculate bin GC contents on exons ... OK
Fit GC curves with smoothing splines ... OK
Calculate offset matrix ... OK
Detect peaks with GLM ... OK
GRangesList object of length 34629:
$`1`
GRanges object with 1 range and 0 metadata columns:
    seqnames              ranges strand
       <Rle>           <IRanges>  <Rle>
  1     chr4 154898070-154898094      +
  -------
  seqinfo: 80 sequences from an unspecified genome; no seqlengths

$`2`
GRanges object with 1 range and 0 metadata columns:
    seqnames    ranges strand
       <Rle> <IRanges>  <Rle>
  2     chrM 7758-9382      +
  -------
  seqinfo: 80 sequences from an unspecified genome; no seqlengths

$`3`
GRanges object with 1 range and 0 metadata columns:
    seqnames            ranges strand
       <Rle>         <IRanges>  <Rle>
  3    chr12 31113498-31114297      -
  -------
  seqinfo: 80 sequences from an unspecified genome; no seqlengths

...
<34626 more elements>
Warning messages:
1: In .merge_two_Seqinfo_objects(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chr19_NW_023637714v1_random, chr19_NW_023637715v1_random, chrUn_NW_023637730v1, chrUn_NW_023637732v1, chrUn_NW_023637734v1, chrUn_NW_023637735v1, chrUn_NW_023637736v1, chrUn_NW_023637737v1, chrUn_NW_023637738v1, chrUn_NW_023637740v1, chrUn_NW_023637741v1, chrUn_NW_023637742v1, chrUn_NW_023637743v1, chrUn_NW_023637744v1, chrUn_NW_023637745v1, chrUn_NW_023637747v1, chrUn_NW_023637748v1, chrUn_NW_023637749v1, chrUn_NW_023637750v1, chrUn_NW_023637751v1, chrUn_NW_023637752v1, chrUn_NW_023637753v1, chrUn_NW_023637756v1, chrUn_NW_023637757v1, chrUn_NW_023637758v1, chrUn_NW_023637760v1, chrUn_NW_023637761v1, chrUn_NW_023637762v1, chrUn_NW_023637765v1, chrUn_NW_023637766v1, chrUn_NW_023637767v1, chrUn_NW_023637768v1, chrUn_NW_023637769v1, chrUn_NW_023637772v1, chrUn_NW_023637773v1, chrUn_NW_023637775v1, chrUn_NW_023637776v1, chrUn_NW_023637779v1, chrUn_NW_023637781v1, chrUn_NW_023637782v1, chrUn_NW_023637783v1, chr [... truncated]
2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 36 out-of-bound ranges located on sequences
  Ang,LOC108352909, LOC120094777, LOC103691088, LOC108351262,
  LOC103693688, LOC120094123,Vof16, LOC102551251, LOC120097087,
  LOC120098939, LOC108350834, LOC120095272, LOC120093965, LOC120094883,
  LOC120097370, LOC120097467, LOC120094794, Galnt4, LOC120095229,
  LOC108352196, LOC120100348, LOC120096115, LOC120096778, LOC120100476,
  LOC120100393, LOC120103263, LOC120100962, LOC120093625, LOC102552101,
  LOC120100737, LOC103692527, LOC120095249, LOC108350542, LOC120097094,
  LOC120102858, LOC120095738, and LOC120093583. Note that ranges located
  on a sequence whose length is unknown (NA) or on a circular sequence
  are not considered out-of-bound (use seqlengths() and isCircular() to
  get the lengths and circularity flags of the underlying sequences). You
  can use trim() to trim these ranges. See ?`trim,GenomicRanges-method`
  for more information.
finished.

I would greatly appreciate any suggestions or insights.

ZW-xjtlu commented 11 months ago

Dear exomePeak2 User,

Thank you for reaching out regarding the unexpected findings in your RNA sample analysis using exomePeak2. Your observation of many methylated regions corresponding to introns, despite not specifying the "exon" mode, is intriguing. The root of this issue likely stems from the gene annotation file discrepancies.

You have used "ncbiRefSeq.gtf" for gene annotation, whereas the UCSC Genome Browser primarily utilizes its UCSC Known Genes dataset. It's important to understand that gene annotations in the NCBI RefSeq database and the UCSC Known Genes are not identical, though they share similarities. This difference can potentially influence your analysis outcomes.

Here are some key distinctions between these two annotation sources:

Source and Methods: RefSeq, managed by the NCBI, offers a well-integrated set of sequences, including DNA, transcripts, and proteins, sourced from various databases and curated by the NCBI team. Conversely, UCSC Known Genes are compiled by aligning protein and mRNA sequences to the human genome, deriving data from RefSeq, GenBank, UniProt, and the ENCODE project.

Annotation Philosophy: RefSeq aims to present a singular, high-quality representative sequence for each gene, which might limit the representation of alternative splicing variants. In contrast, UCSC Known Genes encompass a wider array of splice variants, which might lead to more gene and transcript models, particularly in genes with complex splicing patterns.

Update Frequency and Versioning: Both databases undergo regular updates, but the timing and methods may differ, leading to potential discrepancies at any given moment.

Organism Specificity: While RefSeq encompasses a broad range of organisms, UCSC Known Genes primarily focus on the human genome, along with other model organisms.

These differences underline why your results may not align with the expected "exon" mode defaults. For a more comprehensive understanding, cross-referencing both RefSeq and UCSC Known Genes is advisable.

Should you have further questions or require additional assistance, please do not hesitate to contact me.

Best Regards, Zhen Wei

levuBGU commented 11 months ago

Hi Zhen,

Thank you for the detailed explanation. I've found a path on the UCSC website leading to two annotation files: https://hgdownload.soe.ucsc.edu/goldenPath/rn7/bigZips/genes/.

Given that UCSC annotations are sourced from multiple databases, is there a straightforward method to perform cross-referencing or to utilize UCSC annotations directly in exomePeak2, possibly through a specific argument or setting?

Thanks again for your assistance.

Best regards, Uri Levy

ZW-xjtlu / exomePeak2

Introns in peak detection of RNA samples #41