RJWANGbioinfo / APAlyzer

APAlyzer is a toolkit for bioinformatic analysis of alternative polyadenylation (APA) events using RNA sequencing data. Our main approach is the comparison of sequencing reads in regions demarcated by high-quality polyadenylation sites (PASs) annotated in the PolyA_DB database (https://exon.apps.wistar.org/PolyA_DB/v3/). The current version (v3.0) uses RNA-seq data to examine APA events in 3’ untranslated regions (3’UTRs) and in introns. The coding regions are used for gene expression calculation.
https://bioconductor.org/packages/release/bioc/html/APAlyzer.html
GNU Lesser General Public License v3.0

PASEXP_IPA 'Error in dflength$end - dflength$start: non-numeric argument to binary operator' from GTF constructed reference #19

Closed reck999 closed 7 months ago

reck999 commented 7 months ago

I was able to successfully analyze 3' UTR from the dataset GSE230025 aligned to the UCSC genome after creating a reference for C. elegans using the PAS2GEF function from the Ensembl GTF. When I went to analyze intronic APA, I received the error 'dflength$end - dflength$start: non-numeric argument to binary operator' from PASEXP_IPA. I have included my code and session info below. Is there an error in my reference construction or code that could explain this roadblock? Is there anything I can correct to run the analysis? I am happy to provide more information or bam files. Thank you so much for this great package!

setwd("E:/Celegans_TDP1_UCSC")
library(APAlyzer)
library(repmis)
library(GenomicRanges)
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq,
Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:utils’:

findMatches

The following objects are masked from ‘package:base’:

expand.grid, I, unname

Loading required package: IRanges

Attaching package: ‘IRanges’

The following object is masked from ‘package:grDevices’:

windows

Loading required package: GenomeInfoDb

# Building a worm reference

download.file(url='https://ftp.ensembl.org/pub/release-111/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.111.gtf.gz',

GTFfile="Caenorhabditis_elegans.WBcel235.111.gtf.gz"
PASREFraw=PAS2GEF(GTFfile)
[1] "PAS2GEF: Reading GTF file"
[1] "PAS2GEF: Extracting and annotating all PASs"
[1] "PAS2GEF: Extracting and filtering 3'UTR PASs"
[1] "PAS2GEF: Extracting IPAs"
[1] "PAS2GEF: Extracting 3' last exons"
[1] "PAS2GEF: Finalizing references"
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.
refUTRraw=PASREFraw$refUTRraw
dfIPAraw=PASREFraw$dfIPA
dfLEraw=PASREFraw$dfLE
PASREF=REF4PAS(refUTRraw,dfIPAraw,dfLEraw)
dfIPA=PASREF$dfIPA
dfLE=PASREF$dfLE
UTRdbraw=REF3UTR(refUTRraw)

# RNA-seq BAM files

flsall <- dir(getwd(),".bam")
flsall<-paste0(getwd(),'/',flsall)
names(flsall)<-gsub('.bam','',dir(getwd(),".bam"))

# Calculation of UTR and IPA

IPA_OUTraw=PASEXP_IPA(PASREF$dfIPA,dfLE, flsall, SeqType ='ThreeMostPairEnd')
Error in dflength$end - dflength$start : non-numeric argument to binary operator

sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicRanges_1.54.1 GenomeInfoDb_1.38.8 IRanges_2.36.0 S4Vectors_0.40.2 BiocGenerics_0.48.1 repmis_0.5
[7] APAlyzer_1.16.0

loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 dplyr_1.1.4 blob_1.2.4 R.utils_2.12.3
[5] filelock_1.0.3 Biostrings_2.70.3 bitops_1.0-7 fastmap_1.1.1
[9] RCurl_1.98-1.14 BiocFileCache_2.10.1 VariantAnnotation_1.48.1 GenomicAlignments_1.38.2
[13] XML_3.99-0.16.1 digest_0.6.35 lifecycle_1.0.4 KEGGREST_1.42.0
[17] RSQLite_2.3.5 magrittr_2.0.3 compiler_4.3.3 rlang_1.1.3
[21] progress_1.2.3 tools_4.3.3 utf8_1.2.4 yaml_2.3.8
[25] data.table_1.15.2 rtracklayer_1.62.0 prettyunits_1.2.0 S4Arrays_1.2.1
[29] bit_4.0.5 curl_5.2.1 DelayedArray_0.28.0 plyr_1.8.9
[33] xml2_1.3.6 abind_1.4-5 BiocParallel_1.34.2 R.cache_0.16.0
[37] purrr_1.0.2 R.oo_1.26.0 grid_4.3.3 fansi_1.0.6
[41] colorspace_2.1-0 ggplot2_3.5.0 scales_1.3.0 biomaRt_2.58.2
[45] Rsubread_2.16.1 SummarizedExperiment_1.32.0 cli_3.6.2 crayon_1.5.2
[49] generics_0.1.3 HybridMTest_1.46.0 rstudioapi_0.16.0 httr_1.4.7
[53] rjson_0.2.21 DBI_1.2.2 cachem_1.0.8 stringr_1.5.1
[57] zlibbioc_1.48.2 parallel_4.3.3 AnnotationDbi_1.64.1 XVector_0.42.0
[61] restfulr_0.0.15 matrixStats_1.2.0 vctrs_0.6.5 Matrix_1.6-5
[65] hms_1.1.3 bit64_4.0.5 ggrepel_0.9.5 GenomicFeatures_1.54.4
[69] locfit_1.5-9.9 tidyr_1.3.1 glue_1.7.0 codetools_0.2-19
[73] stringi_1.8.3 gtable_0.3.4 BiocIO_1.12.0 munsell_0.5.0
[77] tibble_3.2.1 pillar_1.9.0 rappdirs_0.3.3 BSgenome_1.70.2
[81] GenomeInfoDbData_1.2.11 R6_2.5.1 dbplyr_2.5.0 lattice_0.22-6
[85] Biobase_2.60.0 R.methodsS3_1.8.2 png_0.1-8 Rsamtools_2.18.0
[89] memoise_2.0.1 Rcpp_1.0.12 SparseArray_1.2.4 DESeq2_1.42.1
[93] MatrixGenerics_1.14.0 pkgconfig_2.0.3
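
For readers who hit the same message: base R raises this error when an arithmetic operator receives a character vector. One plausible cause (an assumption, not confirmed against the APAlyzer source) is that coordinate columns in the GTF-derived reference ended up as character rather than numeric. The sketch below reproduces the message on a toy data frame. The column names `start` and `end` come from the error text; the object `toy` and its values are made up for illustration.

```r
# Toy stand-in for the internal `dflength` object (hypothetical values).
toy <- data.frame(start = c("100", "250"), end = c("180", "400"),
                  stringsAsFactors = FALSE)

# Inspect column classes; character coordinates are the suspected culprit.
col_classes <- sapply(toy, class)
print(col_classes)

# Subtracting character vectors triggers the same kind of error as in this issue.
err <- tryCatch(toy$end - toy$start, error = function(e) conditionMessage(e))
print(err)

# After conversion to numeric the subtraction works.
toy$start <- as.numeric(toy$start)
toy$end <- as.numeric(toy$end)
lengths_bp <- toy$end - toy$start
print(lengths_bp)
```

Running `sapply(dfIPA, class)` and `sapply(dfLE, class)` on the real reference objects should show whether any coordinate column is character.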

Toutouflipi commented 7 months ago

I think someone used this code for the same problem before:

# ensure that coordinates are numeric

dfIPA$Pos = as.numeric(as.character(dfIPA$Pos))
dfIPA$upstreamSS = as.numeric(as.character(dfIPA$upstreamSS))
dfIPA$downstreamSS = as.numeric(as.character(dfIPA$downstreamSS))
dfLE$LEstart = as.numeric(as.character(dfLE$LEstart))
dfLE$TES = as.numeric(as.character(dfLE$TES))

(It helped me when I had a similar issue.)
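
The same conversion can be wrapped in a small helper that also stops if a value cannot be converted, so that bad coordinates do not silently become `NA`. This is only a sketch: `coerce_numeric_cols` is a hypothetical helper, not part of APAlyzer, and `toy_ipa` uses made-up values with the column names mentioned in this thread.

```r
# Hypothetical helper: coerce the named columns of a data frame to numeric
# and stop if any non-missing value fails to convert (i.e. becomes NA).
coerce_numeric_cols <- function(df, cols) {
  for (col in cols) {
    was_na <- is.na(df[[col]])
    converted <- suppressWarnings(as.numeric(as.character(df[[col]])))
    bad <- is.na(converted) & !was_na
    if (any(bad)) {
      stop(sprintf("column '%s' has %d non-numeric value(s)", col, sum(bad)))
    }
    df[[col]] <- converted
  }
  df
}

# Toy stand-in for dfIPA (made-up values; column names taken from the thread).
toy_ipa <- data.frame(Pos = c("1200", "5300"),
                      upstreamSS = c("1000", "5000"),
                      downstreamSS = c("1500", "5600"),
                      stringsAsFactors = FALSE)
toy_ipa <- coerce_numeric_cols(toy_ipa, c("Pos", "upstreamSS", "downstreamSS"))
str(toy_ipa)
```

On the real objects this would presumably be `dfIPA <- coerce_numeric_cols(dfIPA, c("Pos", "upstreamSS", "downstreamSS"))` and `dfLE <- coerce_numeric_cols(dfLE, c("LEstart", "TES"))`, run before `PASEXP_IPA`. Note that the call in the original post passes `PASREF$dfIPA`, so the converted `dfIPA` would need to be passed instead.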

Good luck!!

reck999 commented 7 months ago

This worked! Thank you for the quick response! Closing this thread now.