Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome
7 stars 9 forks source link

snpsBySeqname and snpsByOverlaps does not work. #65

Closed YingkaiSun closed 1 year ago

YingkaiSun commented 1 year ago

My codes are as below:

library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
library(BSgenome.Hsapiens.NCBI.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38
snpsById(snps,'rs2531267') 
snpsByOverlaps(snps, "X:3e6-8e6", genome="GRCh38")
snpsBySeqname(snps,c('1','MT'))

return error as below

# > snpsById(snps,'rs2531267') 
# UnstitchedGPos object with 1 position and 2 metadata columns:
#      seqnames       pos strand |   RefSNP_id alleles_as_ambig
#         <Rle> <integer>  <Rle> | <character>      <character>
#  [1]        1     69569      * |   rs2531267                Y
#  -------
#  seqinfo: 25 sequences (1 circular) from GRCh38.p2 genome

# > snpsByOverlaps(snps, "X:3e6-8e6", genome="GRCh38")
# Error in extract_rowids(x_rowids_env, rowidx) : 
#  is.integer(idx) is not TRUE

# > snpsBySeqname(snps,c('1','MT'))
# Error in extract_rowids(x_rowids_env, rowidx) : 
#  is.integer(idx) is not TRUE

Running on M2 max Macbook pro, Macos Ventura v13.0, 96G RAM

sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.0

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reshape2_1.4.4                           forcats_1.0.0                            stringr_1.5.0                           
 [4] dplyr_1.1.0                              purrr_1.0.1                              readr_2.1.4                             
 [7] tidyr_1.3.0                              tibble_3.1.8                             ggplot2_3.4.1                           
[10] tidyverse_1.3.2                          BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000   SNPlocs.Hsapiens.dbSNP144.GRCh38_0.99.20
[13] BSgenome_1.66.2                          rtracklayer_1.58.0                       Biostrings_2.66.0                       
[16] XVector_0.38.0                           GenomicRanges_1.50.2                     GenomeInfoDb_1.34.9                     
[19] IRanges_2.32.0                           S4Vectors_0.36.1                         BiocGenerics_0.44.0                     

loaded via a namespace (and not attached):
  [1] googledrive_2.0.0           colorspace_2.1-0            rjson_0.2.21                ellipsis_0.3.2             
  [5] futile.logger_1.4.3         fs_1.6.1                    rstudioapi_0.14             bedr_1.0.7                 
  [9] bit64_4.0.5                 AnnotationDbi_1.60.0        fansi_1.0.4                 lubridate_1.9.2            
 [13] xml2_1.3.3                  codetools_0.2-19            R.methodsS3_1.8.2           cachem_1.0.6               
 [17] jsonlite_1.8.4              Rsamtools_2.14.0            packrat_0.9.0               broom_1.0.3                
 [21] dbplyr_2.3.0                png_0.1-8                   R.oo_1.25.0                 BiocManager_1.30.19        
 [25] compiler_4.2.2              httr_1.4.4                  backports_1.4.1             assertthat_0.2.1           
 [29] Matrix_1.5-3                fastmap_1.1.0               gargle_1.3.0                cli_3.6.0                  
 [33] formatR_1.14                prettyunits_1.1.1           tools_4.2.2                 gtable_0.3.1               
 [37] glue_1.6.2                  GenomeInfoDbData_1.2.9      rappdirs_0.3.3              Rcpp_1.0.10                
 [41] Biobase_2.58.0              cellranger_1.1.0            vctrs_0.5.2                 brio_1.1.3                 
 [45] rvest_1.0.3                 testthat_3.1.6              timechange_0.2.0            lifecycle_1.0.3            
 [49] restfulr_0.0.15             XML_3.99-0.13               googlesheets4_1.0.1         zlibbioc_1.44.0            
 [53] scales_1.2.1                VariantAnnotation_1.44.0    hms_1.1.2                   MatrixGenerics_1.10.0      
 [57] parallel_4.2.2              SummarizedExperiment_1.28.0 lambda.r_1.2.4              yaml_2.3.7                 
 [61] curl_5.0.0                  memoise_2.0.1               biomaRt_2.54.0              stringi_1.7.12             
 [65] RSQLite_2.2.20              BiocIO_1.8.0                gwasvcf_0.1.1               GenomicFeatures_1.50.4     
 [69] filelock_1.0.2              BiocParallel_1.32.5         rlang_1.0.6                 pkgconfig_2.0.3            
 [73] matrixStats_0.63.0          bitops_1.0-7                lattice_0.20-45             GenomicAlignments_1.34.0   
 [77] bit_4.0.5                   tidyselect_1.2.0            plyr_1.8.8                  magrittr_2.0.3             
 [81] R6_2.5.1                    generics_0.1.3              DelayedArray_0.24.0         DBI_1.1.3                  
 [85] withr_2.5.0                 pillar_1.8.1                haven_2.5.1                 KEGGREST_1.38.0            
 [89] RCurl_1.98-1.10             modelr_0.1.10               crayon_1.5.2                futile.options_1.0.1       
 [93] utf8_1.2.3                  BiocFileCache_2.6.0         tzdb_0.3.0                  progress_1.2.2             
 [97] readxl_1.4.2                grid_4.2.2                  data.table_1.14.6           blob_1.2.3                 
[101] reprex_2.0.2                digest_0.6.31               VennDiagram_1.7.3           R.utils_2.12.2             
[105] munsell_0.5.0  
vjcitn commented 1 year ago

Thanks for the careful reporting. I cannot reproduce this on a linux host. sessionInfo below.

> source("err.R", echo=TRUE)

> library(SNPlocs.Hsapiens.dbSNP144.GRCh38)

> library(BSgenome.Hsapiens.NCBI.GRCh38)

> snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38

> snpsById(snps,'rs2531267') 
UnstitchedGPos object with 1 position and 2 metadata columns:
      seqnames       pos strand |   RefSNP_id alleles_as_ambig
         <Rle> <integer>  <Rle> | <character>      <character>
  [1]        1     69569      * |   rs2531267                Y
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38.p2 genome

> snpsByOverlaps(snps, "X:3e6-8e6", genome="GRCh38")
UnstitchedGPos object with 166953 positions and 5 metadata columns:
           seqnames       pos strand |   RefSNP_id alleles_as_ambig
              <Rle> <integer>  <Rle> | <character>      <character>
       [1]        X   3000004      * | rs369882522                Y
       [2]        X   3000013      * | rs374307143                Y
       [3]        X   3000014      * |  rs73437584                R
       [4]        X   3000036      * | rs113897265                R
       [5]        X   3000038      * |  rs79982205                M
       ...      ...       ...    ... .         ...              ...
  [166949]        X   7999835      * | rs775860267                S
  [166950]        X   7999839      * | rs368090328                R
  [166951]        X   7999902      * | rs190898710                R
  [166952]        X   7999926      * | rs772716820                R
  [166953]        X   7999951      * | rs181626818                R
           genome_compat  ref_allele     alt_alleles
               <logical> <character> <CharacterList>
       [1]          TRUE           C               T
       [2]          TRUE           C               T
       [3]          TRUE           G               A
       [4]          TRUE           A               G
       [5]          TRUE           A               C
       ...           ...         ...             ...
  [166949]          TRUE           C               G
  [166950]          TRUE           A               G
  [166951]          TRUE           G               A
  [166952]          TRUE           A               G
  [166953]          TRUE           G               A
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38.p2 genome

> snpsBySeqname(snps,c('1','MT'))
UnstitchedGPos object with 10354168 positions and 2 metadata columns:
             seqnames       pos strand |   RefSNP_id alleles_as_ambig
                <Rle> <integer>  <Rle> | <character>      <character>
         [1]        1     10108      * |  rs62651026                Y
         [2]        1     10109      * | rs376007522                W
         [3]        1     10139      * | rs368469931                W
         [4]        1     10150      * | rs371194064                Y
         [5]        1     10177      * | rs201752861                M
         ...      ...       ...    ... .         ...              ...
  [10354164]       MT     16512      * | rs373943637                Y
  [10354165]       MT     16519      * |   rs3937033                Y
  [10354166]       MT     16528      * | rs386829315                Y
  [10354167]       MT     16529      * | rs370705831                Y
  [10354168]       MT     16529      * | rs386829316                Y
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38.p2 genome
> sessionInfo()
R Under development (unstable) (2023-01-17 r83632)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /home/stvjc/R-devel-dist/lib/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SNPlocs.Hsapiens.dbSNP144.GRCh38_0.99.20
 [2] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000  
 [3] BSgenome_1.67.3                         
 [4] rtracklayer_1.59.1                      
 [5] Biostrings_2.67.0                       
 [6] XVector_0.39.0                          
 [7] GenomicRanges_1.51.4                    
 [8] GenomeInfoDb_1.35.15                    
 [9] IRanges_2.33.0                          
[10] S4Vectors_0.37.3                        
[11] BiocGenerics_0.45.0                     

loaded via a namespace (and not attached):
 [1] crayon_1.5.2                DelayedArray_0.25.0        
 [3] SummarizedExperiment_1.29.1 GenomicAlignments_1.35.0   
 [5] rjson_0.2.21                RCurl_1.98-1.10            
 [7] XML_3.99-0.13               MatrixGenerics_1.11.0      
 [9] Biobase_2.59.0              grid_4.3.0                 
[11] restfulr_0.0.15             bitops_1.0-7               
[13] yaml_2.3.7                  BiocManager_1.30.19        
[15] compiler_4.3.0              codetools_0.2-19           
[17] BiocParallel_1.33.9         lattice_0.20-45            
[19] BiocIO_1.9.2                parallel_4.3.0             
[21] GenomeInfoDbData_1.2.9      Matrix_1.5-3               
[23] tools_4.3.0                 matrixStats_0.63.0         
[25] Rsamtools_2.15.1            zlibbioc_1.45.0      
vjcitn commented 1 year ago

I confirm the error on an M1 mac. Digging in now.

vjcitn commented 1 year ago

I can't replicate with Bioc 3.16 on linux either

> sessionInfo()
R version 4.2.2 Patched (2022-12-25 r83511)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /home/stvjc/R-4-2-dist/lib/R/lib/libRblas.so
LAPACK: /home/stvjc/R-4-2-dist/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000  
 [2] SNPlocs.Hsapiens.dbSNP144.GRCh38_0.99.20
 [3] BSgenome_1.66.2                         
 [4] rtracklayer_1.58.0                      
 [5] Biostrings_2.66.0                       
 [6] XVector_0.38.0                          
 [7] GenomicRanges_1.50.2                    
 [8] GenomeInfoDb_1.34.9                     
 [9] IRanges_2.32.0                          
[10] S4Vectors_0.36.1                        
[11] BiocGenerics_0.44.0                     

loaded via a namespace (and not attached):
 [1] zlibbioc_1.44.0             GenomicAlignments_1.34.0   
 [3] BiocParallel_1.32.5         lattice_0.20-45            
 [5] rjson_0.2.21                tools_4.2.2                
 [7] grid_4.2.2                  SummarizedExperiment_1.28.0
 [9] parallel_4.2.2              Biobase_2.58.0             
[11] matrixStats_0.63.0          yaml_2.3.6                 
[13] crayon_1.5.2                BiocIO_1.8.0               
[15] Matrix_1.5-3                GenomeInfoDbData_1.2.9     
[17] BiocManager_1.30.19         restfulr_0.0.15            
[19] bitops_1.0-7                codetools_0.2-18           
[21] RCurl_1.98-1.9              DelayedArray_0.24.0        
[23] compiler_4.2.2              MatrixGenerics_1.10.0      
[25] Rsamtools_2.14.0            XML_3.99-0.13  

So something macOS-specific in BSgenome?? At this point

 Error in extract_rowids(x_rowids_env, rowidx) : 
#  is.integer(idx) is not TRUE

which is a call to a function in BSgenome, rowidx is an IRanges on macOS

YingkaiSun commented 1 year ago

Thanks for your kindly response! Vince. This problem have confused me a whole day, Is that soluble? or maybe I have to change back to an intel Mac?

hpages commented 1 year ago

Could it be that you guys installed BSgenome directly from GitHub at some point by any chance? I think I remember introducing this error in the package and fixing it a couple of days later on GitHub, but at the time I didn't feel the need to bump the version because the error never propagated to the BioC repos. This was at the beginning of January. With hindsight I should have probably bumped it anyways to avoid any possible confusion, sorry for that.

FWIW the fix is in commit 0b63ff052e9b10472f8cda8a415c4fccf53715f5

Anyways, can you try to reinstall the latest BSgenome with BiocManager::install("BSgenome", force=TRUE) and see if the problem goes away? Thanks!

YingkaiSun commented 1 year ago

Thanks! I have tried to reinstall both BSgenome and IRanges, but it does not work.

vjcitn commented 1 year ago

With this sessionInfo, it works on the M1 mac. I build BSgenome from source and reinstalled.

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.0.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000  
 [2] SNPlocs.Hsapiens.dbSNP144.GRCh38_0.99.20
 [3] BSgenome_1.66.2                         
 [4] rtracklayer_1.58.0                      
 [5] Biostrings_2.66.0                       
 [6] XVector_0.38.0                          
 [7] GenomicRanges_1.50.2                    
 [8] GenomeInfoDb_1.34.9                     
 [9] IRanges_2.32.0                          
[10] S4Vectors_0.36.1                        
[11] BiocGenerics_0.44.0                     
[12] rmarkdown_2.20                          

loaded via a namespace (and not attached):
 [1] compiler_4.2.2              restfulr_0.0.15            
 [3] MatrixGenerics_1.10.0       bitops_1.0-7               
 [5] tools_4.2.2                 zlibbioc_1.44.0            
 [7] digest_0.6.31               lattice_0.20-45            
 [9] evaluate_0.20               rlang_1.0.6                
[11] Matrix_1.5-3                DelayedArray_0.24.0        
[13] cli_3.6.0                   yaml_2.3.7                 
[15] parallel_4.2.2              xfun_0.37                  
[17] fastmap_1.1.0               GenomeInfoDbData_1.2.9     
[19] knitr_1.42                  grid_4.2.2                 
[21] Biobase_2.58.0              XML_3.99-0.13              
[23] BiocParallel_1.32.5         startup_0.19.0             
[25] Rsamtools_2.14.0            codetools_0.2-19           
[27] htmltools_0.5.4             matrixStats_0.63.0         
[29] GenomicAlignments_1.34.0    SummarizedExperiment_1.28.0
[31] RCurl_1.98-1.10             crayon_1.5.2               
[33] rjson_0.2.21                BiocIO_1.8.0               
> 
vjcitn commented 1 year ago

I think this mystery is solved. I installed the binary and it FAILED. Not bumping the version number is the problem. There is a bad binary in the distribution.

vjcitn commented 1 year ago

Somehow the build system got the unintended changes @hpages . I will try to make an accessible binary for OP

vjcitn commented 1 year ago

@YingkaiSun send your email addr to stvjc at channing dot harvard dot edu and I will send you a repaired binary

Or you could wait a few days for the next version to propagate.

hpages commented 1 year ago

arghh!!.. thanks @vjcitn for catching this. I've no idea how this binary could make it to the public repo! Let me bump BSgenome version right now.

@YingkaiSun In the mean time please install with BiocManager::install("BSgenome", type="source", force=TRUE). Sorry for that.

hpages commented 1 year ago

I've no idea how this binary could make it to the public repo!

Oh, I know, the Mac arm64 builders don't run CHECK, that's why!

hpages commented 1 year ago

Version bumped to 1.66.3 in release. (Was already bumped several times in devel after the fix so all is good there.)

YingkaiSun commented 1 year ago

@hpages @vjcitn Thanks! it works!