Open farshadf opened 5 years ago
Can you please provide your sessionInfo()
? Thanks.
Sorry, I should have added that already:
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS release 6.9 (Final)
Matrix products: default
BLAS/LAPACK: /data/miniconda3/lib/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] xtable_1.8-3 rtracklayer_1.42.2 Rsamtools_1.34.0
[4] Biostrings_2.50.2 XVector_0.22.0 GenomicRanges_1.34.0
[7] GenomeInfoDb_1.18.1 IRanges_2.16.0 S4Vectors_0.20.1
[10] BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.28.0 GenomicAlignments_1.18.1
[3] BiocParallel_1.14.2 BSgenome_1.50.0
[5] lattice_0.20-38 tools_3.5.1
[7] SummarizedExperiment_1.12.0 grid_3.5.1
[9] Biobase_2.42.0 matrixStats_0.54.0
[11] yaml_2.2.0 Matrix_1.2-15
[13] GenomeInfoDbData_1.2.1 BiocManager_1.30.4
[15] bitops_1.0-6 RCurl_1.95-4.11
[17] DelayedArray_0.8.0 compiler_3.5.1
[19] XML_3.98-1.16
Thanks but in your original post you said you were using Rsamtools 2.0.2 but now your sessionInfo()
reports that you were using Rsamtools 1.34.0. We need to see the full transcript of the R session that is producing the error. The transcript should contain the call to sessionInfo()
at the end of the session. So we know exactly what command you used, what error you got, on which OS you are, and what version of Bioconductor/R you use. Thanks!
I am having a similar issue. The error happens from chromosome 13 onwards. This is in windows
library(Rsamtools)
> fastaFile<-"F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta"
> FASTA <- Rsamtools::FaFile(fastaFile)
> FASTA
class: FaFile
path: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
index: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta.fai
gzindex: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta.gzi
isOpen: FALSE
yieldSize: NA
> for (chrom in as.character(c(1:22,"X","Y"))){
+ try(
+ print(getSeq(FASTA, GRanges(chrom, IRanges(41194312,41194322))))
+ )
+ }
A DNAStringSet instance of length 1
width seq names
[1] 11 TTTTGTTTTGT 1
A DNAStringSet instance of length 1
width seq names
[1] 11 GCCATAAAAAA 2
A DNAStringSet instance of length 1
width seq names
[1] 11 GTATTTACAAA 3
A DNAStringSet instance of length 1
width seq names
[1] 11 CCCAAAGGAAA 4
A DNAStringSet instance of length 1
width seq names
[1] 11 TGGATAAGAAT 5
A DNAStringSet instance of length 1
width seq names
[1] 11 GTAATGCCATT 6
A DNAStringSet instance of length 1
width seq names
[1] 11 TTTGATTTAAG 7
A DNAStringSet instance of length 1
width seq names
[1] 11 ATTCCAGCACC 8
A DNAStringSet instance of length 1
width seq names
[1] 11 TAAGTTGGGGA 9
A DNAStringSet instance of length 1
width seq names
[1] 11 NNNNNNNNNNN 10
A DNAStringSet instance of length 1
width seq names
[1] 11 ATTTTTAAAAC 11
A DNAStringSet instance of length 1
width seq names
[1] 11 TTAAGCAATAT 12
Error in value[[3L]](cond) : record 1 (13:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (14:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (15:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (16:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (17:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (18:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (19:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (20:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (21:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (22:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (X:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Error in value[[3L]](cond) : record 1 (Y:41194312-41194322) failed
file: F:/ahwan.pandey/Projects/TargetedExome/Project_DG/MOCOG/ref_test/human_g1k_v37.fasta
Here's my sessionInfo():
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsamtools_2.2.3 Biostrings_2.54.0 XVector_0.26.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 IRanges_2.20.2 S4Vectors_0.24.4
[8] BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 zlibbioc_1.32.0 data.table_1.12.8 BiocParallel_1.20.1 tools_3.6.3 RCurl_1.98-1.2 compiler_3.6.3
[8] GenomeInfoDbData_1.2.2
I installed Rsamtools_2.4.0
and the same error persists.
Another thing I tried was to delete the fasta index and let Rsamtools create the index while running. The index it creates is weird from chr14 onwards.
It appends "1844674407" i.e. 2^64 to the third column:
1 249250621 52 60 61
2 243199373 253404903 60 61
3 198022430 500657651 60 61
4 191154276 701980507 60 61
5 180915260 896320740 60 61
6 171115067 1080251307 60 61
7 159138663 1254218344 60 61
8 146364022 1416009371 60 61
9 141213431 1564812846 60 61
10 135534747 1708379889 60 61
11 135006516 1846173603 60 61
12 133851895 1983430282 60 61
13 115169878 2119513096 60 61
14 107349540 18446744071651186846 60 61
15 102531392 18446744071760325599 60 61
16 90354753 18446744071864565901 60 61
17 81195210 18446744071956426620 60 61
18 78077248 18446744072038975137 60 61
19 59128983 18446744072118353726 60 61
20 63025520 18446744072178468246 60 61
21 48129895 18446744072242544245 60 61
22 51304566 18446744072291476358 60 61
X 155270560 18446744072343636053 60 61
Y 59373566 18446744072501494513 60 61
MT 16569 18446744072561857717 70 71
GL000207.1 4262 18446744072561874585 60 61
GL000226.1 15008 18446744072561878981 60 61
GL000229.1 19913 18446744072561894302 60 61
GL000231.1 27386 18446744072561914609 60 61
GL000210.1 27682 18446744072561942514 60 61
GL000239.1 33824 18446744072561970720 60 61
GL000235.1 34474 18446744072562005170 60 61
GL000201.1 36148 18446744072562040281 60 61
GL000247.1 36422 18446744072562077094 60 61
GL000245.1 36651 18446744072562114186 60 61
GL000197.1 37175 18446744072562151510 60 61
GL000203.1 37498 18446744072562189367 60 61
GL000246.1 38154 18446744072562227552 60 61
GL000249.1 38502 18446744072562266404 60 61
GL000196.1 38914 18446744072562305610 60 61
GL000248.1 39786 18446744072562345235 60 61
GL000244.1 39929 18446744072562385747 60 61
GL000238.1 39939 18446744072562426404 60 61
GL000202.1 40103 18446744072562467071 60 61
GL000234.1 40531 18446744072562507905 60 61
GL000232.1 40652 18446744072562549174 60 61
GL000206.1 41001 18446744072562590566 60 61
GL000240.1 41933 18446744072562632313 60 61
GL000236.1 41934 18446744072562675007 60 61
GL000241.1 42152 18446744072562717702 60 61
GL000243.1 43341 18446744072562760619 60 61
GL000242.1 43523 18446744072562804745 60 61
GL000230.1 43691 18446744072562849056 60 61
GL000237.1 45867 18446744072562893538 60 61
GL000233.1 45941 18446744072562940232 60 61
GL000204.1 81310 18446744072562987001 60 61
GL000198.1 90085 18446744072563069729 60 61
GL000208.1 92689 18446744072563161378 60 61
GL000191.1 106433 18446744072563255675 60 61
GL000227.1 128374 18446744072563363945 60 61
GL000228.1 129120 18446744072563494522 60 61
GL000214.1 137718 18446744072563625857 60 61
GL000221.1 155397 18446744072563765934 60 61
GL000209.1 159169 18446744072563923984 60 61
GL000218.1 161147 18446744072564085869 60 61
GL000220.1 161802 18446744072564249765 60 61
GL000213.1 164239 18446744072564414327 60 61
GL000211.1 166566 18446744072564581367 60 61
GL000199.1 169874 18446744072564750773 60 61
GL000217.1 172149 18446744072564923542 60 61
GL000216.1 172294 18446744072565098624 60 61
GL000215.1 172545 18446744072565273853 60 61
GL000205.1 174588 18446744072565449337 60 61
GL000219.1 179198 18446744072565626898 60 61
GL000224.1 179693 18446744072565809146 60 61
GL000223.1 180455 18446744072565991897 60 61
GL000195.1 182896 18446744072566175423 60 61
GL000212.1 186858 18446744072566361431 60 61
GL000222.1 186861 18446744072566551467 60 61
GL000200.1 187035 18446744072566741506 60 61
GL000193.1 189789 18446744072566931722 60 61
GL000194.1 191469 18446744072567124738 60 61
GL000225.1 211173 18446744072567319462 60 61
GL000192.1 547496 18446744072567534218 60 61
I have the same issue on Windows; I'll possibly try to dig around with a debugger in the C part of the library at some point later this week, and also try running on different operating systems.
My sessionInfo()
:
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] Rsamtools_2.8.0 Biostrings_2.60.1 XVector_0.32.0 motifmatchr_1.14.0 chromVAR_1.14.0
[6] SummarizedExperiment_1.22.0 Biobase_2.52.0 GenomicRanges_1.44.0 GenomeInfoDb_1.28.0 modules_0.10.0
[11] ggplot2_3.3.3 tibble_3.1.2 irlba_2.3.3 uwot_0.1.10 HDF5Array_1.20.0
[16] DelayedArray_0.18.0 IRanges_2.26.0 S4Vectors_0.30.0 MatrixGenerics_1.4.0 matrixStats_0.59.0
[21] BiocGenerics_0.38.0 Matrix_1.3-4 rhdf5_2.36.0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 DirichletMultinomial_1.34.0 TFBSTools_1.30.0 bit64_4.0.5 httr_1.4.2
[6] tools_4.1.0 utf8_1.2.1 R6_2.5.0 DT_0.18 lazyeval_0.2.2
[11] seqLogo_1.58.0 DBI_1.1.1 colorspace_2.0-1 rhdf5filters_1.4.0 withr_2.4.2
[16] tidyselect_1.1.1 bit_4.0.4 compiler_4.1.0 plotly_4.9.4 rtracklayer_1.52.0
[21] caTools_1.18.2 scales_1.1.1 readr_1.4.0 stringr_1.4.0 digest_0.6.27
[26] R.utils_2.10.1 pkgconfig_2.0.3 htmltools_0.5.1.1 fastmap_1.1.0 BSgenome_1.60.0
[31] htmlwidgets_1.5.3 rlang_0.4.11 rstudioapi_0.13 RSQLite_2.2.7 shiny_1.6.0
[36] BiocIO_1.2.0 generics_0.1.0 jsonlite_1.7.2 BiocParallel_1.26.0 gtools_3.9.2
[41] R.oo_1.24.0 dplyr_1.0.6 RCurl_1.98-1.3 magrittr_2.0.1 GO.db_3.13.0
[46] GenomeInfoDbData_1.2.6 Rcpp_1.0.6 munsell_0.5.0 Rhdf5lib_1.14.1 fansi_0.5.0
[51] R.methodsS3_1.8.1 lifecycle_1.0.0 stringi_1.6.2 yaml_2.2.1 zlibbioc_1.38.0
[56] plyr_1.8.6 grid_4.1.0 blob_1.2.1 promises_1.2.0.1 crayon_1.4.1
[61] miniUI_0.1.1.1 CNEr_1.28.0 lattice_0.20-44 annotate_1.70.0 KEGGREST_1.32.0
[66] hms_1.1.0 pillar_1.6.1 rjson_0.2.20 JASPAR2016_1.20.0 reshape2_1.4.4
[71] TFMPvalue_0.0.8 XML_3.99-0.6 glue_1.4.2 data.table_1.14.0 renv_0.13.2
[76] BiocManager_1.30.15 png_0.1-7 vctrs_0.3.8 httpuv_1.6.1 tidyr_1.1.3
[81] gtable_0.3.0 poweRlaw_0.70.6 purrr_0.3.4 cachem_1.0.5 mime_0.10
[86] xtable_1.8-4 restfulr_0.0.13 pracma_2.3.3 later_1.2.0 viridisLite_0.4.0
[91] GenomicAlignments_1.28.0 AnnotationDbi_1.54.1 memoise_2.0.0 ellipsis_0.3.2
I get the error:
> scanFa(grcm39_genome, peak)
Error in value[[3L]](cond) : record 1 (AY172335.1:71-372) failed
file: 2021-02-22_sciatac/GRCm39.fna
when using a pre-built index (which looks valid). If I move the pre-built index and let RSamtools build it, the index is similarly invalid-looking (very high constant offset):
CM000994.3 195154279 82 80 81
GL456210.1 169725 197593918 80 81
GL456211.1 241735 197765893 80 81
GL456212.1 153618 198010778 80 81
GL456221.1 206961 198166445 80 81
MU069434.1 8412 198376122 80 81
GL456239.1 40056 198384768 80 81
CM000995.3 181755017 198425407 80 81
CM000996.3 159745316 382452444 80 81
CM000997.3 156860686 544194659 80 81
JH584295.1 1976 703016227 80 81
CM000998.3 151758149 703018310 80 81
JH584296.1 199368 856673564 80 81
JH584297.1 205776 856875553 80 81
JH584298.1 184189 857084030 80 81
GL456354.1 195993 857270650 80 81
JH584299.1 953012 857469221 80 81
CM000999.3 149588044 858434228 80 81
CM001000.3 144995196 1009892205 80 81
GL456219.1 175968 1156699969 80 81
CM001001.3 130127694 1156878219 80 81
CM001002.3 124359700 1288632592 80 81
CM001003.3 130530862 1414546872 80 81
CM001004.3 121973369 1546709453 80 81
CM001005.3 120092757 1670207573 80 81
CM001006.3 120883175 1791801573 80 81
CM001007.3 125139656 1914195871 80 81
CM001008.3 104073951 2040899856 80 81
CM001009.3 98008968 2146274815 80 81
CM001010.3 95294699 18446744071660093299 80 81
CM001011.3 90720763 18446744071756579265 80 81
CM001012.3 61420004 18446744071848434121 80 81
CM001013.3 169476592 18446744071910621958 80 81
GL456233.2 559103 18446744072082217136 80 81
CM001014.3 91455967 18446744072082783310 80 81
JH584300.1 182347 18446744072175382599 80 81
JH584301.1 259875 18446744072175567348 80 81
JH584302.1 155838 18446744072175830594 80 81
JH584303.1 158099 18446744072175988502 80 81
GL456367.1 42057 18446744072176148684 80 81
GL456378.1 31602 18446744072176191373 80 81
GL456381.1 25871 18446744072176223477 80 81
GL456382.1 23158 18446744072176249778 80 81
GL456383.1 38659 18446744072176273332 80 81
GL456385.1 35240 18446744072176312581 80 81
GL456390.1 24668 18446744072176348368 80 81
GL456392.1 23629 18446744072176373452 80 81
GL456394.1 24323 18446744072176397484 80 81
GL456359.1 22974 18446744072176422219 80 81
GL456360.1 31704 18446744072176445588 80 81
GL456396.1 21240 18446744072176477796 80 81
GL456372.1 28664 18446744072176499409 80 81
GL456387.1 24685 18446744072176528539 80 81
GL456389.1 28772 18446744072176553640 80 81
GL456370.1 26764 18446744072176582879 80 81
GL456379.1 72385 18446744072176610085 80 81
GL456366.1 47073 18446744072176683482 80 81
GL456368.1 20208 18446744072176731251 80 81
JH584304.1 114452 18446744072176751819 80 81
MU069435.1 31129 18446744072176867809 80 81
AY172335.1 16299 18446744072176899400 80 81
This is for the mouse GRCm39 genome.
If I had to guess, I think we are overflowing a variable / incorrectly sign-extending a smaller integer into a 64-bit integer. The given is 0b1111111111111111111111111111111110100100101001011001010101001000
in binary, e.g. the 32 bit number 0b10100100101001011001010101001000
sign-extended into a 64 bit integer.
This is especially true because the last offset that worked is 2,146,274,815
, suspiciously less than the maximum of an int32 ( 2,147,483,647
)
Specifically in some C++ I've done in the past, Windows has this weird thing about treating long's as 32 bit numbers instead of 64 bit numbers. There might be a long hanging out somewhere instead of the bit-labelled types"uint32_t" and so on.
Hi, I am using Rsamtools 2.0.2, and I have faced an issue on reading sequences from fasta files.
My issue is similar to https://github.com/Bioconductor/Rsamtools/issues/5 , but not exactly. When I try to "getSeq" from fasta file I get an error message. This is my script:
I get this: Error in value[3L] : record 1 (chr1:1-5) failed file: Homo_sapiens.GRCh37.dna.toplevel.fa
If I run it in Windows, I get this: [E::faidx_fetch_seq2] Failed to retrieve block. (Seeking in a compressed, .gzi unindexed, file?) Error in value[3L] : record 1 (chr1:1-5) failed file: Homo_sapiens.GRCh37.dna.toplevel.fa
The reference files are standard ensembl reference files and are downloaded from ftp://ftp.ensembl.org/pub/grch37/release-87/fasta/homo_sapiens/dna .
Thank you for your help. Farshad