LTLA / scRNAseq

Clone of the Bioconductor repository for the scRNAseq package.
http://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html
24 stars 12 forks source link

extreme value detection event with new PaulHSCData, not seen in legacy call #44

Open vjcitn opened 8 months ago

vjcitn commented 8 months ago
> z = PaulHSCData(legacy=FALSE)
The value -2^31 was detected in the dataset.
This has been converted to NA within R.
The value -2^31 was detected in the dataset.
This has been converted to NA within R.

seen on windows and linux

> sessionInfo()
R Under development (unstable) (2024-01-10 r85797)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/vincent/R-4-4-dist/lib/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] scRNAseq_2.17.1             SingleCellExperiment_1.25.0
 [3] SummarizedExperiment_1.33.3 Biobase_2.63.0             
 [5] GenomicRanges_1.55.3        GenomeInfoDb_1.39.6        
 [7] IRanges_2.37.1              S4Vectors_0.41.3           
 [9] BiocGenerics_0.49.1         MatrixGenerics_1.15.0      
[11] matrixStats_1.2.0           rmarkdown_2.25             

loaded via a namespace (and not attached):
 [1] DBI_1.2.2                bitops_1.0-7             httr2_1.0.0             
 [4] biomaRt_2.59.1           rlang_1.1.3              magrittr_2.0.3          
 [7] gypsum_0.99.9            compiler_4.4.0           RSQLite_2.3.5           
[10] GenomicFeatures_1.55.3   png_0.1-8                vctrs_0.6.5             
[13] stringr_1.5.1            ProtGenerics_1.35.2      pkgconfig_2.0.3         
[16] crayon_1.5.2             fastmap_1.1.1            dbplyr_2.4.0            
[19] XVector_0.43.1           utf8_1.2.4               Rsamtools_2.19.3        
[22] bit_4.0.5                xfun_0.42                aws.s3_0.3.21           
[25] zlibbioc_1.49.0          cachem_1.0.8             jsonlite_1.8.8          
[28] progress_1.2.3           blob_1.2.4               rhdf5filters_1.15.2     
[31] DelayedArray_0.29.7      Rhdf5lib_1.25.1          BiocParallel_1.37.0     
[34] parallel_4.4.0           prettyunits_1.2.0        R6_2.5.1                
[37] stringi_1.8.3            rtracklayer_1.63.0       Rcpp_1.0.12             
[40] knitr_1.45               base64enc_0.1-3          Matrix_1.6-4            
[43] tidyselect_1.2.0         abind_1.4-5              yaml_2.3.8              
[46] codetools_0.2-19         curl_5.2.0               lattice_0.22-5          
[49] alabaster.sce_1.3.3      tibble_3.2.1             KEGGREST_1.43.0         
[52] evaluate_0.23            BiocFileCache_2.11.1     alabaster.schemas_1.3.1 
[55] xml2_1.3.6               ExperimentHub_2.11.1     Biostrings_2.71.2       
[58] pillar_1.9.0             BiocManager_1.30.22      filelock_1.0.3          
[61] generics_0.1.3           startup_0.21.0           RCurl_1.98-1.14         
[64] BiocVersion_3.19.1       ensembldb_2.27.1         hms_1.1.3               
[67] alabaster.base_1.3.20    alabaster.ranges_1.3.3   glue_1.7.0              
[70] alabaster.matrix_1.3.12  lazyeval_0.2.2           tools_4.4.0             
[73] AnnotationHub_3.11.1     BiocIO_1.13.0            GenomicAlignments_1.39.4
[76] XML_3.99-0.16.1          rhdf5_2.47.4             grid_4.4.0              
[79] AnnotationDbi_1.65.2     GenomeInfoDbData_1.2.11  HDF5Array_1.31.5        
[82] restfulr_0.0.15          cli_3.6.2                rappdirs_0.3.3          
[85] fansi_1.0.6              S4Arrays_1.3.4           dplyr_1.1.4             
[88] AnnotationFilter_1.27.0  alabaster.se_1.3.4       digest_0.6.34           
[91] SparseArray_1.3.4        rjson_0.2.21             memoise_2.0.1           
[94] htmltools_0.5.7          lifecycle_1.0.4          httr_1.4.7              
[97] aws.signature_0.6.0      bit64_4.0.5             
> 
LTLA commented 8 months ago

This isn't a problem in and of itself, rhdf5 is just producing unnecessary messages here.

The cause is our decision to store NA_integer_ (usually from the colData) as -2^31 within the HDF5 file. rhdf5 complains because it thinks that we actually wanted to store an integer -2^31, and that is impossible to represent within an integer vector when reading the file back into R.

The solution is just to silence rhdf5, possibly by providing an option to not emit these messages if the caller knows what they're doing with respect to these integer NAs. @grimbough?

hpages commented 8 months ago

I think that the message can be silenced by setting the rhdf5-NA.OK on the dataset. This is something that high-level function rhdf5::h5write() does automatically for you:

library(rhdf5)

m <- matrix(c(1:11,NA), ncol=3)

h5write(m, "test.h5", "m")

h5readAttributes("test.h5", "m")
# $`rhdf5-NA.OK`
# [1] 1

h5read("test.h5", "m")
#      [,1] [,2] [,3]
# [1,]    1    5    9
# [2,]    2    6   10
# [3,]    3    7   11
# [4,]    4    8   NA

However, lower-level function rhdf5::H5Dwrite() does not set this attribute:

h5createDataset("test.h5", "m2", dim(m), storage.mode="integer")
fid <- H5Fopen("test.h5")
did <- H5Dopen(fid, name="m2")
H5Dwrite(did, m)
H5Dclose(did)
H5Fclose(fid)

h5readAttributes("test.h5", "m2")
# list()

h5read("test.h5", "m2")
# The value -2^31 was detected in the dataset.
# This has been converted to NA within R.
#      [,1] [,2] [,3]
# [1,]    1    5    9
# [2,]    2    6   10
# [3,]    3    7   11
# [4,]    4    8   NA

So it looks like the fix is to simply add the rhdf5-NA.OK attribute to these datasets.

LTLA commented 8 months ago

Yes, if the dataset was both generated and read by rhdf5 only. However, there are clients in other frameworks (e.g., Python, Javascript) that are writing these files, and it is a bit odd to ask them to insert an rhdf5-NA.ok attribute just to silence rhdf5.

hpages commented 8 months ago

Using -2^31 to represent a missing value must be conveyed one way or another, and that "special meaning" should be part of the data itself. Otherwise how are your Python or Java clients going to know about the real meaning of -2^31 here?

You could argue that the name of the attrribute (rhdf5-NA.OK) is a little bit unfortunate though, as it suggests that this is an rhdf5 thing. But it's not. This is essential language-agnostic metadata.

LTLA commented 8 months ago

We have a separate placeholder attribute that provides more flexibility for encoding missingness, see here. I don't expect rhdf5 to respect this, I just want it to be quiet while I handle things in alabaster.base.

hpages commented 8 months ago

How about using suppressMessages(h5read(...)) or suppressMessages(H5Dread(...)) in alabaster.base then? Since ArtifactDB datasets follow a strict policy, the risk of hiding other critical messages from the end user seems very very low.

LTLA commented 8 months ago

Yeah, I guess. Absent any other solution, suppressMessages would be the way. I don't like wholesale suppression but currently there isn't another solution.

... which is a shame. If H5Dread is intended to be a low-level function, it should just shut up and give me the data without complaints. For example, H5Dwrite doesn't try to do anything extra with NAs; for symmetry purposes if nothing else, H5Dread should just be similarly NA-unaware.

NA checks should be shifted to a higher level function - h5read, perhaps, given that h5write is also responsible for adding the rhdf5-NA.OK attribute.

hpages commented 8 months ago

Agreed. Symmetry is important.

grimbough commented 8 months ago

I don't think I have much more to add to the context here. @hpages summarise the intent behind that message well; it's to inform someone reading non-R generated data that INT_MIN will be represented as NA in R. It's just a message, not even a warning, and suppresMessages() would be my immediate suggestion.

You could argue that the name of the attrribute (rhdf5-NA.OK) is a little bit unfortunate though, as it suggests that this is an rhdf5 thing. But it's not. This is essential language-agnostic metadata.

I think I choose the name to indicate the specific software that added the attribute, in the case that someone recieved data written with rhdf5 and wanted to know what the attriubte meant. Hopefully a google would find something relevant. Perhaps it could have been R-NA.OK or similar, but given tools like hdf5r don't do this I wanted to make the source somewhat explicit. I had the vision that the may be additional metadata added over time and the could all be stored with the rhdf5- naming convention, but that hasn't come to pass.

NA checks should be shifted to a higher level function - h5read, perhaps, given that h5write is also responsible for adding the rhdf5-NA.OK attribute.

This I can get behind. I agree that the symmetry between the various h5 functions and (likewise H5<Xfoo>) is desireable. I'm pretty sure the check for printing that message is already in a seperate function, so I can probably just move where it is called.

grimbough commented 8 months ago

The printing of these messages has now been moved h5read as of rhdf5 2.47.6.

hpages commented 8 months ago

Thanks @grimbough

Also no more messages when doing PaulHSCData(legacy=FALSE) with scRNAseq 2.17.5. Thanks @LTLA!

@vjcitn This can be closed.