Alexander-Barth / NCDatasets.jl

Load and create NetCDF files in Julia
MIT License

_FillValue needed when saving undefined values #228

Closed ryofurue closed 9 months ago

ryofurue commented 1 year ago

Describe the bug

In some cases, NCDatasets writes its own "missing" values without setting _FillValue. A netCDF file created like that doesn't work correctly. (A workaround is to explicitly set _FillValue when you create the file.)

To Reproduce

using NCDatasets
using Plots

tvals = collect(map(Float64, 0:10))
var = collect(map(Float64, 200:10:300))

NCDataset("tmp.nc", "c") do ds
  defDim(ds,"time", Inf)
  defVar(ds, "time", tvals, ("time",))
  v = defVar(ds, "var", Float64, ("time",))
  v[3] = var[3] # <- Skipping other time steps
end

NCDataset("tmp.nc", "r") do ds
  v = copy(ds["var"])
  println(typeof(v))
  plot(tvals, v)  # plots 1e37 for missing values.
end
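
The workaround mentioned in the description can be sketched as follows. This is a minimal example, assuming the `fillvalue` keyword of `defVar`; the temporary file name and the fill value `-9999.0` are arbitrary choices for illustration and are not taken from the report.

```julia
using NCDatasets

fname = tempname() * ".nc"   # throwaway file name (illustrative)

NCDataset(fname, "c") do ds
    defDim(ds, "time", Inf)
    # Declaring the fill value up front writes the _FillValue attribute,
    # so elements that are never assigned are read back as `missing`.
    v = defVar(ds, "var", Float64, ("time",); fillvalue = -9999.0)
    v[3] = 220.0   # time steps 1 and 2 are skipped on purpose
end

# Read the variable back; the do-block returns its last expression.
vals = NCDataset(fname, "r") do ds
    ds["var"][:]
end

println(typeof(vals))
println(vals)
```

With the attribute present, the skipped time steps should come back as `missing` rather than as the library's default fill number.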

Environment

Full output

versioninfo():

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Users/furue/.julia/artifacts/2a86ef020f132332b2f4be2fb40912cc7df5da29/lib:/Users/furue/.julia/artifacts/b376941014acf8e5501996cdf9932036cfb3bb71/lib:/Users/furue/.julia/artifacts/5b338c8fa90c05e6faea86e54d2996cca76cfbbe/lib:/Users/furue/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/lib/julia:/Users/furue/.julia/artifacts/f987f10c94bc56fe3589d9b30a6e2f419402a31c/lib:/Users/furue/.julia/artifacts/9410bad2635eda2239b4a72ba4316c4aa8f5b76e/lib:/Users/furue/.julia/artifacts/c08186c25e5525a58ec149de9bdd30703c0464c2/lib:/Users/furue/.julia/artifacts/09f03e36eda588bb3a9ba375a1987f65e31538db/lib:/Users/furue/.julia/artifacts/9a76a401f82e0e3cafce618fb8d2d5c307ab2836/lib:/Users/furue/.julia/artifacts/b917751a0a1532e56881e471e0f9b441460f2295/lib:/Users/furue/.julia/artifacts/c65e07e3da4f1bf519bc432389dbbd61df320457/lib:/Users/furue/.julia/artifacts/4ec62d729213a748d2300dd0832ebe8ed2292093/lib:/Users/furue/.julia/artifacts/e6b9fb44029423f5cd69e0cbbff25abcc4b32a8f/lib:/Users/furue/.julia/artifacts/b2108f561a8812e376eb80e71a24a3678a24d231/lib:/Users/furue/.julia/artifacts/ada2a202928dd4cb2fc4bd18c4efa9d5455ec742/lib:/Users/furue/.julia/artifacts/df3881e810714d6a09467fe85a6fde79385fe702/lib:/Users/furue/.julia/artifacts/3b3d0bcaf14a9b239a4f4dc20ef7b9e63030a47e/lib:/Users/furue/.julia/artifacts/abf161ac3d4df76ae74bbf5432b7e061b3876236/lib:/Users/furue/.julia/artifacts/4260cf51a368d8e305a5de3669e32539e1e6cc72/lib:/Users/furue/.julia/artifacts/fc7ba632b72ce7d852c1924aa2bbfe244a71c780/lib:/Users/furue/.julia/artifacts/413111420faa4e2aeaa383c075eaa213402d939c/lib:/Users/furue/.julia/artifacts/ca2831bf6edc5088aec5b329ea98364951d6cad0/lib:/Users/furue/.julia/artifacts/3fe6bf926e57cc4be598151cd40832221de2e894/lib:/Users/furue/.julia/artifacts/c325a23bc1f6521474cef5f634f18c8ab311bb02/lib:/Users/furue/.julia/artifacts/0db9c3f6cf936a0da49e2ba954ba3e10bed6ad72/lib:/Users/furue/.julia/artifacts/1a7e22e66b523d9cb884cf85c3ec065b5fb3e5c3/lib:/Users/furue/.julia
/artifacts/a7c8866165f6a2331113163bf4f6086b838f53dc/lib:/Users/furue/.julia/artifacts/4609432e7098d8434a7a4c7876dd5b9e09b2a5e7/lib:/Users/furue/.julia/artifacts/bf37190b92ac2fc3dd5e7073ff7ec7bbfd10343f/lib:/Users/furue/.julia/artifacts/7f4d1479db8bfb628aff3806c483e5fec617271a/lib:/Users/furue/.julia/artifacts/9472204d25ab69d52d571b650fdc9d562455ca4a/lib:/Users/furue/.julia/artifacts/b450526929615030746974fd622effa333c2c87a/lib:/Users/furue/.julia/artifacts/abb153d4516c6a0ee718ea8f8cde9466de07553c/lib:/Users/furue/.julia/artifacts/9a9f59eab237f7454fee1d6ab112a254032540b7/lib:/Users/furue/.julia/artifacts/2107e7bc404f11b178cb9724cb371ef704995727/lib:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtConcurrent.framework/Versions/A:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtCore.framework/Versions/A:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtDBus.framework/Versions/A:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtGui.framework/Versions/A:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtNetwork.framework/Versions/A:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtOpenGL.framework/Versions/A:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtPrintSupport.framework/Versions/A:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtSql.framework/Versions/A:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtTest.framework/Versions/A:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtWidgets.framework/Versions/A:/Users/furue/.julia/artifacts/ed8b0e21b28aaf4ed991d176af731a3194ed83c6/lib/QtXml.framework/Versions/A:/Users/furue/.julia/artifacts/c420634d1c4328f328d728d77aadf6116a9fabc8/lib:/Users/furue/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/bin/../lib/julia:/Users/furue/.julia/juliaup/julia-1
.9.3+0.aarch64.apple.darwin14/bin/../lib:
julia> using Pkg; Pkg.status(mode=PKGMODE_MANIFEST)
Status `~/.julia/environments/v1.9/Manifest.toml`
⌅ [47edcb42] ADTypes v0.1.6
  [c3fe647b] AbstractAlgebra v0.31.1
  [621f4979] AbstractFFTs v1.5.0
  [1520ce14] AbstractTrees v0.4.4
  [79e6a3ab] Adapt v3.6.2
  [4fba245c] ArrayInterface v7.4.11
  [0e736298] Bessels v0.2.8
  [e2ed5e7c] Bijections v0.1.4
  [d1d4a3ce] BitFlags v0.1.7
  [179af706] CFTime v0.1.2
  [49dc2e85] Calculus v0.5.1
  [d360d2e6] ChainRulesCore v1.16.0
  [944b1d66] CodecZlib v0.7.2
⌃ [35d6a980] ColorSchemes v3.22.0
  [3da002f7] ColorTypes v0.11.4
  [c3611d14] ColorVectorSpace v0.10.0
  [5ae59095] Colors v0.12.10
  [861a8166] Combinatorics v1.0.2
  [1fbeeb36] CommonDataModel v0.2.4
  [38540f10] CommonSolve v0.2.4
  [bbf7d656] CommonSubexpressions v0.3.0
⌅ [34da2185] Compat v3.46.2
  [b152e2b5] CompositeTypes v0.1.3
  [f0e56b4a] ConcurrentUtilities v2.2.1
  [8f4d0f93] Conda v1.9.1
  [187b0558] ConstructionBase v1.5.3
  [d38c429a] Contour v0.6.2
  [717857b8] DSP v0.7.8
  [9a962f9c] DataAPI v1.15.0
⌃ [864edb3b] DataStructures v0.18.14
  [e2d170a0] DataValueInterfaces v1.0.0
  [8bb1440f] DelimitedFiles v1.9.1
  [163ba53b] DiffResults v1.1.0
  [b552c78f] DiffRules v1.15.1
  [0703355e] DimensionalData v0.24.13
⌃ [3c3547ce] DiskArrays v0.3.14
  [31c24e10] Distributions v0.25.100
  [ffbed154] DocStringExtensions v0.9.3
  [5b8099bc] DomainSets v0.6.7
  [fa6b7ba4] DualNumbers v0.6.8
  [7c1d4256] DynamicPolynomials v0.5.2
  [340492b5] EndpointRanges v0.2.2
  [4e289a0a] EnumX v1.0.4
  [460bff9d] ExceptionUnwrapping v0.1.9
  [e2ba6199] ExprTools v0.1.10
  [411431e0] Extents v0.1.1
  [c87230d0] FFMPEG v0.4.1
  [7a1cc6ca] FFTW v1.7.1
  [bf96fef3] FieldMetadata v0.3.1
⌃ [1a297f60] FillArrays v1.5.0
  [53c48c17] FixedPointNumbers v0.8.4
  [4c728ea3] Flatten v0.4.3
  [1fa38f19] Format v1.3.2
  [59287772] Formatting v0.4.2
  [f6369f11] ForwardDiff v0.10.36
  [069b7b12] FunctionWrappers v1.1.3
  [77dc65aa] FunctionWrappersWrappers v0.1.3
  [46192b85] GPUArraysCore v0.1.5
  [28b8d3ca] GR v0.72.9
  [68eda718] GeoFormatTypes v0.4.1
  [cf35fbd7] GeoInterface v1.3.1
  [9a22fb26] GibbsSeaWater v0.1.2
  [c27321d9] Glob v1.3.1
  [42e2da0e] Grisu v1.0.2
  [0b43b601] Groebner v0.4.2
  [d5909c97] GroupsCore v0.4.0
  [cd3eb016] HTTP v1.9.14
  [34004b35] HypergeometricFunctions v0.3.23
  [615f187c] IfElse v0.1.1
  [18e54dd8] IntegerMathUtils v0.1.2
  [8197267c] IntervalSets v0.7.7
  [41ab1584] InvertedIndices v1.3.0
  [92d709cd] IrrationalConstants v0.2.2
  [c8e1da08] IterTools v1.8.0
  [82899510] IteratorInterfaceExtensions v1.0.0
  [1019f520] JLFzf v0.1.5
⌃ [692b3bcd] JLLWrappers v1.4.1
  [682c06a0] JSON v0.21.4
  [b964fa9f] LaTeXStrings v1.3.0
  [2ee39098] LabelledArrays v1.14.0
  [984bce1d] LambertW v0.4.6
  [23fbe1c1] Latexify v0.16.1
  [50d2b5c4] Lazy v0.15.1
⌃ [2ab3a3ac] LogExpFunctions v0.3.24
⌃ [e6f89c97] LoggingExtras v1.0.0
⌃ [1914dd2f] MacroTools v0.5.10
⌃ [20f20a25] MakieCore v0.6.4
  [dbb5928d] MappedArrays v0.4.2
  [739be429] MbedTLS v1.1.7
  [442fdcdd] Measures v0.3.2
  [e1d29d7a] Missings v1.1.0
  [5cb8414e] ModuleInterfaceTools v1.0.1
  [102ac46a] MultivariatePolynomials v0.5.1
⌃ [d8a4904e] MutableArithmetics v1.3.0
  [85f8d34a] NCDatasets v0.12.17
  [77ba4419] NaNMath v1.0.2
  [510215fc] Observables v0.5.4
  [6fe1bfb0] OffsetArrays v1.12.10
  [4d8831e6] OpenSSL v1.4.1
  [bac558e1] OrderedCollections v1.6.2
  [90014a1f] PDMats v0.11.17
  [d96e819e] Parameters v0.12.3
  [69de0a69] Parsers v2.7.2
  [b98c9c47] Pipe v1.3.0
  [ccf2f8ad] PlotThemes v3.1.0
  [995b91a9] PlotUtils v1.3.5
  [a03496cd] PlotlyBase v0.8.19
  [f2990250] PlotlyKaleido v2.1.0
  [91a5bcdd] Plots v1.38.17
⌅ [f27b6e38] Polynomials v3.2.13
  [d236fae5] PreallocationTools v0.4.12
⌃ [aea7be01] PrecompileTools v1.1.2
  [21216c6a] Preferences v1.4.0
  [27ebfcd6] Primes v0.5.4
⌃ [92933f4c] ProgressMeter v1.7.2
  [438e738f] PyCall v1.96.1
⌃ [d330b81b] PyPlot v2.11.1
  [1fd47b50] QuadGK v2.8.2
  [fb686558] RandomExtensions v0.4.3
⌃ [3a07dd3d] RangeHelpers v0.1.9
⌃ [a3a2b9e3] Rasters v0.8.0
  [3cdcf5f2] RecipesBase v1.3.4
  [01d81517] RecipesPipeline v0.6.12
  [731186ca] RecursiveArrayTools v2.38.7
  [189a3867] Reexport v1.2.2
  [05181044] RelocatableFolders v1.0.0
  [ae029012] Requires v1.3.0
  [79098fc4] Rmath v0.7.1
⌃ [f2b01f46] Roots v2.0.18
  [7e49a35a] RuntimeGeneratedFunctions v0.5.12
  [fdea26ae] SIMD v3.4.5
  [0bca4576] SciMLBase v1.94.0
  [c0aeaf25] SciMLOperators v0.3.6
  [6c6a2e73] Scratch v1.2.0
  [e8f3a9d7] SearchSortedNearest v0.1.1
  [efcf1570] Setfield v1.1.1
  [992d4aef] Showoff v1.0.3
  [777ac1f9] SimpleBufferStream v1.1.0
  [66db9d55] SnoopPrecompile v1.0.3
  [a2af1166] SortingAlgorithms v1.1.1
⌃ [276daf66] SpecialFunctions v2.3.0
  [90137ffa] StaticArrays v1.6.2
  [1e83bf80] StaticArraysCore v1.4.2
  [82ae8749] StatsAPI v1.6.0
  [2913bbd2] StatsBase v0.34.0
  [4c63d2b9] StatsFuns v1.3.0
  [b5087856] StrFormat v1.0.1
  [68059f60] StrLiterals v1.1.0
  [2efcf032] SymbolicIndexingInterface v0.2.2
  [d1185830] SymbolicUtils v1.2.0
  [0c5d862f] Symbolics v5.5.1
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.10.1
  [62fd8b95] TensorCore v0.1.1
  [a759f4b9] TimerOutputs v0.5.23
  [3bb67fe8] TranscodingStreams v0.9.13
  [a2a6695c] TreeViews v0.3.0
  [410a4b4d] Tricks v0.1.7
  [781d530d] TruncatedStacktraces v1.4.0
⌃ [5c2747f8] URIs v1.4.2
  [3a884ed6] UnPack v1.0.2
  [1cfade01] UnicodeFun v0.4.1
⌃ [1986cc42] Unitful v1.15.0
  [45397f5d] UnitfulLatexify v1.6.3
  [a7c27f48] Unityper v0.1.5
  [41fe7b60] Unzip v0.2.0
  [81def892] VersionParsing v1.3.0
  [6e34b625] Bzip2_jll v1.0.8+0
  [83423d85] Cairo_jll v1.16.1+1
  [2e619515] Expat_jll v2.5.0+0
⌃ [b22a6f82] FFMPEG_jll v4.4.2+2
  [f5851436] FFTW_jll v3.3.10+0
  [a3f928ae] Fontconfig_jll v2.13.93+0
  [d7e528f0] FreeType2_jll v2.13.1+0
  [559328eb] FriBidi_jll v1.0.10+0
  [0656b61e] GLFW_jll v3.3.8+0
⌃ [d2c73de3] GR_jll v0.72.9+0
  [78b55507] Gettext_jll v0.21.0+0
  [6727f6b2] GibbsSeaWater_jll v3.5.2+0
  [7746bdde] Glib_jll v2.74.0+2
  [3b182d85] Graphite2_jll v1.3.14+0
⌅ [0234f1f7] HDF5_jll v1.12.2+2
  [2e76f6c2] HarfBuzz_jll v2.8.1+1
⌃ [1d5cc7b8] IntelOpenMP_jll v2023.1.0+0
  [aacddb02] JpegTurbo_jll v2.1.91+0
  [f7e6163d] Kaleido_jll v0.2.1+0
  [c1c5ebd0] LAME_jll v3.100.1+0
  [88015f11] LERC_jll v3.0.0+1
  [1d63c593] LLVMOpenMP_jll v15.0.4+0
  [dd4b983a] LZO_jll v2.10.1+0
⌅ [e9f186c6] Libffi_jll v3.2.2+1
  [d4300ac3] Libgcrypt_jll v1.8.7+0
  [7e76a0d4] Libglvnd_jll v1.6.0+0
  [7add5ba3] Libgpg_error_jll v1.42.0+0
  [94ce4f54] Libiconv_jll v1.16.1+2
  [4b2f31a3] Libmount_jll v2.35.0+0
  [89763e89] Libtiff_jll v4.5.1+1
  [38a345b3] Libuuid_jll v2.36.0+0
⌃ [856f044c] MKL_jll v2023.1.0+0
⌃ [7243133f] NetCDF_jll v400.902.5+1
  [e7412a2a] Ogg_jll v1.3.5+1
⌅ [458c3c95] OpenSSL_jll v1.1.21+0
  [efe28fd5] OpenSpecFun_jll v0.5.5+0
  [91d4177d] Opus_jll v1.3.2+0
  [30392449] Pixman_jll v0.42.2+0
  [c0090381] Qt6Base_jll v6.4.2+3
  [f50d1b31] Rmath_jll v0.4.0+0
  [a2964d1f] Wayland_jll v1.21.0+0
  [2381bf8a] Wayland_protocols_jll v1.25.0+0
  [02c8fc9c] XML2_jll v2.10.3+0
  [aed1982a] XSLT_jll v1.1.34+0
⌃ [ffd25f8a] XZ_jll v5.4.3+1
  [4f6342f7] Xorg_libX11_jll v1.8.6+0
  [0c0b7dd1] Xorg_libXau_jll v1.0.11+0
  [935fb764] Xorg_libXcursor_jll v1.2.0+4
  [a3789734] Xorg_libXdmcp_jll v1.1.4+0
  [1082639a] Xorg_libXext_jll v1.3.4+4
  [d091e8ba] Xorg_libXfixes_jll v5.0.3+4
  [a51aa0fd] Xorg_libXi_jll v1.7.10+4
  [d1454406] Xorg_libXinerama_jll v1.1.4+4
  [ec84b674] Xorg_libXrandr_jll v1.5.2+4
  [ea2f1a96] Xorg_libXrender_jll v0.9.10+4
  [14d82f49] Xorg_libpthread_stubs_jll v0.1.1+0
  [c7cfdc94] Xorg_libxcb_jll v1.15.0+0
  [cc61e674] Xorg_libxkbfile_jll v1.1.2+0
  [12413925] Xorg_xcb_util_image_jll v0.4.0+1
  [2def613f] Xorg_xcb_util_jll v0.4.0+1
  [975044d2] Xorg_xcb_util_keysyms_jll v0.4.0+1
  [0d47668e] Xorg_xcb_util_renderutil_jll v0.3.9+1
  [c22f9ab0] Xorg_xcb_util_wm_jll v0.4.1+1
  [35661453] Xorg_xkbcomp_jll v1.4.6+0
  [33bec58e] Xorg_xkeyboard_config_jll v2.39.0+0
  [c5fb5394] Xorg_xtrans_jll v1.5.0+0
  [3161d3a3] Zstd_jll v1.5.5+0
⌅ [214eeab7] fzf_jll v0.29.0+0
  [a4ae2306] libaom_jll v3.4.0+0
  [0ac62f75] libass_jll v0.15.1+0
  [f638f0a6] libfdk_aac_jll v2.0.2+0
  [b53b4c65] libpng_jll v1.6.38+0
  [f27f6e37] libvorbis_jll v1.3.7+1
  [1270edf5] x264_jll v2021.5.5+0
  [dfaa095f] x265_jll v3.5.0+0
  [d8fb68d0] xkbcommon_jll v1.4.1+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.3
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.9.2
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics v1.9.0
  [4607b0f0] SuiteSparse
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.0.5+0
  [deac9b47] LibCURL_jll v7.84.0+0
  [29816b5a] LibSSH2_jll v1.10.2+0
  [c8ffd9c3] MbedTLS_jll v2.28.2+0
  [14a3606d] MozillaCACerts_jll v2022.10.11
  [4536629a] OpenBLAS_jll v0.3.21+4
  [05823500] OpenLibm_jll v0.8.1+0
  [efcefdf7] PCRE2_jll v10.42.0+0
  [bea87d4a] SuiteSparse_jll v5.10.1+6
  [83775a58] Zlib_jll v1.2.13+0
  [8e850b90] libblastrampoline_jll v5.8.0+0
  [8e850ede] nghttp2_jll v1.48.0+0
  [3f19e933] p7zip_jll v17.4.0+0
Info Packages marked with ⌃ and ⌅ have new versions available, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated -m`

Alexander-Barth commented 1 year ago

(A workaround is to explicitly set _FillValue when you create the file.)

That's the intended behaviour. If you have a variable with missing values, you have to declare a fill value. The 1e37 was inserted by the netcdf C library because the corresponding elements were not initialized. Consider the case in Julia when you have an array with non-initialized elements. They are not getting transformed to missings either.

ryofurue commented 1 year ago

That's the intended behaviour. If you have a variable with missing values, you have to declare a fill value.

That's not the case!

using NCDatasets
tvals = collect(map(Float64, 0:3))
var = [3.0, missing, 9.1, 4.2]
NCDataset("tmp.nc", "c") do ds
  defVar(ds, "time", tvals, ("time",))
  defVar(ds, "var", var,  ("time",))
end

NCDatasets is kind enough to set _FillValue for the user! It's much better than the underlying netCDF library! It's much kinder to the user.

The 1e37 was inserted by the netcdf C library because the corresponding elements were not initialized.

But the netcdf C library doesn't recognize the CF convention. If the CF convention were part of the netcdf C library, it should set _FillValue=1e37 for the variable, to avoid creating a non-CF file.

NCDatasets is not just a wrapper to the netcdf C library. It creates (or tries to create whenever possible) netCDF files that conform to the CF convention.

Consider the case in Julia when you have an array with non-initialized elements. They are not getting transformed to missings either.

What you are saying is: In your example, NCDatasets cannot set _FillValue for the user even if it wants to help the user. (I'm sure that you are not saying you do not want to help the user.)

But, in the example I initially posted, NCDatasets can help the user by automatically setting _FillValue = 1e37, can't it? Recall the example I show above in this message of mine, where NCDatasets helps the user by automatically converting missing to a floating point value and setting that value to _FillValue. Almost exactly the same treatment can be applied to my initial example.

I think NCDatasets should create a correct CF-conforming netCDF file whenever possible, with minimal effort (and knowledge) required from the user.

Alexander-Barth commented 1 year ago

In order to have type stable code, one must know at the call of defVar if the element type is Float64 or Union{Missing,Float64} (likewise for other types).

In the example you copied:

using NCDatasets
tvals = collect(map(Float64, 0:3))
var = [3.0, missing, 9.1, 4.2]
NCDataset("tmp.nc", "c") do ds
  defVar(ds, "time", tvals, ("time",))
  defVar(ds, "var", var,  ("time",))
end

defVar knows that it must set the fill value.

However, when you compare these two examples:

example 1.

using NCDatasets
tvals = collect(map(Float64, 0:2))
var = tvals

NCDataset("tmp.nc", "c") do ds
  defDim(ds,"time", Inf)
  defVar(ds, "time", tvals, ("time",))
  v = defVar(ds, "var", Float64, ("time",))
  v[3] = var[3] # <- Skipping other time steps
end

example 2.

using NCDatasets
tvals = collect(map(Float64, 0:2))
var = tvals

NCDataset("tmp.nc", "c") do ds
  defDim(ds,"time", Inf)
  defVar(ds, "time", tvals, ("time",))
  v = defVar(ds, "var", Float64, ("time",))
  v[3] = var[3] # <- Skipping other time steps
  v[1] = 1
  v[2] = 1
end

In example 1, clearly _FillValue must be set but not in example 2. The function defVar does not know whether the user will define all elements or not.

But the netcdf C library doesn't recognize the CF convention. If the CF convention were part of the netcdf C library, it should set _FillValue=1e37 for the variable, to avoid creating a non-CF file.

The only way I see to implement this is to keep track of all elements that have been written to and add the attribute _FillValue if necessary when the file is closed. But this could potentially use a lot of memory, which would make it unsuitable for cases where the arrays are larger than the available RAM, and it would slow down write operations with NCDatasets.
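
To make that cost concrete, here is a toy sketch of such bookkeeping in plain Julia. `WriteTracker` and `needs_fillvalue` are hypothetical names invented for this illustration; nothing like this exists in NCDatasets. Even at one bit per element, the mask grows with the variable, and every write has to update it.

```julia
# Hypothetical wrapper (not part of NCDatasets): stores the data plus a
# one-bit-per-element mask recording which elements have been written.
struct WriteTracker{T}
    data::Vector{T}
    written::BitVector
end

# Allocate uninitialized data and an all-false mask.
WriteTracker{T}(n::Integer) where {T} =
    WriteTracker{T}(Vector{T}(undef, n), falses(n))

# Every write also updates the mask; this is the per-write overhead.
function Base.setindex!(t::WriteTracker, value, i::Integer)
    t.data[i] = value
    t.written[i] = true
    return t
end

# Only known once all writes are done, e.g. when the file is closed.
needs_fillvalue(t::WriteTracker) = !all(t.written)

t = WriteTracker{Float64}(3)
t[3] = 2.0
println(needs_fillvalue(t))   # elements 1 and 2 were never written

t[1] = 1.0
t[2] = 1.0
println(needs_fillvalue(t))   # now every element has been written
```

The two `println` calls correspond to example 1 and example 2 above: the answer is only available after the last write.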

Setting _FillValue unconditionally would be quite annoying, as future users of these files would need to handle arrays with Union{Float64,Missing} even if there is no missing value.

ryofurue commented 1 year ago

Yes, what you say makes a lot of sense. I now see the problem better.

But,

Setting _FillValue unconditionally would be quite annoying, as future users of these files would need to handle arrays with Union{Float64,Missing} even if there is no missing value.

Would it really cause any practical problems? Who would be annoyed in what way?

need to handle arrays with Union{Float64,Missing} even if there is no missing value.

I don't get this point. In most use cases, you just don't care whether it's Union{Float64,Missing} or Float64 when there is no missing value. When there are missing values, you are glad that it's Union{Float64,Missing} :-)

To me, Union{Float64,Missing} is one of the greatest features of Julia.

So, I would say, set _FillValue = NaN to all floating-point variables unless the user opts out.

This is all consistent with the netcdf C library: It puts undefined values as 1e37. So, the CF convention should say that all variables shall have _FillValue = 1e37 by default. As long as one uses netCDF files, one should regard missing values as an integral part.

I notice that in this issue tracker, there is a problem with calling an external Python function with Union{Float64,Missing}. But that has nothing to do with NCDatasets. Missing is a part of the Julia language and so the writer of the python-calling function has to be prepared to deal with Missing to map Julia's Missing to Python's mask.

Alexander-Barth commented 1 year ago

Having an array of Union{Float64,Missing} requires the compiler to add special cases for missing elements, so it is typically slower (see below). You need 1 byte more per element and it is problematic on a GPU (https://github.com/Alexander-Barth/NCDatasets.jl/issues/132). Besides the issue with PyPlot and PythonPlot, it also makes interoperability with C libraries difficult. A Matrix{Float64} can be passed to a C function without copying, which is not the case for a Matrix{Union{Float64,Missing}}.

Union{Float64,Missing} has its place when there are actually missing values, but I think it would be heavy-handed to impose it in every case.

Additional point to consider: the CF convention does not allow _FillValue for coordinate variables (https://cfconventions.org/cf-conventions/cf-conventions.html#attribute-appendix). The user is free to rename a variable, so that it becomes a coordinate variable after the variable is defined in the NetCDF file.

julia> using Missings

julia> A = zeros(1000_0000);

julia> function addone!(A)
         for i in eachindex(A)
            A[i] += 1
         end
       end
addone! (generic function with 1 method)

julia> B = allowmissing(A);

julia> @code_typed addone!(A)
CodeInfo(
1 ── %1  = Base.arraysize(A, 1)::Int64
│    %2  = Base.slt_int(%1, 0)::Bool
│    %3  = Core.ifelse(%2, 0, %1)::Int64
│    %4  = Base.slt_int(%3, 1)::Bool
└───       goto #3 if not %4
2 ──       goto #4
3 ──       goto #4
4 ┄─ %8  = φ (#2 => true, #3 => false)::Bool
│    %9  = φ (#3 => 1)::Int64
│    %10 = φ (#3 => 1)::Int64
│    %11 = Base.not_int(%8)::Bool
└───       goto #10 if not %11
5 ┄─ %13 = φ (#4 => %9, #9 => %23)::Int64
│    %14 = φ (#4 => %10, #9 => %24)::Int64
│    %15 = Base.arrayref(true, A, %13)::Float64
│    %16 = Base.add_float(%15, 1.0)::Float64
│          Base.arrayset(true, A, %16, %13)::Vector{Float64}
│    %18 = (%14 === %3)::Bool
└───       goto #7 if not %18
6 ──       goto #8
7 ── %21 = Base.add_int(%14, 1)::Int64
└───       goto #8
8 ┄─ %23 = φ (#7 => %21)::Int64
│    %24 = φ (#7 => %21)::Int64
│    %25 = φ (#6 => true, #7 => false)::Bool
│    %26 = Base.not_int(%25)::Bool
└───       goto #10 if not %26
9 ──       goto #5
10 ┄       return nothing
) => Nothing

julia> @code_typed addone!(B)
CodeInfo(
1 ── %1  = Base.arraysize(A, 1)::Int64
│    %2  = Base.slt_int(%1, 0)::Bool
│    %3  = Core.ifelse(%2, 0, %1)::Int64
│    %4  = Base.slt_int(%3, 1)::Bool
└───       goto #3 if not %4
2 ──       goto #4
3 ──       goto #4
4 ┄─ %8  = φ (#2 => true, #3 => false)::Bool
│    %9  = φ (#3 => 1)::Int64
│    %10 = φ (#3 => 1)::Int64
│    %11 = Base.not_int(%8)::Bool
└───       goto #20 if not %11
5 ┄─ %13 = φ (#4 => %9, #19 => %42)::Int64
│    %14 = φ (#4 => %10, #19 => %43)::Int64
│    %15 = Base.arrayref(true, A, %13)::Union{Missing, Float64}
│    %16 = (isa)(%15, Missing)::Bool
└───       goto #7 if not %16
6 ──       goto #10
7 ── %19 = (isa)(%15, Float64)::Bool
└───       goto #9 if not %19
8 ── %21 = π (%15, Float64)
│    %22 = Base.add_float(%21, 1.0)::Float64
└───       goto #10
9 ──       Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
└───       unreachable
10 ┄ %26 = φ (#6 => true, #8 => false)::Bool
│    %27 = φ (#6 => false, #8 => true)::Bool
│    %28 = φ (#8 => %22)::Float64
└───       goto #12 if not %26
11 ─       Base.arrayset(true, A, $(QuoteNode(missing)), %13)::Vector{Union{Missing, Float64}}
└───       goto #15
12 ─       goto #14 if not %27
13 ─       Base.arrayset(true, A, %28, %13)::Vector{Union{Missing, Float64}}
└───       goto #15
14 ─       Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
└───       unreachable
15 ┄ %37 = (%14 === %3)::Bool
└───       goto #17 if not %37
16 ─       goto #18
17 ─ %40 = Base.add_int(%14, 1)::Int64
└───       goto #18
18 ┄ %42 = φ (#17 => %40)::Int64
│    %43 = φ (#17 => %40)::Int64
│    %44 = φ (#16 => true, #17 => false)::Bool
│    %45 = Base.not_int(%44)::Bool
└───       goto #20 if not %45
19 ─       goto #5
20 ┄       return nothing
) => Nothing

julia> @btime addone!(A)
  3.034 ms (0 allocations: 0 bytes)

julia> @btime addone!(B)
  9.824 ms (0 allocations: 0 bytes)

ryofurue commented 1 year ago

Having an array of Union{Float64,Missing} requires the compiler to add special cases for missing elements, so it is typically slower

I'm aware of that, but do you use a netCDF array in performance critical loops? I wouldn't.

I would copy a section of the array from the netCDF file into the main memory for performance:

v = copy(ds["myvar"][:, :, 3,:]) # v is Array{Union{Float64,Missing}}
# work on v[:,:,:] in the loop

So, if the performance degradation of Missing is really problematic for you, you would change the above code to

v = replace(ds["myvar"][:,:,3,:], missing=>NaN) # v is Array{Float64}
# work on v in the loop.

I mean, if you need performance, you copy the data from the netCDF file into a native array anyway. So, if the overhead of Missing really matters, you just replace missing with NaN when copying your data. The cost of this replacement is negligible compared to the read from the file into memory.

Or, do you really mean that a netCDF array is as fast as a native array and so people actually use netCDF arrays in performance critical loops without copying them into native arrays?

A Matrix{Float64} can be passed to a C function without copying which is not the case for a Matrix{Union{Float64,Missing}}.

But you have to copy the data from the netCDF file to a native array anyway. I mean, you have to copy ds["myvar"][:,:,:] into Array{Float64,3} and send the latter to the C function. Is that right? The copy is necessary whether the netCDF array has Missing or not. So, in a situation where Missing doesn't work, you copy into a Float64 array, not into a Union{Missing, Float64} array.

Additional point to consider: the CF convention does not allow _FillValue for coordinate variables . . .

I agree that that's a real problem. (You could scan the coordinate variables and delete _FillValue before saving it, but it's ugly, I admit.)

Alexander-Barth commented 1 year ago

I'm aware of that, but do you use a netCDF array in performance critical loops? I wouldn't.

Once you index a NCVariable.CFVariable (e.g. ds["var"][:,:]) you get a plain Julia array (directly suitable for number crunching) and no longer a NetCDF array. In your example the copy is unnecessary.

If you use only small arrays, then the need to copy would not matter to you. But for very large arrays (or systems with low memory), this can be an issue (I have had a past issue report about this).

Concerning the copy, assuming that you copy an array from a netcdf file (without fill value), currently the flow is:

input.nc -> Array{Float64,3} -> output.nc

With a fill value set it is:

input.nc -> Array{Float64,3} -> Array{Union{Float64,Missing},3} -> Array{Float64,3} -> output.nc

The netCDF C code can only deal with Array{Float64,3}.
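
The two read paths can be compared with a short sketch. It assumes that the `.var` property of a CF variable gives access to the raw stored values without the fill-value transformation; the file name and data values are made up for illustration.

```julia
using NCDatasets

fname = tempname() * ".nc"   # throwaway file name (illustrative)

NCDataset(fname, "c") do ds
    # A vector containing `missing` makes defVar set _FillValue itself.
    defVar(ds, "var", [3.0, missing, 9.1, 4.2], ("time",))
end

cf, raw = NCDataset(fname, "r") do ds
    # cf:  CF-transformed values, fill value mapped to `missing`
    # raw: values as stored on disk (assumed behaviour of `.var`)
    (ds["var"][:], ds["var"].var[:])
end

println(eltype(cf))
println(eltype(raw))

# Back to a plain Float64 array, using only Base functionality.
filled = coalesce.(cf, NaN)
println(eltype(filled))
```

The `coalesce.` step is the extra conversion that only appears when a fill value is set.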

Alexander-Barth commented 9 months ago

I am closing this issue because I think that the current behavior best serves most users.

ryofurue commented 9 months ago

Sorry for my silence on this issue. This message is just for discussion. I don't mean to re-open the issue.

If you use only small arrays, then the need to copy would not matter to you. But for very large arrays (or systems with low memory), this can be an issue (I have had a past issue report about this).

What do you mean by "this"? If the entire array doesn't fit into memory, you can't copy it, so you'll use the netCDF array directly without copying it into memory. That's what you are saying. Is that right?

But you don't want to do that in a tight loop, because a lot of time will then be spent reading pieces of the big array from the file.

So, it seems to me you are confusing these things:

  1. If you need to use the array in a tight loop, you copy it to the memory before entering the loop.
  2. If the entire data doesn't fit in the memory, you copy a slice into the memory and work on it in the loop.
  3. If performance doesn't matter to you, you use the netCDF array as is, in the loop.

In cases 1 & 2, you handle Missing when copying the (slice of the) array into memory.

In case 3, the performance degradation due to Missing doesn't matter. If you use a big netCDF array within the loop, the time spent to read the pieces from the file overwhelms the time to handle Missing. If you don't like the low performance, you have to go to method 1 or 2.

So, that was my argument. Performance hit due to Missing doesn't matter for a netCDF array. If it matters, you copy the data from the file and remove Missing when copying.