JuliaML / MLUtils.jl

Utilities and abstractions for Machine Learning tasks

Un-documented behaviour of `splitobs(...; at=1)` #166

Open mcabbott opened 11 months ago

mcabbott commented 11 months ago

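The keyword `at` is described as a proportion, but it silently behaves quite differently when given an integer: as the two calls below show, `at=1.0` means "100% of the observations" while `at=1` means "the first 1 observation". IMO it would be clearest if these had distinct names, but if both are called `at`, the two paths should both be clearly documented.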

julia> splitobs(100, at=1.0)
(1:100, 101:100)

julia> splitobs(100, at=1)
(1:1, 2:100)
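
To make the two code paths explicit, here is a minimal sketch that reproduces the outputs above by dispatching on the type of `at`. The name `split_sketch` is hypothetical, and the rounding and bounds checks are assumptions; this is not the MLUtils.jl implementation, only an illustration of the behaviour observed in this issue.

```julia
# Hypothetical stand-in for `splitobs(n; at)`. NOT the MLUtils.jl source;
# it only mirrors the two behaviours observed in this issue.

# A floating-point `at` is read as a proportion of the n observations.
function split_sketch(n::Int, at::AbstractFloat)
    0 <= at <= 1 || throw(ArgumentError("a proportion must lie in [0, 1]"))
    k = round(Int, n * at)          # assumed rounding rule
    return (1:k, (k + 1):n)
end

# An integer `at` is read as an absolute number of observations.
function split_sketch(n::Int, at::Integer)
    0 <= at <= n || throw(ArgumentError("a count must lie in 0:n"))
    return (1:at, (at + 1):n)
end

split_sketch(100, 1.0)   # proportion path: everything goes to the first subset
split_sketch(100, 1)     # count path: exactly one observation in the first subset
```

Because the only thing separating the two methods is whether the literal has a decimal point, `at=1` and `at=1.0` are easy to confuse at the call site, which is the reason for asking that both be documented.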

help?> splitobs
search: splitobs splitext splitdir split split_rest splitpath splitdrive splice! splat rsplit

  splitobs(n::Int; at) -> Tuple

  Compute the indices for two or more disjoint subsets of the range 1:n with splits given by at.

  Examples
  ≡≡≡≡≡≡≡≡

  julia> splitobs(100, at=0.7)
  (1:70, 71:100)

  julia> splitobs(100, at=(0.1, 0.4))
  (1:10, 11:50, 51:100)

  ────────────────────────────────────────────────────────────────────────────────────────────────

  splitobs(data; at, shuffle=false) -> Tuple

  Split the data into multiple subsets proportional to the value(s) of at.

  If shuffle=true, randomly permute the observations before splitting.

  Supports any datatype implementing the numobs and getobs interfaces.

  Examples
  ≡≡≡≡≡≡≡≡

  # A 70%-30% split
  train, test = splitobs(X, at=0.7)

  # A 50%-30%-20% split
  train, val, test = splitobs(X, at=(0.5, 0.3))

  # A 70%-30% split with multiple arrays and shuffling
  train, test = splitobs((X, y), at=0.7, shuffle=true)
  Xtrain, Ytrain = train
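
For comparison, the tuple form in the docstring above treats each entry of `at` as the size of one subset, with the remainder going to the final subset (so `at=(0.1, 0.4)` yields sizes 10, 40 and 50). A rough sketch of that arithmetic follows; `split_fractions` is a hypothetical helper, and the rounding rule and the sum check are assumptions rather than anything taken from MLUtils.jl.

```julia
# Hypothetical helper, not part of MLUtils.jl: turn per-subset fractions
# into consecutive index ranges over 1:n.
function split_fractions(n::Int, at::Tuple{Vararg{AbstractFloat}})
    sum(at) <= 1 || throw(ArgumentError("fractions must sum to at most 1"))
    ranges = UnitRange{Int}[]
    start = 1
    for frac in at
        len = round(Int, n * frac)              # assumed rounding rule
        push!(ranges, start:(start + len - 1))  # subset of `len` observations
        start += len
    end
    push!(ranges, start:n)                      # remainder forms the last subset
    return Tuple(ranges)
end

split_fractions(100, (0.1, 0.4))
split_fractions(100, (0.5, 0.3))
```

Nothing in this path accepts integers, so a docstring that only shows fractional examples gives no hint that an integer `at` is interpreted as a count.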