dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.07k stars 8.7k forks source link

Don't construct a column matrix when there's no missing value #10372

Open andrew-esteban-imc opened 3 months ago

andrew-esteban-imc commented 3 months ago

I've spotted that there's been a FIXME in column_matrix.h since 2022 that constructs a column matrix for a DMatrix even when there's no missing value:

// all columns are dense column and has no missing value
  // FIXME(jiamingy): We don't need a column matrix if there's no missing value.
  template <typename RowBinIdxT>
  void SetIndexNoMissing(bst_idx_t base_rowid, RowBinIdxT const* row_index, const size_t n_samples,
                         const size_t n_features, int32_t n_threads) {
    missing_.GrowTo(feature_offsets_[n_features], false);

    DispatchBinType(bins_type_size_, [&](auto t) {
      using ColumnBinT = decltype(t);
      auto column_index = Span<ColumnBinT>{reinterpret_cast<ColumnBinT*>(index_.data()),
                                           static_cast<size_t>(index_.size() / sizeof(ColumnBinT))};
      ParallelFor(n_samples, n_threads, [&](auto rid) {
        rid += base_rowid;
        const size_t ibegin = rid * n_features;
        const size_t iend = (rid + 1) * n_features;
        for (size_t i = ibegin, j = 0; i < iend; ++i, ++j) {
          const size_t idx = feature_offsets_[j];
          // No need to add offset, as row index is compressed and stores the local index
          column_index[idx + rid] = row_index[i];
        }
      });
    });
  }

In testing, this step seems to make up 30% of the time building a Quantile DMatrix, as well as contributes a decently large amount of memory consumption.

I'm very much not a CPP expert, so may be wrong in my understanding, but based on the presence of a FIXME (@trivialfis), I think a decently large performance boost can come from making this change.

trivialfis commented 3 months ago

Thank you for reminding me of the fixme. Removing the column matrix requires changing other places as well since they expect a column matrix to be present. I will work on it after sorting out other priorities.