NVIDIA-Merlin / models

Merlin Models is a collection of deep learning recommender system model reference implementations
https://nvidia-merlin.github.io/models/main/index.html
Apache License 2.0
262 stars 50 forks

[QST] Data loader with EmbeddingOperator using pretrained embeddings is very slow #1244

Open CarloNicolini opened 3 months ago

CarloNicolini commented 3 months ago

❓ Questions & Help

I am experiencing a large performance degradation in the Loader when adding a transform with EmbeddingOperator, which looks up rows in a pretrained-embeddings NumPy array. I have been following the approach shown in this tutorial notebook.

Without the transforms argument the entire dataset is consumed in 6 seconds, while with the pretrained-embeddings lookup it takes almost 40 minutes. My "validation.parquet" is a small NVTabular dataset with 16 partitions, totalling almost 200 MB. With the transform enabled I see very low CPU and GPU utilization (neither exceeds 6%) and close to zero GPU memory consumption. It seems very strange that simply reading batch_size rows from a NumPy array takes that long, even accounting for the transfer to GPU.
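For scale, the raw NumPy gather that such a lookup has to perform per batch can be timed in isolation. This is a standalone sketch (no Merlin involved) using the same array shape and batch size as the example below; it only demonstrates that the fancy indexing itself is cheap:

```python
import time
import numpy as np

# Same shape as in the reproducer: 1M rows x 2 float32 values, ~8 MB total.
pretrained_array = np.zeros((1_000_000, 2), dtype=np.float32)

# Simulate one batch worth of lookups: 4096 random ids.
rng = np.random.default_rng(0)
indices = rng.integers(0, 1_000_000, size=4096)

start = time.perf_counter()
for _ in range(1000):
    # The per-batch work the embedding lookup boils down to on the CPU side.
    batch_embeddings = pretrained_array[indices]
elapsed = time.perf_counter() - start

# 1000 gathers of 4096 rows each finish in well under a second on a typical
# CPU, so the indexing alone cannot account for a 40-minute run.
print(f"{elapsed * 1e3:.2f} ms for 1000 gathers, shape {batch_embeddings.shape}")
```

If this runs in milliseconds on your machine too, the slowdown must come from somewhere else in the transform pipeline (per-batch host/device transfers, dtype conversions, or Python-level overhead), not from the gather itself.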

Details

Here is a minimal working example that reproduces this degradation.

from __future__ import annotations

from pathlib import Path

import numpy as np
from merlin.dataloader.ops.embeddings import EmbeddingOperator
from merlin.io.dataset import Dataset
from merlin.loader.tensorflow import Loader
from tqdm.auto import tqdm

def test_pretrained_loader():
    data_path = Path("validation.parquet")
    X = Dataset(data_path, engine="parquet")
    # Dummy embedding table: one 2-dim float32 vector per possible id.
    pretrained_array = np.zeros((1_000_000, 2), dtype=np.float32)

    loader = Loader(
        X,
        batch_size=4096,
        shuffle=True,
        transforms=[
            # Look up each batch's "recruitment_id" column in the array
            # and attach the result as the "embeddings" feature.
            EmbeddingOperator(
                pretrained_array,
                lookup_key="recruitment_id",
                embedding_name="embeddings",
            )
        ],
        device="gpu",
    )

    # Consume the whole dataset: ~40 minutes with the transform enabled,
    # ~6 seconds without it.
    for batch in tqdm(loader, desc="Iterating batches..."):
        pass

if __name__ == "__main__":
    test_pretrained_loader()
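To find out where the time actually goes, the iteration loop can be profiled with the standard library. The sketch below uses a dummy workload so it is self-contained; when profiling for real, pass the Loader instance to `consume` instead:

```python
import cProfile
import io
import pstats

def consume(iterable):
    # Stand-in for the Loader iteration loop; swap in the actual
    # Loader instance (e.g. consume(loader)) when profiling for real.
    for _ in iterable:
        pass

profiler = cProfile.Profile()
profiler.enable()
consume(range(1_000_000))  # dummy workload for illustration
profiler.disable()

# Render the top functions by cumulative time into a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(20)
report = stream.getvalue()
print(report)
```

With the real loader, the top entries of the report should make it clear whether the time is spent in the embedding operator, in dataframe conversions, or in host/device copies.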

Question

Is this behaviour intended? What are the possible bottlenecks? Would something like data prefetching or asynchronous loading be applicable here?
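On the prefetching question: even without Loader support for it, any batch iterator can be wrapped in a generic background-thread prefetcher. This is a minimal sketch (not a Merlin API); it only helps if the producer's work can actually overlap with the consumer's, and it does not forward producer exceptions:

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the source iterable

def prefetch(iterable, buffer_size=4):
    """Yield items from `iterable`, producing them in a background thread.

    While the consumer processes batch N, the producer thread is already
    preparing batch N+1, up to `buffer_size` batches ahead.
    """
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(_SENTINEL)

    thread = threading.Thread(target=producer, daemon=True)
    thread.start()
    while (item := q.get()) is not _SENTINEL:
        yield item
    thread.join()

# Demo with a plain range; a real batch iterator works the same way.
batches = list(prefetch(range(10)))
print(batches)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Note that a wrapper like this cannot fix a per-batch cost that dominates both sides; it only hides latency when CPU-side batch preparation and GPU-side consumption can proceed concurrently.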
