NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

from transformers4rec.torch.utils.data_utils import NVTabularDataLoader not working #637

Closed · sparta0000 closed this issue 1 year ago

sparta0000 commented 1 year ago

❓ Questions & Help

Details

From the tutorial notebook Session-based-recsys.ipynb, I tried executing the block below, but it fails with: ImportError: cannot import name 'NVTabularDataLoader' from 'transformers4rec.torch.utils.data_utils' (/usr/local/lib/python3.9/dist-packages/transformers4rec/torch/utils/data_utils.py)

Full code:

# Import NVTabular dependencies
from transformers4rec.torch.utils.data_utils import NVTabularDataLoader

x_cat_names, x_cont_names = ['product_id-list_seq'], []

# Dictionary mapping each sparse feature to its max sequence length
# (`sequence_length` is defined earlier in the notebook)
sparse_features_max = {
    fname: sequence_length
    for fname in x_cat_names + x_cont_names
}

# Define a `get_dataloader` function to call in the training loop
def get_dataloader(path, batch_size=32):
    # Build a dataloader for the given parquet path; `schema` comes from
    # the NVTabular workflow output
    return NVTabularDataLoader.from_schema(
        schema,
        path,
        batch_size,
        max_sequence_length=sequence_length,
        sparse_names=x_cat_names + x_cont_names,
        sparse_max=sparse_features_max,
    )

I also checked the transformers4rec.torch.utils.data_utils source code, and there is no NVTabularDataLoader in it; there is a MerlinDataLoader instead. However, after switching to that name, I get a further error while executing another block, named

Model finetuning and incremental evaluation

Error:

ModuleNotFoundError                       Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/merlin/models/utils/misc_utils.py in validate_dataset(paths_or_dataset, batch_size, buffer_size, engine, reader_kwargs)
    196     try:
--> 197         from nvtabular.io.dataset import Dataset
    198     except ImportError:

ModuleNotFoundError: No module named 'nvtabular.io.dataset'; 'nvtabular.io' is not a package

If anyone knows the solution, please help.

rnyak commented 1 year ago

@sparta0000 Which container version are you using, if you are using a Merlin docker image? If not, how did you install the Merlin libraries? Thanks.

sparta0000 commented 1 year ago

@rnyak I installed using this:

# Install the Merlin Framework
!pip install -U git+https://github.com/NVIDIA-Merlin/models.git
!pip install -U git+https://github.com/NVIDIA-Merlin/nvtabular.git
!pip install -U git+https://github.com/NVIDIA-Merlin/core.git
!pip install -U git+https://github.com/NVIDIA-Merlin/system.git
!pip install -U git+https://github.com/NVIDIA-Merlin/dataloader.git
!pip install -U git+https://github.com/NVIDIA-Merlin/Transformers4Rec.git
!pip install -U xgboost lightfm implicit

Thanks

rnyak commented 1 year ago

@sparta0000 Looks like you are installing from the development branch. Can you install the libs from the release-23.02 branch and test again, please? Thanks.

sparta0000 commented 1 year ago

@rnyak Tried, but same results:

# Install the Merlin Framework
!pip install -U git+https://github.com/NVIDIA-Merlin/models.git@release-23.02
!pip install -U git+https://github.com/NVIDIA-Merlin/nvtabular.git@release-23.02
!pip install -U git+https://github.com/NVIDIA-Merlin/core.git@release-23.02
!pip install -U git+https://github.com/NVIDIA-Merlin/system.git@release-23.02
!pip install -U git+https://github.com/NVIDIA-Merlin/dataloader.git@release-23.02
!pip install -U git+https://github.com/NVIDIA-Merlin/Transformers4Rec.git@release-23.02
!pip install -U xgboost lightfm implicit

Here's the source code for transformers4rec.torch.utils.data_utils: https://nvidia-merlin.github.io/Transformers4Rec/main/_modules/transformers4rec/torch/utils/data_utils.html

I don't see NVTabularDataLoader there. Any idea why?

rnyak commented 1 year ago

@sparta0000

you can replace that part with

from transformers4rec.torch.utils.data_utils import MerlinDataLoader

def get_dataloader(data_path, batch_size=128):
    loader = MerlinDataLoader.from_schema(
        schema,
        data_path,
        batch_size,
        max_sequence_length=sequence_length,
        shuffle=False,
    )
    return loader

You might also need to change the max values in the schema_tutorial.pb file. To do that, read the properties.embedding_sizes.cardinality value for each column in workflow.output_schema, then open schema_tutorial.pb and replace the max value for each feature with that cardinality.
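
For illustration, here is a minimal sketch of reading those cardinalities (assuming `workflow` is the fitted NVTabular workflow from the notebook; exact Schema attribute names may vary across merlin-core versions):

# Minimal sketch, assuming `workflow` is the fitted NVTabular workflow.
# Only categorical columns carry an "embedding_sizes" property; its
# "cardinality" is the value to copy into the matching max field in
# schema_tutorial.pb.
output_schema = workflow.output_schema
for name in output_schema.column_names:
    col = output_schema[name]
    sizes = col.properties.get("embedding_sizes")
    if sizes is not None:
        print(name, int(sizes["cardinality"]))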

rnyak commented 1 year ago

I am gonna update that notebook.

sparta0000 commented 1 year ago

@rnyak Thanks for the help, I will try these changes.

sparta0000 commented 1 year ago

@rnyak I tried to update the max values in the schema, but what should I put where properties.embedding_sizes.cardinality is NaN?

| name | dtype | properties.embedding_sizes.cardinality |
| --- | --- | --- |
| user_session | DType(name='int64', element_type=<ElementType... | 480784 |
| product_id-count | DType(name='int32', element_type=<ElementType... | 176393 |
| product_id-list_seq | DType(name='int64', element_type=<ElementType... | 176393 |
| category_code-list_seq | DType(name='int64', element_type=<ElementType... | 34092 |
| event_type-list_seq | DType(name='int64', element_type=<ElementType... | 6 |
| brand-list_seq | DType(name='int64', element_type=<ElementType... | 4191 |
| category_id-list_seq | DType(name='int64', element_type=<ElementType... | 1009 |
| et_dayofweek_sin-list_seq | DType(name='float64', element_type=<ElementTyp... | NaN |
| et_dayofweek_cos-list_seq | DType(name='float64', element_type=<ElementTyp... | NaN |
| price_log_norm-list_seq | DType(name='float64', element_type=<ElementTyp... | NaN |
| relative_price_to_avg_categ_id-list_seq | DType(name='float64', element_type=<ElementTyp... | NaN |
| product_recency_days_log_norm-list_seq | DType(name='float32', element_type=<ElementTyp... | NaN |
| day_index | DType(name='int64', element_type=<ElementType... | NaN |

sparta0000 commented 1 year ago


After updating that block as suggested above, I tried executing this block:

%%time
start_time_window_index = 1
final_time_window_index = 4
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data 
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))

    # Initialize dataloaders
    trainer.train_dataloader = get_dataloader(train_paths, train_args.per_device_train_batch_size)
    trainer.eval_dataloader = get_dataloader(eval_paths, train_args.per_device_eval_batch_size)

    # Train on the day related to time_index
    print('*' * 20)
    print("Launch training for day %s:" % time_index)
    print('*' * 20 + '\n')
    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step += 1

    # Evaluate on the following day
    train_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*'*20)
    print("Eval results for day %s are:\t" %time_index_eval)
    print('\n' + '*'*20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key]))) 
    wipe_memory()

But it gives this error:

ModuleNotFoundError                       Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/merlin/models/utils/misc_utils.py in validate_dataset(paths_or_dataset, batch_size, buffer_size, engine, reader_kwargs)
    196     try:
--> 197         from nvtabular.io.dataset import Dataset
    198     except ImportError:

ModuleNotFoundError: No module named 'nvtabular.io.dataset'; 'nvtabular.io' is not a package

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<timed exec> in <module>

4 frames
/usr/local/lib/python3.9/dist-packages/merlin/models/utils/misc_utils.py in validate_dataset(paths_or_dataset, batch_size, buffer_size, engine, reader_kwargs)
    197         from nvtabular.io.dataset import Dataset
    198     except ImportError:
--> 199         raise ValueError("NVTabular is necessary for this function, please install: " "nvtabular.")
    200 
    201     # TODO: put this in parent class and allow

ValueError: NVTabular is necessary for this function, please install: nvtabular.

I then checked import nvtabular; it imports successfully.

oliverholworthy commented 1 year ago

@sparta0000 The latest published version of Transformers4Rec does not yet depend on Merlin Models. The error about nvtabular.io.dataset should not show up if you have installed a published tagged version of the package.

Please check the version you have installed by running:

import transformers4rec
transformers4rec.__version__

This should print out a tagged version, for example the most recent release:

23.02.00

If instead you see something like 0.1.14+63.g70717bdfb that means you have a development version installed in your environment.

If you cloned Transformers4Rec using git, you can check which branch you're on with git branch --show-current.

If you used another method to get Transformers4Rec into your environment please share how you installed it so that we can help identify the correct installation instructions for this to help you and other people using this package.
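
As a side note (an assumption beyond what this thread states: transformers4rec is also published on PyPI), installing the published package directly is one way to avoid picking up a development branch:

# Install the published release from PyPI rather than a git branch,
# then confirm that a tagged version is reported
!pip install -U transformers4rec
import transformers4rec
print(transformers4rec.__version__)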

sparta0000 commented 1 year ago

@oliverholworthy Yes, my environment had reset to 22.12; I forgot to update it. Now it has 23.02 and this block executes. Thanks much!

rnyak commented 1 year ago

> @rnyak I tried to update the max values in the schema, but what should I put where properties.embedding_sizes.cardinality is NaN?
>
> | name | dtype | properties.embedding_sizes.cardinality |
> | --- | --- | --- |
> | user_session | DType(name='int64', element_type=<ElementType... | 480784 |
> | product_id-count | DType(name='int32', element_type=<ElementType... | 176393 |
> | product_id-list_seq | DType(name='int64', element_type=<ElementType... | 176393 |
> | category_code-list_seq | DType(name='int64', element_type=<ElementType... | 34092 |
> | event_type-list_seq | DType(name='int64', element_type=<ElementType... | 6 |
> | brand-list_seq | DType(name='int64', element_type=<ElementType... | 4191 |
> | category_id-list_seq | DType(name='int64', element_type=<ElementType... | 1009 |
> | et_dayofweek_sin-list_seq | DType(name='float64', element_type=<ElementTyp... | NaN |
> | et_dayofweek_cos-list_seq | DType(name='float64', element_type=<ElementTyp... | NaN |
> | price_log_norm-list_seq | DType(name='float64', element_type=<ElementTyp... | NaN |
> | relative_price_to_avg_categ_id-list_seq | DType(name='float64', element_type=<ElementTyp... | NaN |
> | product_recency_days_log_norm-list_seq | DType(name='float32', element_type=<ElementTyp... | NaN |
> | day_index | DType(name='int64', element_type=<ElementType... | NaN |

NaN means the column is not categorical, so no embedding table is created for it and its cardinality is not used in the schema file. Therefore, you don't need to change the max values for continuous columns (e.g. et_dayofweek_sin-list_seq) in the schema file. It won't affect anything.
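
In other words, only the categorical columns need their max values updated. As a sketch (assuming the merlin.schema Tags API and the fitted `workflow` from the notebook), you can list exactly those columns:

# Minimal sketch, assuming `workflow` is the fitted NVTabular workflow.
# Columns tagged CATEGORICAL are the only ones with embedding tables,
# so only their cardinalities matter for the max values in the schema.
from merlin.schema import Tags

categorical = workflow.output_schema.select_by_tag(Tags.CATEGORICAL)
print(categorical.column_names)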

rnyak commented 1 year ago

Closing this, since the issue was solved.