Open · CarloNicolini opened this issue 5 months ago
@CarloNicolini please provide a minimal reproducible example so that we can run and reproduce the issue you are facing.

What are the dtypes of your list columns? Are you properly categorifying the list features using NVTabular, and are you transforming your validation data accordingly?

Why do you think you need `shuffle_by_keys`? We have `shuffle_by_keys` for the `Groupby` op, for the case where one is doing a groupby on a given column (say a unique session id) but that session id is scattered over different parquet files; however, we don't recommend it for large datasets. Are you doing something like that? You are fine-tuning a two-tower model, right? Not a session-based model, I believe. The preprocessing pattern I mean is sketched below.
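A minimal sketch of that pattern, with placeholder column names and paths (`userId`, `movieId`, `genres`, the parquet globs are assumptions): fit `Categorify` on the training data only, then transform both splits with the same fitted workflow so they share one vocabulary.

```python
import nvtabular as nvt
from nvtabular import ops

# Placeholder schema: "genres" is a list (multi-hot) column.
cat_features = ["userId", "movieId", "genres"] >> ops.Categorify()
workflow = nvt.Workflow(cat_features)

train = nvt.Dataset("train/*.parquet")
valid = nvt.Dataset("valid/*.parquet")

# Fit on train only; reuse the fitted workflow for validation so both
# splits are encoded with the same category mappings.
workflow.fit(train)
workflow.transform(train).to_parquet("train_processed/")
workflow.transform(valid).to_parquet("valid_processed/")
```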
❓ Questions & Help
I am using the NVIDIA Merlin Docker 23.08 TensorFlow container.

I've created my training and validation datasets and saved them to parquet following the standard procedure with an `nvt.Workflow`.
I am now facing some issues training a two-tower model based largely on the examples provided in the notebooks, but with many more list features (such as `genres` in the MovieLens dataset). The training starts and the loss decreases, but at the validation step I get an `UnknownError` that seems to originate from a missing index in the underlying cuDF DataFrame, which in turn comes from a `StopIteration` raised when the validation data is evaluated.

I then tried to run some test iterations on the validation dataset and found, to my surprise, that even the `mm.Loader` cannot correctly iterate over it. In other words, I've verified that I cannot consume all the batches from the dataset unless I set the `batch_size` to 1, which every number is divisible by. Indeed, a simple loop like the one sketched below raises `StopIteration`.
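The loop looks roughly like this; the path and batch size here are placeholders:

```python
import merlin.models.tf as mm
from merlin.io import Dataset

valid = Dataset("valid_processed/*.parquet")  # placeholder path
loader = mm.Loader(valid, batch_size=1024, shuffle=False)

n_batches = 0
for inputs, targets in loader:
    n_batches += 1
print(f"consumed {n_batches} batches")
# With any batch_size > 1 the loop dies with StopIteration before all
# batches are consumed; with batch_size=1 it completes.
```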
I hope this is something wrong on my side. I didn't call the `shuffle_by_keys` method on the loaded dataset, nor during its creation. Is this related?
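For context, this is the call I mean; a sketch assuming a hypothetical `session_id` key column:

```python
from merlin.io import Dataset

ds = Dataset("valid_processed/*.parquet")
# Repartition so that all rows sharing the same key land in the same
# partition; "session_id" is a hypothetical key column here.
ds = ds.shuffle_by_keys(keys=["session_id"])
```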