RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

The data that is fed into models. #1207

Closed Jeriousman closed 2 years ago

Jeriousman commented 2 years ago

For example, BERT4Rec's `__init__` takes a dataset. But since there is a flow of data (raw -> atomic -> dataframe -> dataloader), I am a bit confused, so I am asking to clarify.

What should the dataset that is fed to `__init__(config, dataset)` be? (Below you can see the example of BERT4Rec.) Is it the dataset form of the data I get when I set `'save_dataloaders': True` in the config?

As far as I know, the dataloaders we get from the `data_preparation` function are the same as the data saved by `'save_dataloaders': True`.

class BERT4Rec(SequentialRecommender):

    def __init__(self, config, dataset):
        super(BERT4Rec, self).__init__(config, dataset)
leoleojie commented 2 years ago

@Jeriousman Hello, thanks for your attention to RecBole! As you said, the overall data flow can be described as raw -> atomic -> dataset (dataframe) -> dataloader (interaction). We use `create_dataset()` to convert atomic files into a `Dataset` (that is what is fed to `__init__`), and during this transformation we apply a series of preprocessing steps. Then we use `data_preparation` to convert the `Dataset` into dataloaders, which is where the data gets split. Here is an example:

    # dataset filtering
    dataset = create_dataset(config)
    logger.info(dataset)

    # dataset splitting
    train_data, valid_data, test_data = data_preparation(config, dataset)

    # model loading and initialization
    model = NewModel(config, train_data.dataset).to(config['device'])
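To make the flow above concrete, here is a toy sketch of the pipeline (hypothetical classes such as `ToyDataset` and `ToyDataLoader`, written only for illustration; this is not RecBole's actual implementation):

```python
# Toy sketch of the data flow: raw/atomic table -> Dataset (preprocessed)
# -> build() split -> DataLoader-style wrappers. Names are hypothetical.

class ToyDataset:
    """Stands in for RecBole's Dataset: holds the preprocessed interactions."""
    def __init__(self, interactions):
        # toy "preprocessing": drop interactions missing a user or item
        self.inter_feat = [r for r in interactions if r["user"] and r["item"]]

    def build(self, ratios=(0.8, 0.1, 0.1)):
        # split interaction features into train/valid/test parts
        n = len(self.inter_feat)
        n_train = int(n * ratios[0])
        n_valid = int(n * ratios[1])
        parts = (self.inter_feat[:n_train],
                 self.inter_feat[n_train:n_train + n_valid],
                 self.inter_feat[n_train + n_valid:])
        out = []
        for part in parts:
            d = ToyDataset.__new__(ToyDataset)
            d.inter_feat = part
            out.append(d)
        return out

class ToyDataLoader:
    """Stands in for a dataloader: wraps a Dataset and yields batches."""
    def __init__(self, dataset, batch_size=2):
        self.dataset = dataset          # mirrors train_data.dataset
        self.batch_size = batch_size
    def __iter__(self):
        feat = self.dataset.inter_feat
        for i in range(0, len(feat), self.batch_size):
            yield feat[i:i + self.batch_size]

# atomic-style interaction table (toy data)
atomic = [{"user": u, "item": i} for u, i in
          [(1, 10), (1, 11), (2, 10), (2, 12), (3, 11),
           (3, 12), (4, 10), (4, 13), (5, 11), (5, 13)]]

dataset = ToyDataset(atomic)                   # create_dataset() analogue
train_ds, valid_ds, test_ds = dataset.build()  # data_preparation() splits
train_data = ToyDataLoader(train_ds)           # the model would get train_data.dataset
print(len(train_ds.inter_feat), len(valid_ds.inter_feat), len(test_ds.inter_feat))
# -> 8 1 1
```

The model's `__init__` receives the `Dataset`-like object (for vocabulary sizes, field info, etc.), while training iterates over the dataloader.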
Jeriousman commented 2 years ago

So what we feed into `__init__` is the Dataset from `create_dataset()`, and then `model(config, Dataloader Dataset)`? (Just to double-check.) Thank you!

leoleojie commented 2 years ago

Sorry, I don't know what "Dataloader Dataset" means; maybe what I said above was not clear. In the example above, after we get the dataset from `create_dataset()`, we use `data_preparation(config, dataset)` to get `train_data`, `valid_data`, and `test_data` (they are all dataloaders). But during `data_preparation`, the interaction features in the dataset are split, while all the other attributes stay the same:

    built_datasets = dataset.build()
    train_dataset, valid_dataset, test_dataset = built_datasets
    # they are all instances of Dataset

Then it sets `train_data.dataset = train_dataset`, `valid_data.dataset = valid_dataset`, and so on. So in the example above, in `model = NewModel(config, train_data.dataset)`, `train_data.dataset` is just a `Dataset` whose interaction features have been split.
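The "split the interactions, keep everything else" behavior could be sketched like this (a toy `MiniDataset` class invented for illustration, not RecBole's code):

```python
import copy

class MiniDataset:
    """Toy stand-in for a Dataset whose build() splits only the interactions."""
    def __init__(self, inter_feat, field_names):
        self.inter_feat = inter_feat      # interaction features: split by build()
        self.field_names = field_names    # other attributes: shared unchanged

    def build(self):
        # split 6 interactions 4/1/1; each copy shares every attribute
        # except inter_feat
        splits = (self.inter_feat[:4], self.inter_feat[4:5], self.inter_feat[5:])
        datasets = []
        for part in splits:
            d = copy.copy(self)
            d.inter_feat = part
            datasets.append(d)
        return datasets

full = MiniDataset(inter_feat=list(range(6)), field_names=["user_id", "item_id"])
train_ds, valid_ds, test_ds = full.build()
print(train_ds.inter_feat, train_ds.field_names is full.field_names)
# -> [0, 1, 2, 3] True
```

So the object passed to the model still looks like a full `Dataset` (same fields, vocabularies, etc.), only its interaction table is the training split.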

Maybe this is still not clear enough, but if you have any questions please ask. You can also read the code in this section to understand it better.

Jeriousman commented 2 years ago

I know what you mean, but what I asked is this: even though the difference is the whole data versus the split data, the data from `create_dataset()` is in atomic format and the data from `data_preparation(config, dataset)` is in dataloader format. So what I asked is:

the data from `create_dataset()` goes into the `__init__` of BERT4Rec, for example, and the data from `data_preparation(config, dataset)` goes into the loop below (let's call that data `train_data`, the dataloader data from `data_preparation`):

iter_data = (
    tqdm(
        train_data,
        total=len(train_data),
        ncols=100,
    ) if False else train_data  # False hardcoded here: progress bar disabled
)

for batch_idx, interaction in enumerate(iter_data):
    interaction = interaction.to(model.device)

since the dataloader-format data goes into `calculate_loss` for mini-batch training. I just wanted to be crystal clear that I am right. Your feedback has been awesome, by the way. So, I guess I am right on this?

Thank you in advance.
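The mini-batch iteration described above can be sketched with a toy loader (a hypothetical `BatchLoader` class; a real trainer would call the model's loss function on each batch):

```python
class BatchLoader:
    """Hypothetical stand-in for a train dataloader that yields mini-batches."""
    def __init__(self, interactions, batch_size):
        self.interactions = interactions
        self.batch_size = batch_size

    def __len__(self):
        # number of mini-batches; this is what len(train_data) feeds to
        # tqdm's `total` argument in the snippet above
        return -(-len(self.interactions) // self.batch_size)  # ceil division

    def __iter__(self):
        for i in range(0, len(self.interactions), self.batch_size):
            yield self.interactions[i:i + self.batch_size]

train_data = BatchLoader(interactions=list(range(10)), batch_size=4)
losses = []
for batch_idx, interaction in enumerate(train_data):
    # a real trainer would do: loss = model.calculate_loss(interaction)
    losses.append(sum(interaction))
print(len(train_data), losses)
# -> 3 [6, 22, 17]
```

Each `interaction` here plays the role of one mini-batch moved to the model's device and passed to `calculate_loss`.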

leoleojie commented 2 years ago

Yeah, you are right.

Jeriousman commented 2 years ago

@leoleojie Thank you very much. Xiexie