@Jeriousman Hello, thanks for your attention to RecBole! As you said, the overall data flow can be described as raw -> atomic -> dataset (dataframe) -> dataloader (interaction). We use create_dataset() to convert the atomic files into a Dataset (that is what we feed to init), and during this transformation we perform a series of preprocessing steps. Then we use data_preparation() to convert the Dataset into dataloaders, where the data is split. Here is an example:
# dataset filtering
dataset = create_dataset(config)
logger.info(dataset)
# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)
# model loading and initialization
model = NewModel(config, train_data.dataset).to(config['device'])
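If it helps, here is a rough sketch of what happens next in the quick start. The generic Trainer is shown for simplicity; RecBole's quick start actually picks the trainer class via get_trainer, so treat this as illustrative:
from recbole.trainer import Trainer

trainer = Trainer(config, model)
# fit() iterates over the train/valid dataloaders in mini-batches
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)
# evaluate() consumes the test dataloader
test_result = trainer.evaluate(test_data)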
So what we feed into the init is the Dataset from create_dataset(), and then model(config, Dataloader Dataset)? (Just to double-check.) Thank you!
Sorry, I don't know what Dataloader Dataset means. Maybe what I said above was not clear.
In the example above, after we get the dataset from create_dataset(), we use data_preparation(config, dataset) to get train_data, valid_data, test_data (they are all dataloaders).
But during data_preparation, the interaction features in the dataset are split, while all the other attributes stay the same:
built_datasets = dataset.build()
train_dataset, valid_dataset, test_dataset = built_datasets
# they are all :class:`~Dataset`
And it sets train_data.dataset = train_dataset, valid_data.dataset = valid_dataset, and so on.
So in the example above, with model = NewModel(config, train_data.dataset), train_data.dataset is just a :class:`~Dataset` whose interaction features have been split.
Maybe that is not clear enough, but if you have any questions, please ask. You can also read the code in this section to understand it better.
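To illustrate, here is a hypothetical model (not the real BERT4Rec code; hidden_size is assumed to be set in the config). Its init only reads statistics and field information from that split Dataset; the actual training data arrives later as mini-batch Interactions:
import torch.nn as nn
from recbole.model.abstract_recommender import SequentialRecommender

class NewModel(SequentialRecommender):
    def __init__(self, config, dataset):
        # the parent class reads dataset.num(self.ITEM_ID) to set self.n_items
        super().__init__(config, dataset)
        self.hidden_size = config['hidden_size']
        self.item_embedding = nn.Embedding(self.n_items, self.hidden_size, padding_idx=0)

    def calculate_loss(self, interaction):
        ...  # receives a mini-batch Interaction from the train dataloader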
I know what you mean. But what I asked is this: even though the difference is the whole data vs. the split data, the data from create_dataset() is in Dataset form (built from the atomic files), and the data from data_preparation(config, dataset) is in dataloader form. So what I asked is:
The data from create_dataset() goes to the init of BERT4Rec, for example, and the data from data_preparation(config, dataset) goes to the loop below (let's call that data train_data, i.e. the dataloader returned by data_preparation):
iter_data = (
    tqdm(
        train_data,
        total=len(train_data),
        ncols=100,
    ) if False else train_data
)
for batch_idx, interaction in enumerate(iter_data):
    interaction = interaction.to(model.device)
for calculate_loss, since the dataloader-format data goes in for mini-batch training. I just wanted to be crystal clear that I am right. Your feedback has been awesome, by the way. So, I guess I am right on this?
Thank you in advance.
Yeah, you are right.
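For reference, here is a simplified sketch of that step, roughly following what Trainer._train_epoch does. The optimizer setup is illustrative, and some models return a tuple of losses that the real trainer sums:
import torch
from tqdm import tqdm

model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=config['learning_rate'])

show_progress = False
iter_data = tqdm(train_data, total=len(train_data), ncols=100) if show_progress else train_data
for batch_idx, interaction in enumerate(iter_data):
    interaction = interaction.to(model.device)   # Interaction produced by the dataloader
    optimizer.zero_grad()
    loss = model.calculate_loss(interaction)     # the model sees Interactions here, not the Dataset
    loss.backward()
    optimizer.step()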
@leoleojie Thank you very much. Thanks again!
For example, for BERT4Rec, in its init we should feed a dataset. But since there is a flow of data (raw -> atomic -> dataframe -> dataloader), I am a bit confused, so I am asking to clarify.
What should the dataset that is fed to init(config, dataset) be? (Below you can see the example of BERT4Rec.) Is it the dataset form of the data I get when I set 'save_dataloaders': True in the config?
As far as I know, the dataloaders we get from the data_preparation function are the same as the data saved with 'save_dataloaders': True.
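For example (a hedged sketch; the exact save location and behavior can vary by RecBole version), enabling save_dataloaders just makes data_preparation serialize the same dataloaders it returns:
from recbole.config import Config
from recbole.data import create_dataset, data_preparation

config = Config(model='BERT4Rec', dataset='ml-100k',
                config_dict={'save_dataloaders': True})
dataset = create_dataset(config)
# returns the dataloaders and also saves them to disk because save_dataloaders is True
train_data, valid_data, test_data = data_preparation(config, dataset)
# train_data.dataset is the split Dataset that goes into the model's init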