beta-team / beta-recsys

Beta-RecSys: Build, Evaluate and Tune Automated Recommender Systems
https://beta-recsys.readthedocs.io/en/latest/
MIT License
162 stars 33 forks source link

The complexity of leave-one-out can be improved in O(n) #351

Closed mengzaiqiao closed 3 years ago

mengzaiqiao commented 3 years ago

Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

def leave_one_out(actions):
user_actions = defaultdict(list)
for action in sorted(actions, key = lambda action: action.timestamp):
user_actions[action.user_id].append(action)

train_actions = []
test_actions = []
val_actions = []

for user in user_actions:
if len(user_actions[user]) < 3:
train_actions += user_actions[user]
else:
train_actions += user_actions[user][:-2]
action_val, action_test = user_actions[user][-2:]
if random.randint(0, 1) == 0:
action_val, action_test = action_test, action_val
val_actions.append(action_val)
test_actions.append(action_test)
return train_actions, test_actions, val_actions

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

leungyukshing commented 3 years ago

I've try to implement an improved version according to the above code, but I can't see any improvements. However I come up with an idea that may work.

Basically there are two stages in this function. First we construct a map to record each user and its records. Second we do leave_one_out for each user by manipulating its records.

How about parallel the second stage? Multi-thread is not working in Python but we can use Multi-process.

leungyukshing commented 3 years ago

I've implemented an experimental version and test on Movielens_1m. There is indeed some improvements on time cost.

Origin Version: About 170s Parallel Version: About 105s

Undoubtedly, parallel version provides scalability and may achievement better improvements on larger datasets.

mengzaiqiao commented 3 years ago

I have rewritten the leave-one-out method using the following codes:

def leave_one_out(data):
    data[DEFAULT_FLAG_COL] = "train"
    data.sort_values(by=[DEFAULT_TIMESTAMP_COL], ascending=False, inplace=True)
    data.loc[data.groupby([DEFAULT_USER_COL]).head(2).index,DEFAULT_FLAG_COL]="validate"
    data.loc[data.groupby([DEFAULT_USER_COL]).head(1).index,DEFAULT_FLAG_COL]="test"

which gives an O(n) complexity.

This split method can be done in 2 seconds,

and in 30 seconds when it is with a negative sampling test set.