mengzaiqiao closed this issue 3 years ago
I tried to implement an improved version based on the code above, but I couldn't see any improvement. However, I came up with an idea that may work.
Basically, there are two stages in this function. First, we construct a map from each user to its records. Second, we do leave_one_out
for each user by manipulating its records.
How about parallelizing the second stage? Multi-threading does not speed up CPU-bound work in Python because of the GIL, but we can use multiple processes.
I've implemented an experimental version and tested it on Movielens_1m. There is indeed some improvement in running time.
Original version: about 170s
Parallel version: about 105s
The parallel version scales with the number of worker processes and may achieve larger improvements on larger datasets.
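For illustration, here is a minimal sketch of the two-stage idea with the second stage spread over worker processes. It is not the experimental version mentioned above: the function names (build_user_map, split_user, parallel_leave_one_out), the (user, item, timestamp) tuple layout, and the chunksize value are all assumptions made for this example.

```python
from collections import defaultdict
from multiprocessing import Pool


def build_user_map(records):
    """Stage 1: group (user, item, timestamp) tuples by user."""
    user_map = defaultdict(list)
    for user, item, timestamp in records:
        user_map[user].append((item, timestamp))
    return user_map


def split_user(entry):
    """Stage 2 for one user: newest record -> test, second newest -> validate."""
    user, interactions = entry
    # Sort newest first so positions 0 and 1 are the held-out records.
    ordered = sorted(interactions, key=lambda pair: pair[1], reverse=True)
    flags = ["test", "validate"]
    return [
        (user, item, timestamp, flags[rank] if rank < 2 else "train")
        for rank, (item, timestamp) in enumerate(ordered)
    ]


def parallel_leave_one_out(records, n_workers=4):
    """Run stage 2 across worker processes and flatten the per-user results."""
    user_map = build_user_map(records)
    with Pool(processes=n_workers) as pool:
        # chunksize batches several users per task to reduce inter-process overhead.
        per_user = pool.map(split_user, user_map.items(), chunksize=256)
    return [row for rows in per_user for row in rows]


if __name__ == "__main__":
    # Hypothetical toy data: user 1 has three records, user 2 has one.
    toy = [(1, "a", 10), (1, "b", 30), (1, "c", 20), (2, "d", 5)]
    for row in parallel_leave_one_out(toy, n_workers=2):
        print(row)
```

Each user is independent in stage 2, so the work can be split without any shared state. The cost is that every user's records are pickled and sent to a worker, which may explain why the speedup is well below the number of processes.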
I have rewritten the leave-one-out method using the following code:
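def leave_one_out(data):
    # Flag every interaction as training data by default.
    data[DEFAULT_FLAG_COL] = "train"
    # Sort newest first so head(k) picks each user's k latest interactions.
    data.sort_values(by=[DEFAULT_TIMESTAMP_COL], ascending=False, inplace=True)
    # The two latest interactions become "validate"; the latest is then overwritten to "test".
    data.loc[data.groupby([DEFAULT_USER_COL]).head(2).index, DEFAULT_FLAG_COL] = "validate"
    data.loc[data.groupby([DEFAULT_USER_COL]).head(1).index, DEFAULT_FLAG_COL] = "test"
    return data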
which gives O(n log n) complexity, dominated by the sort; the groupby steps are linear in the number of records.
This split can be done in 2 seconds,
and in 30 seconds when a negatively sampled test set is generated as well.
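As a self-contained illustration of the same vectorized idea, the sketch below ranks each user's records by recency with a single groupby pass (cumcount) instead of two head() calls. It is a variant, not the code posted above. The string values given to DEFAULT_USER_COL, DEFAULT_TIMESTAMP_COL and DEFAULT_FLAG_COL, the "col_item" column, and the name leave_one_out_ranked are assumptions standing in for the project's own constants.

```python
import pandas as pd

# Assumed stand-ins for the project's column-name constants.
DEFAULT_USER_COL = "col_user"
DEFAULT_TIMESTAMP_COL = "col_timestamp"
DEFAULT_FLAG_COL = "col_flag"


def leave_one_out_ranked(data):
    """Return a flagged copy of data; the input frame is left untouched."""
    # A stable sort keeps the input order among records with equal timestamps.
    ordered = data.sort_values(
        by=DEFAULT_TIMESTAMP_COL, ascending=False, kind="mergesort"
    )
    # rank 0 = newest record of each user, rank 1 = second newest, and so on.
    rank = ordered.groupby(DEFAULT_USER_COL).cumcount()
    # Ranks other than 0 and 1 map to NaN and are filled with "train".
    ordered[DEFAULT_FLAG_COL] = rank.map({0: "test", 1: "validate"}).fillna("train")
    return ordered


# Hypothetical toy interactions: user 1 has three records, user 2 has two.
toy = pd.DataFrame(
    {
        DEFAULT_USER_COL: [1, 1, 1, 2, 2],
        "col_item": [10, 11, 12, 20, 21],
        DEFAULT_TIMESTAMP_COL: [100, 300, 200, 50, 60],
    }
)
flagged = leave_one_out_ranked(toy).sort_index()
print(flagged)
```

As in the posted function, a user with a single record ends up with that record flagged as "test" and no validation record.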