etlundquist / rankfm

Factorization Machines for Recommendation and Ranking Problems with Implicit Feedback Data
GNU General Public License v3.0
170 stars 36 forks

Bug when using user features? #40

Open erlebach opened 2 years ago

erlebach commented 2 years ago

When running fit() with user features, I get the error:

KeyError: 'the users in [user_features] do not match the users in [interactions]'

which has been reported previously. In my case, I did some debugging in the source code, and found the following. In the function _init_interactions, one finds the statement:

            if np.array_equal(sorted(x_uf.index.values), self.user_idx):
                self.x_uf = np.ascontiguousarray(x_uf.sort_index(), dtype=np.float32)
            else:
                raise KeyError('the users in [user_features] do not match the users in [interactions]')

which is the error in question. Looking at the definition of self.user_idx, one finds, in the same file rankfm.py:

        # store unique values of user/item indexes and observed interactions for each user
        self.user_idx = np.arange(len(self.user_id), dtype=np.int32)
        self.item_idx = np.arange(len(self.item_id), dtype=np.int32)

near line 128. Clearly, self.user_idx holds the consecutive indexes 0, 1, 2, ... up to the number of user ids. However, sorted(x_uf.index.values) is the sorted list of user ids. Thus, the two lists cannot be equal. The code that leads me to this conclusion is:

        if user_features is not None:
            x_uf = pd.DataFrame(user_features.copy())
            x_uf = x_uf.set_index(x_uf.columns[0])
            x_uf.index = x_uf.index.map(self.user_to_index)
            if np.array_equal(sorted(x_uf.index.values), self.user_idx):

As far as I understand, the first column of user_features, which is an argument to the function, should be the actual user_id, which can be anything as long as it does not appear twice in the dataframe. In this case, the conditional (last line) cannot be satisfied. Therefore, I must be misunderstanding the data format of user_features. Where is this explained? The documentation states the following:

user_features – dataframe of user metadata features: [user_id, uf_1, … , uf_n]

with no additional information regarding the values of user_id. Any clarification would be most welcome!

erlebach commented 2 years ago

Please ignore the question. I forgot to remove duplicate member entries from the user_features matrix.
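
For anyone landing here with the same error, here is a minimal sketch of why duplicate rows trip the check. It mirrors the lines quoted above, but the data and the stand-in variables (user_id, user_to_index, user_idx, passes_check) are made up for illustration and are not the library's actual internals:

```python
import numpy as np
import pandas as pd

# Hypothetical interaction data: three distinct users with arbitrary ids.
interactions = pd.DataFrame(
    {"user_id": [101, 205, 101, 309], "item_id": [1, 2, 3, 1]}
)

# Stand-ins for the model state referenced in the quoted snippets (assumed shapes).
user_id = np.sort(interactions["user_id"].unique())
user_to_index = {uid: idx for idx, uid in enumerate(user_id)}
user_idx = np.arange(len(user_id), dtype=np.int32)


def passes_check(user_features):
    """Replicate the quoted comparison for a [user_id, uf_1, ...] dataframe."""
    x_uf = user_features.set_index(user_features.columns[0])
    # Raw user ids are mapped to 0..n-1 before the comparison takes place.
    x_uf.index = x_uf.index.map(user_to_index)
    return bool(np.array_equal(sorted(x_uf.index.values), user_idx))


clean = pd.DataFrame({"user_id": [309, 101, 205], "uf_1": [0.1, 0.2, 0.3]})
duplicated = pd.concat([clean, clean.iloc[[0]]], ignore_index=True)

clean_ok = passes_check(clean)            # exactly one row per user
duplicated_ok = passes_check(duplicated)  # user 309 appears twice
```

With one row per user, the mapped index sorts to 0, 1, 2 and matches user_idx. With the extra row, the sorted index has four entries, so the comparison fails and the KeyError would be raised.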

srinivascnu166 commented 2 years ago

Hi, I have faced the same issue. Can you please provide the details of how you solved it? It would be great if you could share how you formatted the data for user_features.

erlebach commented 2 years ago

Hi @srinivascnu166, all I did was make sure of two things (each to be checked independently for item features and user features):

1) There should be no duplicate rows, i.e., no duplicate items in the item feature list.
2) The list of unique items derived from the user/item interaction list should be the same as the list of unique items derived from the item feature list.

Does this make sense?
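
In case it helps, the two checks could be sketched with pandas roughly as follows. The helper name check_features and the example dataframes are hypothetical, and the sketch assumes the id sits in the first column of the feature dataframe, as in the documented [user_id, uf_1, … , uf_n] layout:

```python
import pandas as pd


def check_features(interactions, features, id_col):
    """Report ids that would make the features mismatch the interactions.

    Hypothetical helper: `features` is assumed to hold the id in its first column.
    """
    feature_ids = features.iloc[:, 0]
    seen_in_features = set(feature_ids.tolist())
    seen_in_interactions = set(interactions[id_col].unique().tolist())
    return {
        # ids that occur on more than one feature row
        "duplicate_ids": sorted(feature_ids[feature_ids.duplicated()].unique().tolist()),
        # ids with interactions but no feature row
        "missing_from_features": sorted(seen_in_interactions - seen_in_features),
        # ids with a feature row but no interactions
        "not_in_interactions": sorted(seen_in_features - seen_in_interactions),
    }


# Made-up example data.
interactions = pd.DataFrame({"user_id": [1, 2, 3, 1], "item_id": [10, 11, 10, 12]})
user_features = pd.DataFrame({"user_id": [1, 1, 2, 4], "uf_1": [0.5, 0.5, 0.1, 0.9]})

report = check_features(interactions, user_features, "user_id")

# One possible cleanup: keep the first row per id, then drop ids without interactions.
cleaned = user_features.drop_duplicates(subset=user_features.columns[0])
cleaned = cleaned[cleaned.iloc[:, 0].isin(interactions["user_id"])]
```

All three lists in the report need to be empty before fitting. The cleanup shown cannot fix missing_from_features (user 3 here); those users would need feature rows added, or their interactions removed.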