An updated version of the repository is available at:


This is an implementation of the Joint Representation Learning Model (JRLM) for product recommendation based on heterogeneous information sources [2]. Please cite the following paper if you plan to use it for your project:

    Yongfeng Zhang, Qingyao Ai, Xu Chen, W. Bruce Croft.  2017.
    "Joint Representation Learning for Top-N Recommendation with Heterogeneous
    Information Sources".  In Proceedings of CIKM ’17.

JRL is a deep neural network model that jointly learns latent representations for products and users from reviews, images, and product ratings. The model can learn these representations jointly or independently from the different information sources.

The probability (which is also the rank score) of a product being purchased by a user can be computed from their concatenated latent representations from the different information sources. Please refer to the paper for more details.
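
As a rough illustration of the idea above, the sketch below concatenates per-source vectors for a user and a product and turns their inner product into a probability with a sigmoid. The inner product, the sigmoid, the function names, and the toy vectors are all assumptions made for illustration; the actual scoring function is the one defined in the paper and implemented in ./JRL/.

```python
import math


def concat_views(views):
    """Concatenate per-source vectors (e.g. review, image, rating) into one list."""
    joint = []
    for vec in views:
        joint.extend(vec)
    return joint


def purchase_score(user_views, item_views):
    """Hypothetical scoring: sigmoid of the inner product of the joint vectors."""
    user = concat_views(user_views)
    item = concat_views(item_views)
    if len(user) != len(item):
        raise ValueError("user and item representations must have equal length")
    logit = sum(u * v for u, v in zip(user, item))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy vectors for three sources: review, image, rating (values are made up).
user_views = [[0.1, 0.3], [0.0, 0.2], [0.5, -0.1]]
item_views = [[0.4, 0.1], [0.3, 0.3], [0.2, 0.6]]
score = purchase_score(user_views, item_views)
```

Ranking the candidate products for one user by this score, highest first, would give a top-N list under these assumptions.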


o To run the JRL model in ./JRL/ and the Python scripts in ./scripts/, Python 2.7+ and TensorFlow v1.0+ are needed.

o To run the jar package in ./jar/, JDK 1.7 is needed.

o To compile the Java code in ./java/, Galago from the Lemur Project is needed. (https://sourceforge.net/p/lemur/wiki/Galago%20Installation/)

Data Preparation

o Note: the pre-split dataset used in the paper can be downloaded from the following link:


If the above link doesn't work, please click the following one:


If you want to process new datasets, please follow the instructions below.

o Download Amazon review datasets from http://jmcauley.ucsd.edu/data/amazon/. In our paper, we used 5-core data.

o Stem the Amazon review datasets and remove stop words if needed. In our paper, we stemmed the “reviewText” and “summary” fields without removing stop words.

   java -Xmx4g -jar ./jar/AmazonReviewData_preprocess.jar <jsonConfigFile> <review_file> <output_review_file>


   <jsonConfigFile>       A JSON file that specifies the file path of the stop word list.
                          An example can be found in the root directory.  Enter “false” if
                          you don’t want to remove stop words.

   <review_file>          The path for the original Amazon review data.

   <output_review_file>   The output path for processed Amazon review data.

o Index datasets

    python ./scripts/index_and_filter_review_file.py <review_file> <indexed_data_dir> <min_count>


    <review_file>       The file path for the Amazon review data.

    <indexed_data_dir>  The output directory for indexed data.

    <min_count>         The minimum count for terms.  If a term appears fewer than <min_count>
                        times in the data, it will be ignored.
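
The effect of <min_count> can be sketched as follows. This is not the code in index_and_filter_review_file.py; the function names, whitespace tokenization, and the toy reviews are assumptions used only to show how terms below the threshold are dropped before indexing.

```python
from collections import Counter


def build_vocab(reviews, min_count):
    """Return the sorted terms that occur at least min_count times."""
    counts = Counter(term for review in reviews for term in review.split())
    return sorted(term for term, n in counts.items() if n >= min_count)


def index_review(review, term_to_id):
    """Map a review to term ids, dropping terms that were filtered out."""
    return [term_to_id[t] for t in review.split() if t in term_to_id]


# Toy reviews (made up); "price", "bad" and "battery" appear only once.
reviews = ["great phone great price", "bad phone", "great battery"]
vocab = build_vocab(reviews, min_count=2)
term_to_id = {term: i for i, term in enumerate(vocab)}
indexed = [index_review(r, term_to_id) for r in reviews]
```

With min_count=2, only "great" and "phone" survive, so every indexed review contains ids for those two terms alone.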

o Split train/test
    -- Download the meta data from http://jmcauley.ucsd.edu/data/amazon/ 

    -- Split datasets for training and test

         python ./scripts/split_train_test.py <indexed_data_dir> <review_sample_rate>


         <indexed_data_dir>    The directory for indexed data.
         <review_sample_rate>  The proportion of reviews used in test for each user.  In our
                           paper, we used 0.3.
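
         A minimal sketch of a per-user split with <review_sample_rate> is shown below. It is not the code in split_train_test.py: the random shuffling, the fixed seed, rounding the test size down, and the function name are assumptions for illustration.

```python
import random


def split_reviews_by_user(user_reviews, review_sample_rate, seed=0):
    """Hold out a fraction of each user's reviews for testing."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, reviews in user_reviews.items():
        shuffled = list(reviews)
        rng.shuffle(shuffled)
        # Assumption: the number of test reviews is rounded down.
        n_test = int(len(shuffled) * review_sample_rate)
        test[user] = shuffled[:n_test]
        train[user] = shuffled[n_test:]
    return train, test


# Toy data: review ids per user (made up).
user_reviews = {"u1": list(range(10)), "u2": list(range(5))}
train, test = split_reviews_by_user(user_reviews, review_sample_rate=0.3)
```

         Splitting within each user, rather than over the pooled reviews, keeps every user represented in the training set.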

    --  Match image features
        + Download the image features from http://jmcauley.ucsd.edu/data/amazon/ .

        + Match image features with product ids.

            python ./scripts/match_with_image_features.py <indexed_data_dir> <image_feature_file>


            <indexed_data_dir>     The directory for indexed data.
            <image_feature_file>   The file for image features data.

    -- Match rating features
       + Construct latent representations based on rating information with any method you like
         (e.g. BPR).

       + Format the latent factors of items and users in "item_factors.csv" and "user_factors.csv"
         such that each row represents one latent vector for the corresponding item/user in
         <indexed_data_dir>/product.txt.gz and user.txt.gz.  See example csv files.

       + Put the item_factors.csv and user_factors.csv into <indexed_data_dir>.
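
The row alignment described in the rating step can be sketched as below. The exact CSV layout (comma delimiter, no header, one float per column) is an assumption and should be checked against the example csv files; the ids, the vectors, and the function name are hypothetical.

```python
import csv
import io


def write_factors(ordered_ids, factors, out_file):
    """Write one latent vector per row, in the same order as ordered_ids."""
    writer = csv.writer(out_file, lineterminator="\n")
    for entity_id in ordered_ids:
        writer.writerow(factors[entity_id])


# Hypothetical ids in index order and 2-dimensional factors for illustration.
item_ids = ["B0001", "B0002", "B0003"]
item_factors = {"B0002": [0.5, -0.25], "B0001": [0.1, 0.2], "B0003": [0.0, 1.0]}

buffer = io.StringIO()  # stands in for open("item_factors.csv", "w")
write_factors(item_ids, item_factors, buffer)
csv_text = buffer.getvalue()
```

Here the order of item_ids plays the role of the line order in product.txt.gz, so row i of the CSV belongs to the i-th indexed item regardless of the order in which the factors were learned.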

Model Training/Testing

python ./JRL/main.py --<parameter_name> <parameter_value> --<parameter_name> <parameter_value>

where parameter names and values include:

learning_rate The learning rate in training. Default 0.05.

learning_rate_decay_factor The factor by which the learning rate is multiplied whenever the loss is higher than the three previous losses. Default 0.90.
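
One way to read the decay rule above is sketched here. Comparing the new loss against the maximum of the last three recorded losses is an assumption about how "higher than three previous losses" is evaluated, and the function name and loss values are made up.

```python
def decay_learning_rate(learning_rate, decay_factor, previous_losses, loss):
    """Return the decayed rate if loss exceeds each of the last three losses."""
    if len(previous_losses) >= 3 and loss > max(previous_losses[-3:]):
        return learning_rate * decay_factor
    return learning_rate


learning_rate = 0.05
previous_losses = []
# The fifth loss (0.95) is higher than the three before it, so one decay happens.
for loss in [1.00, 0.90, 0.80, 0.70, 0.95, 0.60]:
    learning_rate = decay_learning_rate(learning_rate, 0.90, previous_losses, loss)
    previous_losses.append(loss)
```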

max_gradient_norm Clip gradients to this norm. Default 5.0.
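
Gradient clipping by norm can be illustrated in plain Python as below. Whether ./JRL/ clips the joint norm over all gradients, as done here, or each gradient separately is an assumption; the function name and the toy gradients are hypothetical.

```python
import math


def clip_by_global_norm(gradients, max_gradient_norm):
    """Rescale all gradients so their joint L2 norm is at most max_gradient_norm."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= max_gradient_norm:
        return [list(grad) for grad in gradients], global_norm
    scale = max_gradient_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm


gradients = [[3.0, 4.0], [0.0, 12.0]]  # joint norm is 13.0
clipped, norm = clip_by_global_norm(gradients, max_gradient_norm=5.0)
```

Scaling every gradient by the same factor keeps the update direction unchanged and only limits its length.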

subsampling_rate The subsampling rate. Default 1e-4.

L2_lambda The lambda for L2 regularization. Default 0.0.

image_weight The weight for image feature based training loss. See the paper for more details.

batch_size Batch size used in training. Default 64.

data_dir Data directory, which should be the .

input_train_dir The directory of training and testing data, which usually is

