PreferredAI / cornac

A Comparative Framework for Multimodal Recommender Systems
https://cornac.preferred.ai
Apache License 2.0

Feature/add dmrl: Add DMRL Model #597

Closed · mabeckers closed this 3 months ago

mabeckers commented 3 months ago

Description

I added Disentangled Multimodal Representation Learning (https://arxiv.org/pdf/2203.05406.pdf) as the DMRL model to Cornac. As part of this addition I had to:

  1. add the PWLearningSampler: a Sampler to be wrapped around PyTorch's DataLoader class that loads the data the way the DMRL training mechanism requires it (pairwise ranking approach)
  2. add the TransformersTextModality (as described in the paper) to encode textual features into latent space
  3. add the TransformersVisionModality (as described in the paper) to encode visual features into latent space
  4. add the DistanceCorrelationCalculator, which can be used to calculate the disentangled loss (as described in the paper; see the sketch after this list)
  5. add dmrl and recom_dmrl as described by the paper and the Cornac framework
  6. add tests for all of the modules described above
  7. modify BaseMethod to include TransformersTextModality as an allowed test modality
  8. TransformersVisionModality is not yet included in BaseMethod, as it is not used in the DMRL examples (pre-encoded cornac vision features were used). Please add it if wanted.
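
For intuition, here is a minimal, generic sketch of a distance correlation term like the one the DistanceCorrelationCalculator computes for the disentangled loss. This is the standard formulation written in PyTorch, not the PR's implementation; the function name and epsilon are illustrative:

```python
import torch

def distance_correlation(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Distance correlation between two batches of factor representations (n x d1, n x d2)."""
    def centered_distances(z: torch.Tensor) -> torch.Tensor:
        d = torch.cdist(z, z, p=2)  # pairwise Euclidean distances within the batch
        # double-centering: subtract row/column means, add back the grand mean
        return d - d.mean(dim=0, keepdim=True) - d.mean(dim=1, keepdim=True) + d.mean()

    a, b = centered_distances(x), centered_distances(y)
    dcov2_xy = (a * b).mean()  # squared distance covariance between x and y
    dcov2_xx = (a * a).mean()  # squared distance variance of x
    dcov2_yy = (b * b).mean()  # squared distance variance of y
    return torch.sqrt(dcov2_xy.clamp(min=eps)) / torch.sqrt(
        torch.sqrt(dcov2_xx.clamp(min=eps) * dcov2_yy.clamp(min=eps))
    )
```

Roughly speaking, summing such a term over pairs of factors and adding it to the ranking loss is what pushes the learned factors toward independence, which is the role the disentangled loss plays in the paper.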

Checklist:

tqtg commented 3 months ago

Hi @mabeckers, thanks for the contribution. It's great to see DMRL being added to Cornac. However, there are a few things that we might need to reconsider. First, each model in Cornac is very self-contained, and model dependencies should not be added as global requirements. This is to minimize the maintenance effort for the core functions and to facilitate a wider range of model implementations. With that said, there are two directions to proceed with the DMRL model:

  1. Consider the model taking in raw text and raw video. With this approach, we should bring the text/video transformer-based encoders in as part of the model implementation. For example, the CDL model has an autoencoder for text.
  2. Consider the model taking in text/video features. In this case, we should do the text/video encoding separately (possibly as part of the example) prior to model training in Cornac. With the embeddings ready, we simply employ the FeatureModality for either text/video (see the sketch below). We can consider whether an additional VideoModality is needed or whether ImageModality could be used as an alternative for data input.

Hope that my explanation is clear enough. Happy to chat more.
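
As a rough illustration of option 2 (a minimal sketch, not code from this PR: `encode_docs` is a hypothetical placeholder encoder, and the `item_feature` keyword is an assumption mirroring the `item_text`/`item_image` usage):

```python
import numpy as np
import cornac
from cornac.data import FeatureModality, Reader
from cornac.datasets import citeulike
from cornac.eval_methods import RatioSplit


def encode_docs(corpus):
    """Placeholder encoder: in practice this would be a transformer model;
    here it just produces deterministic random vectors of the right shape."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(corpus), 384))


# Encode the raw documents outside of Cornac, before model training
docs, item_ids = citeulike.load_text()
item_embeddings = encode_docs(docs)

# Wrap the pre-computed embeddings in a FeatureModality so Cornac can align them
# with the item ID mapping during data splitting
item_features = FeatureModality(features=np.asarray(item_embeddings), ids=item_ids)

feedback = citeulike.load_feedback(reader=Reader(item_set=item_ids))
ratio_split = RatioSplit(
    data=feedback, test_size=0.2, exclude_unknowns=True, seed=123,
    item_feature=item_features,  # assumed keyword; mirrors item_text / item_image
)
```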

mabeckers commented 3 months ago

Hi @tqtg, thanks for taking a look at my PR. Yeah, I noticed that the two new transformer modalities introduced more general dependencies. I can go ahead and move those modules inside of recom_dmrl, so that the model receives basic text and images as input. I will make it general so that in case one already has encoded features from somewhere (say they come with the example), the model can take those in as well and will not run another layer of encoding on top of that feature set. Does that sound good to you?

Thanks, Max

tqtg commented 3 months ago

Sounds good to me. Let's do that and see how it goes.

mabeckers commented 3 months ago

Made the requested changes; please let me know if there's anything else I should change for this PR. I also re-merged with the latest cornac master. Thanks!

tqtg commented 3 months ago

@mabeckers I made some changes to get the tests working and did some refactoring. Please have a look and see if they make sense to you.

mabeckers commented 3 months ago

Everything looks great to me!

tqtg commented 3 months ago

Hey @mabeckers, there is something we need to modify about the model input (text and image modalities). By design, we don't pass the modalities directly to the model; instead, we pass them to an evaluation method (e.g., RatioSplit). The reason is that the modalities will then be aligned with the user/item data splitting, and the user/item IDs will be mapped properly. Taking the CDL model as an example, we input the text modality to the RatioSplit eval method (here) and we can access the text modality inside the model implementation via the train_set (here). Can we work on this last change before we merge the model into Cornac?
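
For reference, the CDL-style flow described above looks roughly like this (a minimal sketch loosely mirroring the repo's CDL example; parameter values are illustrative):

```python
from cornac.data import Reader, TextModality
from cornac.data.text import BaseTokenizer
from cornac.datasets import citeulike
from cornac.eval_methods import RatioSplit

# The raw text corpus is wrapped in a TextModality and handed to the eval method,
# which aligns it with the train/test split and the internal item ID mapping.
docs, item_ids = citeulike.load_text()
feedback = citeulike.load_feedback(reader=Reader(item_set=item_ids))

item_text = TextModality(
    corpus=docs,
    ids=item_ids,
    tokenizer=BaseTokenizer(stop_words="english"),
    max_vocab=8000,
    max_doc_freq=0.5,
)

ratio_split = RatioSplit(
    data=feedback,
    test_size=0.2,
    exclude_unknowns=True,
    item_text=item_text,
    verbose=True,
    seed=123,
)

# Inside a model's fit(), the aligned modality is then available via the train set
# (e.g. train_set.item_text) rather than being passed to the model constructor.
```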

mabeckers commented 3 months ago

Hey @tqtg, yeah, I understand that's how the Cornac framework works with modalities, which is why until commit 9fc96b3 I had it that way and was feeding modalities from the outside into the RatioSplit instance. I only changed it and moved them inside the model because you mentioned you didn't want to add any general dependencies (such as new TransformerModalities) to the Cornac core, but rather move that into the DMRL folder and have the model accept raw text and images. That's why I moved it out of RatioSplit. I am happy to revert to the earlier version and introduce TransformerVisionModality and TransformerTextModality as new general modality encoders. I can of course also just keep them in the DMRL folder and still use them as normal modalities and input them to the RatioSplit instance. Just let me know which way you would prefer. Thanks!

tqtg commented 3 months ago

My point is that you can reuse TextModality and ImageModality to hold the text/image corpus and input them into RatioSplit to perform data splitting. The only part we want to move inside the model implementation is where we use Transformers to encode the raw data. Does that make sense to you?
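
A minimal sketch of that split of responsibilities (assuming the aligned raw corpus is reachable as train_set.item_text.corpus inside fit(); the SentenceTransformer model is used purely for illustration and is not part of this PR):

```python
from sentence_transformers import SentenceTransformer
from cornac.models import Recommender


class DMRLSketch(Recommender):
    """Toy stand-in showing where the transformer encoding would live."""

    def __init__(self, name="DMRL-sketch"):
        super().__init__(name=name)

    def fit(self, train_set, val_set=None):
        super().fit(train_set, val_set)
        # The TextModality handed to RatioSplit holds the ID-aligned raw documents;
        # the transformer encoding happens here, inside the model, not in the core.
        corpus = train_set.item_text.corpus  # assumed attribute path
        encoder = SentenceTransformer("all-MiniLM-L6-v2")
        item_text_features = encoder.encode(corpus)  # (n_items, emb_dim) numpy array
        # ... pairwise training using item_text_features would follow here ...
        return self
```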

mabeckers commented 3 months ago

OK, I see. So if I'm understanding you correctly, you would want the example file for running DMRL to look something like this:

"""Example for Disentangled Multimodal Recommendation, with only feedback and textual modality. For an example including image modality please see dmrl_clothes_example.py"""

import cornac from cornac.data import Reader from cornac.datasets import citeulike from cornac.eval_methods import RatioSplit from cornac.models.dmrl.recom_dmrl import TextModalityInput

The necessary data can be loaded as follows

docs, item_id_ordering_text = citeulike.load_text() feedback = citeulike.load_feedback(reader=Reader(item_set=item_id_ordering_text))

text_modality_input = TextModalityInput(item_id_ordering_text, docs)

Instantiate DMRL recommender

dmrl_recommender = cornac.models.dmrl.DMRL( batch_size=4096, epochs=20, log_metrics=False, learning_rate=0.01, num_factors=2, decay_r=0.5, decay_c=0.01,
num_neg=3, embedding_dim=100, text_features=text_modality_input)

NEW METHOD THAT HOLDS THE TRANSFORMER ENCODING WITHIN DMRL MODEL:

item_text_modality = dmrl_recommender.encode_text() # returns a generic feature modality (or even a TextModality) # where pre-encoded text is given in .features attribute and uses Transformer internally.

Define an evaluation method to split feedback into train and test sets

ratio_split = RatioSplit( data=feedback, test_size=0.2, exclude_unknowns=True, verbose=True, seed=123, rating_threshold=0.5, item_text = item_text_modality )

Use Recall@300 for evaluations

rec_300 = cornac.metrics.Recall(k=300) prec_30 = cornac.metrics.Precision(k=30)

Put everything together into an experiment and run it

cornac.Experiment(eval_method=ratio_split, models=[dmrl_recommender], metrics=[prec_30, rec_300]).run()

tqtg commented 3 months ago

@mabeckers I made some changes to illustrate my idea. Please have a look and let me know if they make sense to you. We can further refactor the code to remove some unused parts.

mabeckers commented 3 months ago

@tqtg I had to set preencode=True (that flag means the features get pre-encoded as part of the TransformersModality init; preencoded means they are already pre-encoded from outside), but other than that it looks very good! I understand now what you meant: we use TextModality for data splitting and ID mapping on the outside and "overwrite" it on the inside with the TransformerModalities. Just running some final checks, then I will commit! Thanks for showing me this way of doing it. The only downside here is that we call vectorizer.fit_transform(self.corpus) in _build_text() of the TextModality when all we want is _swap_text() ... so a little overhead, but I'm fine doing it that way :)

tqtg commented 3 months ago

@mabeckers OK, I thought it should be encoded batch by batch during training, thus preencode=False. Anyway, please help check because I might have misinterpreted your implementation.

As for the basic text transformation overhead, I'm aware of that and it might be an issue with a big text corpus. I was thinking of using the tokenizer as the indicator of whether we want to do any transformation or not: if the tokenizer is not provided during the initialization of the TextModality, we just bypass the _build_text() call. What do you think?
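
A toy sketch of that rule (a simplified stand-in, not the actual cornac TextModality code):

```python
from typing import Callable, List, Optional


class ToyTextModality:
    """Simplified stand-in for TextModality illustrating the proposal: only run
    the (potentially expensive) text transformation when a tokenizer is given."""

    def __init__(self, corpus: List[str],
                 tokenizer: Optional[Callable[[str], List[str]]] = None):
        self.corpus = corpus
        self.tokenizer = tokenizer
        self.count_matrix = None  # bag-of-words counts, built lazily

    def build(self):
        if self.tokenizer is None:
            # No tokenizer provided -> bypass the _build_text()-style step entirely;
            # the raw corpus is kept for models that bring their own encoder (DMRL).
            return self
        vocab, rows = {}, []
        for doc in self.corpus:
            counts = {}
            for tok in self.tokenizer(doc):
                idx = vocab.setdefault(tok, len(vocab))
                counts[idx] = counts.get(idx, 0) + 1
            rows.append(counts)  # {token_id: count} per document
        self.count_matrix = rows
        return self


# Usage: with no tokenizer, build() is effectively a no-op for the text pipeline.
modality = ToyTextModality(corpus=["a b b", "c a"]).build()
assert modality.count_matrix is None
```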

mabeckers commented 3 months ago

Yeah, the tokenizer is a good idea; that would make sense. The TransformerModalities work both in batch mode and with pre-encoding. I'm just using them with pre-encoding in my examples because it makes the runtime a lot faster. I just pushed my final changes and checks. Thanks!

tqtg commented 3 months ago

> Yeah, the tokenizer is a good idea; that would make sense. The TransformerModalities work both in batch mode and with pre-encoding. I'm just using them with pre-encoding in my examples because it makes the runtime a lot faster. I just pushed my final changes and checks. Thanks!

Cool! This PR looks good to me. Let's merge it and have another one to update the TextModality. Thanks @mabeckers!