CoderWZW / ARLib

An open-source framework for conducting data poisoning attacks on recommendation systems, designed to assist researchers and practitioners.

Why does the ml-1m preprocessing keep only ratings with score >=4? #13

Closed JieHong-Liu closed 8 months ago

JieHong-Liu commented 8 months ago

Hi, thanks for your effort on this interesting project. When I ran some experiments on the ML-1M dataset, I found that the processed dataset is smaller than the raw data.

After tracing your code, I found that "ARLib/data/clean/ml-1M/split.py" contains an if statement that keeps only ratings of 4 or higher.

with open('ratings.dat') as f:
    for line in f:
        items = line.strip().split('::')
        new_line = ' '.join(items[:-1])+'\n'
        if int(items[-2])<4:
            continue
        num=random.random()
        if num > 0.2:
            train.append(new_line)
        elif num > 0.1:
            val.append(new_line)
        else:
            test.append(new_line)

I'm wondering why you do that. Thanks again for collecting these models and attack methods; it helps me a lot!

CoderWZW commented 8 months ago

Hello, thanks for your interest in ARLib.

As the ML-1M dataset contains only ratings data, we transform these ratings into implicit feedback: items with high ratings are interpreted as liked by the user. The threshold (a rating of 4 or 3, for example) is up to your setting. Items with low ratings are considered disliked by the user and are not converted into implicit feedback.

It should be noted that this data preprocessing approach is just the method we have chosen to adopt. You can preprocess the data based on your own experimental settings. We hope our response is helpful to you, and we look forward to further communication.
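
For anyone who wants to try a different threshold, here is a minimal sketch of the same idea with the cutoff exposed as a parameter. It is not the code in the repository: the function name `split_implicit` and the `threshold` and `seed` parameters are hypothetical, and it assumes each input line follows the `UserID::MovieID::Rating::Timestamp` layout of `ratings.dat`. The 80/10/10 split mirrors the snippet quoted in the question.

```python
import random


def split_implicit(lines, threshold=4, seed=None):
    """Keep ratings >= threshold as implicit feedback and split ~80/10/10.

    Hypothetical helper for illustration; not part of ARLib.
    """
    rng = random.Random(seed)  # local generator so a seed gives a repeatable split
    train, val, test = [], [], []
    for line in lines:
        fields = line.strip().split('::')
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        user, item, rating, _timestamp = fields
        if int(rating) < threshold:
            continue  # low ratings are not treated as positive feedback
        record = f"{user} {item} {rating}\n"  # timestamp dropped, as in the quoted script
        num = rng.random()
        if num > 0.2:
            train.append(record)
        elif num > 0.1:
            val.append(record)
        else:
            test.append(record)
    return train, val, test


if __name__ == "__main__":
    sample = [
        "1::10::5::978300760",
        "1::11::3::978300761",
        "2::10::4::978300762",
    ]
    train, val, test = split_implicit(sample, threshold=4, seed=0)
    print(len(train) + len(val) + len(test))  # number of interactions kept
```

Calling it with `threshold=3` would also keep the 3-star interaction, so the resulting dataset grows accordingly. Which cutoff is appropriate depends on your experimental setting.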

JieHong-Liu commented 8 months ago

Hi, thanks again for your clear explanation; it helps us a lot!