Big dataset of unsolicited reviews found

Mahmoud-s-programs commented 2 weeks ago

I have written a Python script that filters out reviews containing explicit terms and stores the remaining data in a new file with the same format. As a result, the number of reviews dropped from 63000 to 11000. I checked the results and, sure enough, the remaining reviews do not contain any of the explicit terms specified in the script, including forms that carry prefixes and suffixes.

My issue is that some of the terms are not aspects, as noted by Dr. Fani. Here is the list of terms, with English translations, whose presence resulted in the deletion of a review:

كتاب - book
كتب - books
مؤلف - author (male)
كاتب - writer (male)
مؤلفة - author (female)
كاتبة - writer (female)
رواية - novel
روايات - novels
قصة - story
قصص - stories
حكاية - tale
حكايات - tales
مجلد - volume
جزء - part or section
فصول - chapters
فصل - chapter
شخصية - character
بطل - hero or protagonist (male)
بطلة - heroine or protagonist (female)
أبطال - heroes or protagonists
عدو - enemy or antagonist
أعداء - enemies or antagonists
صديق - friend (male)
صديقة - friend (female)
أصدقاء - friends
حبكة - plot
حدث - event
أحداث - events
نهاية - ending or conclusion
بداية - beginning or start
ذروة - climax
حل - resolution or solution
عقدة - conflict or knot (in a story)
أسلوب - style
لغة - language
تعبير - expression
وصف - description
سرد - narration
حوار - dialogue
كلمة - word
جملة - sentence
مفردات - vocabulary
مصطلحات - terminology or terms

As you can see, some of the terms may be implicit or may be aspects that would affect the review.

hosseinfani commented 2 weeks ago

@Mahmoud-s-programs Thank you. You can also paste your code here (as text, not an image), along with the link to the dataset's git repository, etc.

@Sepideh-Ahmadian Do you think we can assume that these reviews contain implicit aspect terms, given that we can remove the book name (from the book id) from the review text?
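To make the idea of removing the book name concrete, here is a minimal sketch. It assumes a lookup from book id to title is available; the `book_titles` dictionary, the sample ids, and the `remove_book_title` helper are all hypothetical and are not part of the dataset as described in this thread.

```python
import re

# Hypothetical lookup table (book_id -> title); an assumption for illustration only.
book_titles = {'101': 'الأيام'}

def remove_book_title(review, book_id, titles):
    """Remove the book's title from the review text, if the title is known."""
    title = titles.get(book_id)
    if not title:
        return review  # no title available, leave the review unchanged
    cleaned = review.replace(title, ' ')
    # Collapse the whitespace left behind by the removal
    return re.sub(r'\s+', ' ', cleaned).strip()

# Made-up (book_id, review) pairs, not taken from the dataset
rows = [('101', 'قرأت الأيام مرتين'), ('202', 'ممتع جدا')]
cleaned_reviews = [remove_book_title(text, book_id, book_titles) for book_id, text in rows]
print(cleaned_reviews)
```

This only handles exact occurrences of the title; inflected or abbreviated mentions of the book would need additional handling.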

Mahmoud-s-programs commented 2 weeks ago

Dataset link: https://github.com/mohamedadaly/LABR/tree/master/data
Filename: reviews.tsv

code:

import pandas as pd
import re
#Loading the dataset
column_names = ['rating', 'review_id', 'user_id', 'book_id', 'review']
df = pd.read_csv('arabic_reviews.tsv', sep='\t', names=column_names, encoding='utf-8')
# Drop missing reviews before casting to str; casting first turns NaN into the string 'nan'
df = df.dropna(subset=['review'])
df['review'] = df['review'].astype(str)

#Defining and normalizing explicit terms
explicit_terms = [
    'كتاب', 'كتب', 'مؤلف', 'كاتب', 'مؤلفة', 'كاتبة', 'رواية', 'روايات', 'قصة', 'قصص',
    'حكاية', 'حكايات', 'مجلد', 'جزء', 'فصول', 'فصل',
    'شخصية', 'بطل', 'بطلة', 'ابطال', 'عدو', 'اعداء', 'صديق', 'صديقة', 'اصدقاء',
    'حبكة', 'حدث', 'احداث', 'نهاية', 'بداية', 'ذروة', 'حل', 'عقدة',
    'اسلوب', 'لغة', 'تعبير', 'وصف', 'سرد', 'حوار', 'كلمة', 'جملة', 'مفردات', 'مصطلحات', 'مسرحية'
]

def normalize_arabic(text):
    # Normalizing certain letters
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "و", text)
    text = re.sub("ئ", "ي", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("گ", "ك", text)
    return text

explicit_terms_normalized = [normalize_arabic(term) for term in explicit_terms]

#Creating regex patterns for explicit terms with prefixes and suffixes
prefixes = ['ال', 'و', 'ف', 'ب', 'ك', 'ل', 'لل', 'س']  # Common Arabic prefixes
# Common Arabic suffixes (duplicates removed; 'ة' is omitted because normalize_arabic maps it to 'ه')
suffixes = ['ه', 'ها', 'ك', 'ي', 'هم', 'نا', 'كم', 'هن', 'تن', 'ا', 'ات', 'ون', 'ين', 'ان']

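def create_regex_pattern(term):
    # Optional prefix and suffix, written as non-capturing groups
    prefix_pattern = '(?:' + '|'.join(prefixes) + ')?'
    suffix_pattern = '(?:' + '|'.join(suffixes) + ')?'
    # Note: there are no word-boundary anchors, so re.search also matches the
    # term inside longer words; the optional groups do not narrow the match.
    return prefix_pattern + re.escape(term) + suffix_pattern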

regex_patterns = [create_regex_pattern(term) for term in explicit_terms_normalized]
combined_pattern = '|'.join(regex_patterns)
compiled_pattern = re.compile(combined_pattern)
#Filtering out reviews containing explicit terms
def contains_explicit_term(review):
    review_normalized = normalize_arabic(review)
    return bool(compiled_pattern.search(review_normalized))

mask = ~df['review'].apply(contains_explicit_term)
df_filtered = df[mask].reset_index(drop=True)

#Storing the filtered dataset
df_filtered.to_csv('arabic_reviews_filtered.tsv', sep='\t', index=False, header=False, encoding='utf-8')
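One thing worth checking in the script above: because the pattern is not anchored to word edges, a short term such as حل is found inside unrelated words, which may account for part of the drop in review count. The following is a sketch of a whole-word variant, not the script's actual behavior. The `build_whole_word_pattern` name and the sample strings are made up for illustration; the prefix and suffix lists and the normalization are copied from the script.

```python
import re

def normalize_arabic(text):
    """Same letter normalization as in the script above."""
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "و", text)
    text = re.sub("ئ", "ي", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("گ", "ك", text)
    return text

PREFIXES = ['ال', 'و', 'ف', 'ب', 'ك', 'ل', 'لل', 'س']
SUFFIXES = ['ه', 'ها', 'ك', 'ي', 'هم', 'نا', 'كم', 'هن', 'تن', 'ا', 'ات', 'ون', 'ين', 'ان']

def build_whole_word_pattern(terms):
    """Hypothetical helper: match a term only when it forms a whole word,
    optionally wrapped in one listed prefix and one listed suffix."""
    prefix = '(?:' + '|'.join(map(re.escape, PREFIXES)) + ')?'
    suffix = '(?:' + '|'.join(map(re.escape, SUFFIXES)) + ')?'
    body = '|'.join(re.escape(normalize_arabic(t)) for t in terms)
    # (?<!\w) and (?!\w) require the match to start and end at a word edge
    return re.compile(r'(?<!\w)' + prefix + '(?:' + body + ')' + suffix + r'(?!\w)')

pattern = build_whole_word_pattern(['حل'])
substring = re.compile('حل')  # unanchored search, comparable to the script's pattern

sample = normalize_arabic('رحلة جميلة')  # "a beautiful trip"; contains حل only inside رحلة
print(bool(substring.search(sample)))  # unanchored search
print(bool(pattern.search(sample)))    # whole-word search
```

With this sample, the unanchored search should report a match and the whole-word search should not. The sketch allows only one prefix per word, so stacked prefixes such as وال are not covered.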

Sepideh-Ahmadian commented 1 week ago

I believe the suggested solution for this problem may not be effective, as it is a highly delicate matter. Addressing aspects cannot be achieved through a predefined set of words. In some cases, determining the aspect and identifying whether it is implicit or not is far more complex than this approach suggests.

fani-lab / LADy

Big dataset of unsolicited reviews found #98