aviv16 / Twitter_Rumor_Detection

Final project for my software engineering studies: a recurrent neural network (LSTM) deep learning model.

Using glove Embedding #2

Open ArwaDS opened 1 year ago

ArwaDS commented 1 year ago

I wanted to try different tweet embedding techniques such as GloVe. As a first step, I added a tokenizer function:

```python
def tokenposts(posts):
    tokenizer = Tokenizer(num_words=700)
    tokenizer.fit_on_texts(posts)
    posts = tokenizer.texts_to_sequences(posts)
    word_index = tokenizer.word_index
    vocab_size = len(tokenizer.word_index) + 1
    posts = pad_sequences(posts, padding='post', maxlen=70)
    return posts, word_index, vocab_size
```
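For intuition, here is a minimal pure-Python sketch of what `num_words` and `maxlen` do (an illustration of the Keras `Tokenizer`/`pad_sequences` behavior, not the actual library code): only the most frequent tokens keep an index, and every sequence is post-padded or truncated to a fixed length.

```python
from collections import Counter

def toy_tokenize(posts, num_words=700, maxlen=70):
    # Rank words by frequency; word indices start at 1, mirroring Keras,
    # which reserves index 0 for padding.
    counts = Counter(w for p in posts for w in p.lower().split())
    word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

    sequences = []
    for p in posts:
        # keep only words whose index is below num_words
        seq = [word_index[w] for w in p.lower().split()
               if word_index.get(w, num_words) < num_words]
        # post-padding / truncation to a fixed length
        seq = (seq + [0] * maxlen)[:maxlen]
        sequences.append(seq)
    return sequences, word_index, len(word_index) + 1

seqs, wi, vocab = toy_tokenize(["rumor spreads fast", "rumor debunked"], maxlen=5)
# seqs[0] == [1, 2, 3, 0, 0]; seqs[1] == [1, 4, 0, 0, 0]
```

With `num_words=700`, any word ranked 700th or lower in frequency is silently dropped from the sequences, which is worth keeping in mind when you later look up GloVe vectors by `word_index`.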

and I edited the `getData` function as follows:

```python
def getData(dataset_file):
    data, classification, event_ids = splitFileToEvents(dataset_file)
    _x_train = np.array([])
    _y_train = np.array([])

    all_posts, padded_posts_to_event_count = getAllEvents(data)
    posts, word_index, vocab_size = tokenposts(all_posts)

    print(posts[0])

    i = 0
    prev_posts_num = 0
    # all posts are tokenized at once; now split back into per-event post series
    for event in data:
        # print('~~~~~~~~~~ Event id = ' + event_ids[i] + ' ~~~~~~~~~~')
        # print('~~~~~~~~~~ Num of posts = ' + str(len(event)) + ' ~~~~~~~~~~')
        event_posts = posts[prev_posts_num:prev_posts_num + len(event)]
        prev_posts_num += len(event)
        post_series = createPostSeries(event_posts)
        if i == 0:
            _x_train = np.array(post_series)
        else:
            _x_train = np.append(_x_train, post_series, axis=0)
        # save the corresponding classification of each post series
        _y_train = np.append(_y_train, len(post_series) * [int(classification[i])], axis=0)
        i += 1
    return _x_train, _y_train, word_index, vocab_size
```

Can you please confirm that everything is correct? I'm still learning, and I think something is wrong. My dataset contains 100,000 tweets grouped into 44 events before preprocessing, and the X_train shape after running the code became (66244, 50, 70), where 50 is the number of posts in each event subset and 70 is the maximum number of words per post.
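The thread doesn't show `createPostSeries`, but if it builds overlapping windows of 50 consecutive padded posts per event (an assumption; the function name `post_series_windows` and the step size below are hypothetical), then both the sample count and the (samples, 50, 70) shape follow from simple window arithmetic:

```python
import numpy as np

def post_series_windows(event_posts, n=50, step=1):
    """Sliding windows of n consecutive posts over one event's
    (num_posts, maxlen) array of padded token-id sequences."""
    windows = [event_posts[s:s + n]
               for s in range(0, len(event_posts) - n + 1, step)]
    return np.array(windows)

# one toy "event" with 60 posts, each padded to 70 token ids
event = np.zeros((60, 70), dtype=int)
series = post_series_windows(event)
# 60 - 50 + 1 = 11 windows, so series.shape == (11, 50, 70)
```

Summing the window counts over all 44 events would then explain how 100,000 tweets become 66,244 training samples; printing `len(post_series)` per event inside `getData` is an easy way to verify this against your data.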

ArwaDS commented 1 year ago

I figured out that Min is set to 5 and N is set to 50. What is the use of Min, since we already set the number of posts in each chunk to 50? I also noticed that the events were expanded so that almost all of them have the same number of posts, and that consumes a large amount of RAM.
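The RAM cost follows directly from the array shape: a dense float64 array of shape (66244, 50, 70), which is what the `np.append` calls in `getData` produce, needs roughly 1.85 GB for the data alone, and `np.append` copies the whole array on every call, so peak usage is much higher. A quick check:

```python
import numpy as np

# size of the dense training tensor reported above
shape = (66244, 50, 70)
n_bytes = np.prod(shape) * np.dtype(np.float64).itemsize
print(n_bytes / 1e9)  # ~1.85 GB for the data alone
```

Since the entries are token ids (small non-negative integers), storing them as `int32` would halve this, and preallocating one array of the final shape instead of calling `np.append` in a loop would avoid the repeated copies.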