ArwaDS opened this issue 1 year ago
I figured out that Min is set to 5 and N is set to 50. What is the use of Min, given that we already set the number of posts in each chunk to 50? I noticed that the events get expanded so that they almost all end up with the same number of posts, and that consumes a large amount of RAM.
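To make sure I'm reading it right, this is roughly how I understand the chunking to work. `Min` and `N` are the repo's names, but everything else here is my own guess at the logic, not the actual code:

```python
Min = 5   # minimum number of posts an event needs to be kept
N = 50    # number of posts per chunk

def chunk_event(posts):
    """My guess at the chunking: drop small events, then split the rest
    into chunks of exactly N posts, recycling posts to fill the last one."""
    if len(posts) < Min:
        return []  # the whole event is discarded
    chunks = []
    for start in range(0, len(posts), N):
        chunk = list(posts[start:start + N])
        i = 0
        while len(chunk) < N:                    # pad the last chunk by
            chunk.append(posts[i % len(posts)])  # repeating earlier posts
            i += 1
        chunks.append(chunk)
    return chunks
```

If that is what happens, every event ends up contributing a multiple of N posts, which would explain the memory growth.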
I wanted to try different tweet embedding techniques such as GloVe. As a first step, I added a tokenizer function:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def tokenposts(posts):
    # fit a vocabulary on the posts and convert them to integer sequences
    tokenizer = Tokenizer(num_words=700)
    tokenizer.fit_on_texts(posts)
    posts = tokenizer.texts_to_sequences(posts)
    word_index = tokenizer.word_index
    vocab_size = len(word_index) + 1
    # pad/truncate every post to 70 tokens
    posts = pad_sequences(posts, padding='post', maxlen=70)
    return posts, word_index, vocab_size
```
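As the next step for GloVe, I was planning to build an embedding matrix from the returned `word_index`, roughly like this (the `glove.6B.100d.txt` filename and the 100-dimension size are just my choices, not anything from the repo):

```python
import numpy as np

def build_embedding_matrix(word_index, vocab_size,
                           glove_path='glove.6B.100d.txt', dim=100):
    """Map each word in word_index to its pretrained GloVe vector.
    Words missing from the GloVe file keep a zero vector."""
    embeddings = {}
    with open(glove_path, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

    matrix = np.zeros((vocab_size, dim))
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix
```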
and I edited the getData function as follows:

```python
def getData(dataset_file):
    ...
```
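To illustrate what I expect the output shape to be, here is a minimal sketch (not my actual getData body) of how posts padded to length 70 get grouped into chunks of 50:

```python
import numpy as np

def group_into_chunks(padded_posts, posts_per_chunk=50):
    """Group an array of shape (num_posts, 70) into (num_chunks, 50, 70)."""
    num_chunks = len(padded_posts) // posts_per_chunk
    trimmed = padded_posts[:num_chunks * posts_per_chunk]
    return trimmed.reshape(num_chunks, posts_per_chunk, -1)
```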
Can you please confirm that everything is correct? I'm still learning and I think something is wrong. My dataset contains 100,000 tweets grouped into 44 events before preprocessing, but the X_train shape after running the code came out as (66244, 50, 70), where 50 is the number of posts in each event subset and 70 is the maximum number of words per post. That would mean 66,244 × 50 = 3,312,200 post slots from only 100,000 tweets, which is why I suspect the expansion I mentioned above.
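Just to quantify the RAM concern, X_train alone is already close to a gigabyte (pad_sequences defaults to int32 token ids; with int64 it would be double):

```python
import numpy as np

# Size of the X_train array alone: 66244 chunks x 50 posts x 70 tokens
x_train_bytes = 66244 * 50 * 70 * np.dtype('int32').itemsize
print(f"X_train: {x_train_bytes / 1e9:.2f} GB")  # ~0.93 GB
```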