DeepsMoseli / Bidirectiona-LSTM-for-text-summarization-

A bidirectional encoder-decoder LSTM neural network trained for text summarization on the CNN/DailyMail dataset. (MIT808 project)
MIT License

Word2vec.py #2

Open sanjayb678 opened 6 years ago

sanjayb678 commented 6 years ago

In the function word2vecmodel, the model that is saved as word2vec throws an error that it is not UTF-8 encoded ("Saving disabled"). FYI, I'm running the code in a Jupyter notebook. Thanks in advance, and let me know if I'm doing something wrong.

DeepsMoseli commented 6 years ago

Hi @sanjayb678, I wrote and ran the whole script in Spyder (Python 3.6). I would advise you to keep that same configuration first, as I have not tested whether the code works exactly the same in a notebook. Saving shouldn't be a problem as far as I know; however, you can skip over this line as long as the model is in memory.
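
For anyone hitting the same save error, below is a minimal sketch of the save/reload step using gensim; the filename, toy corpus, and hyperparameters are illustrative, not the repo's, and gensim 3.x takes size where gensim 4 takes vector_size. If saving fails in the notebook, that line can be skipped, since the in-memory model is all the rest of the script needs within the same session.

from gensim.models import Word2Vec

# Tiny illustrative corpus; the real script trains on the CNN/DailyMail text.
sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]
model = Word2Vec(sentences, size=50, min_count=1, sg=1)  # sg=1 selects skip-gram

model.save("word2vec.model")             # the step that can be skipped if it errors
model = Word2Vec.load("word2vec.model")  # reload later instead of retraining
print(model.wv["cat"][:5])               # first 5 dimensions of one embedding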

PratikNalage commented 5 years ago

1) Where is the pre-trained model for Word2Vec?

2) Error in Word2Vec.py

  File "word2vec.py", line 178, in <module>
    corpus = createCorpus(data)
NameError: name 'data' is not defined

amanjaswani commented 5 years ago

Why have you done label_encoder,onehot_encoded,onehot=summonehot(data["summaries"])?

Shouldn't the function argument be corpus instead of data["summaries"]?

MuruganR96 commented 5 years ago

@PratikNalage

In cnn_daily_load.py, you can create a function like this (note that data must be initialized inside it, which is what your NameError is about):

def cnn_daily_load():
    # load_data, parsetext, cleantext, datasets and data_categories all come
    # from the repo's existing loading code.
    data = {"articles": [], "summaries": []}  # initialize; this fixes the NameError
    filenames = load_data(datasets["cnn"], data_categories[0])

    # ----------load the data, sentences and summaries-----------
    # The filenames alternate, so even indices hold articles and odd
    # indices hold the matching summaries.
    for k in range(len(filenames[:400])):
        if k % 2 == 0:
            try:
                data["articles"].append(cleantext(parsetext(datasets["cnn"], data_categories[0], "%s" % filenames[k])))
            except Exception as e:
                data["articles"].append("Could not read")
                print(e)
        else:
            try:
                data["summaries"].append(cleantext(parsetext(datasets["cnn"], data_categories[0], "%s" % filenames[k])))
            except Exception as e:
                data["summaries"].append("Could not read")
                print(e)
    return data

Then simply import it in word2vec.py:

from cnn_daily_load import cnn_daily_load, cleantext
data = cnn_daily_load()

As for your first question ("Where is the pre-trained model for Word2Vec?"):

I think we are simply using the skip-gram algorithm to generate our own word embeddings, which is why no pre-trained Word2Vec model is needed. It is just another way of generating word embeddings.
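
To illustrate what that means in code, here is a minimal sketch with gensim 3.x; the toy corpus and parameter values are mine, not the repo's. sg=1 selects skip-gram, and every vector is learned from this corpus alone, so nothing pre-trained is downloaded or loaded.

from gensim.models import Word2Vec

# Stand-in for the tokenized corpus that word2vec.py builds with createCorpus(data).
corpus = [["police", "arrested", "two", "men"], ["two", "men", "were", "arrested"]]

# Train skip-gram embeddings from scratch on our own text.
model = Word2Vec(corpus, size=100, window=5, min_count=1, sg=1, iter=10)
print(model.wv.most_similar("arrested", topn=2))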

DeepsMoseli commented 5 years ago

"Pre-trained" in the sense that I do not train it together with the neural network; I pretrain the skip-gram model separately.


MuruganR96 commented 5 years ago

Not really, @DeepsMoseli. Here you are using gensim's skip-gram algorithm (word2vec) to build a normal word2vec model and then generating embeddings for the words, training from scratch.

Great stuff. We never used a pre-trained word2vec model here.
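
As a rough sketch of that flow, assuming a trained gensim 3.x skip-gram model named model as in the earlier snippets (the variable names here are illustrative, not the repo's), the learned vectors can be packed into a matrix for the network's embedding layer:

import numpy as np

vocab = list(model.wv.vocab)  # gensim 3.x vocabulary access (a dict of words)
embedding_matrix = np.zeros((len(vocab) + 1, model.wv.vector_size))  # row 0 for padding
word_index = {w: i + 1 for i, w in enumerate(vocab)}
for w, i in word_index.items():
    embedding_matrix[i] = model.wv[w]  # vectors learned above, not downloaded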

@amanjaswani, I did not fully understand your question, but let me give you a hand.

label_encoder,onehot_encoded,onehot=summonehot(data["summaries"])

label_encoder produces the training labels, while the word2vec embeddings represent the training data (the input words).
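
summonehot is defined in the repo; as a hedged sketch of the pattern such a function typically follows, here is one way to produce those three return values with scikit-learn. This is an assumption about its shape, not the repo's exact implementation.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

summaries = ["police arrest two men", "storm hits the coast"]  # illustrative

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(summaries)  # strings -> class indices
onehot_encoder = OneHotEncoder(sparse=False)  # sklearn < 1.2; newer versions use sparse_output=False
onehot = onehot_encoder.fit_transform(integer_encoded.reshape(-1, 1))
# label_encoder gives the training labels; onehot is their one-hot form.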