Closed kmezhoud closed 2 years ago
When I run the tutorial08 from a script, I got this:
python3.6 full_pipeline_melusine.py
16/09 12:16 - melusine.nlp_tools.phraser - INFO - Start training for colocation detector
16/09 12:16 - melusine.nlp_tools.phraser - INFO - Done.
2021-09-16 12:16:14.114440: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-09-16 12:16:14.114489: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "full_pipeline_melusine.py", line 122, in <module>
nn_model.fit(X,y)
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 241, in fit
X_input_train, y_categorical_train = self._prepare_data(X_train, y_train)
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 494, in _prepare_data
self._get_embedding_matrix()
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 385, in _get_embedding_matrix
self.vocabulary = pretrained_embedding.embedding.index_to_key
AttributeError: 'NoneType' object has no attribute 'index_to_key'
Hi Kirus, thank you for this detailed issue. The team plans to brainstorm next week to handle the open issues. We will take the time to go through yours and provide a solution.
For now, on the second point, I think I have already encountered this issue and fixed it by replacing line 385, self.vocabulary = pretrained_embedding.embedding.index_to_key, with self.vocabulary = pretrained_embedding.embedding. This requires modifying your local copy of Melusine. The change is due to a Gensim upgrade. Let us know if this works. It is just a quick thought, but we will go through it more deeply soon to provide a complete solution.
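For context, the rename behind this error comes from the Gensim 3 to 4 migration, where the vocabulary list on KeyedVectors moved from index2word to index_to_key. A version-tolerant accessor could be sketched as follows (FakeKV is a hypothetical stand-in for gensim.models.keyedvectors.KeyedVectors, used only for illustration):

```python
# Sketch of a Gensim-version-tolerant vocabulary accessor.
def get_vocabulary(keyed_vectors):
    if hasattr(keyed_vectors, "index_to_key"):   # Gensim >= 4.0
        return keyed_vectors.index_to_key
    return keyed_vectors.index2word              # Gensim 3.x

class FakeKV:
    # Hypothetical stand-in for gensim.models.keyedvectors.KeyedVectors
    index_to_key = ["bonjour", "merci", "client"]

print(get_vocabulary(FakeKV()))  # ['bonjour', 'merci', 'client']
```

Note that this only helps once pretrained_embedding.embedding actually holds a KeyedVectors object; it does not address the case where that attribute is None.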
Thank you for your help and patience. Best regards.
awk '{if(NR==385) print $0}' /home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py
self.vocabulary = pretrained_embedding.embedding
python3.6 full_pipeline_melusine.py
16/09 03:47 - melusine.nlp_tools.phraser - INFO - Start training for colocation detector
16/09 03:47 - melusine.nlp_tools.phraser - INFO - Done.
2021-09-16 15:47:51.936383: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-09-16 15:47:51.936420: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "full_pipeline_melusine.py", line 122, in <module>
nn_model.fit(X,y)
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 241, in fit
X_input_train, y_categorical_train = self._prepare_data(X_train, y_train)
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 494, in _prepare_data
self._get_embedding_matrix()
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 386, in _get_embedding_matrix
vocab_size = len(self.vocabulary)
TypeError: object of type 'NoneType' has no len()
Thank you for your feedback. We need to investigate more deeply then. We will come back to you soon. Best,
Hello,
I tried to reproduce your bug as follows:
I was not able to reproduce your bug :/
From the error you posted, it seems like the embedding attribute of your pretrained_embedding object has a None value, which is surprising.
Could you make the following check?
print(type(pretrained_embedding)) # This should be melusine.nlp_tools.embedding.Embedding
print(type(pretrained_embedding.embedding)) # This should be gensim.models.keyedvectors.KeyedVectors
print(type(pretrained_embedding.embedding.index_to_key)) # This should be list
If the Gensim object is None, your bug probably comes from the embedding training.
I hope this helps, let us know how it goes :)
Thanks! I will focus on tutorial 04 to check the embedding process against my install. I added the 3 prints to the .py file and reran the code. It gives the same error, without any output from the 3 prints. Thanks.
here is the file
#!/usr/bin/env python3.6
from melusine.data.data_loader import load_email_data
import ast
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from melusine.utils.multiprocessing import apply_by_multiprocessing
from melusine.utils.transformer_scheduler import TransformerScheduler
from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
from melusine.prepare_email.manage_transfer_reply import update_info_for_transfer_mail
from melusine.prepare_email.manage_transfer_reply import add_boolean_transfer
from melusine.prepare_email.manage_transfer_reply import add_boolean_answer
from melusine.prepare_email.build_historic import build_historic
from melusine.prepare_email.mail_segmenting import structure_email
from melusine.prepare_email.body_header_extraction import extract_last_body
from melusine.prepare_email.cleaning import clean_body
from melusine.prepare_email.cleaning import clean_header
from melusine.nlp_tools.phraser import Phraser
from melusine.nlp_tools.phraser import phraser_on_body
from melusine.nlp_tools.phraser import phraser_on_header
from melusine.nlp_tools.tokenizer import Tokenizer
from melusine.nlp_tools.embedding import Embedding
from melusine.summarizer.keywords_generator import KeywordsGenerator
from melusine.prepare_email.metadata_engineering import MetaExtension
from melusine.prepare_email.metadata_engineering import MetaDate
from melusine.prepare_email.metadata_engineering import MetaAttachmentType
from melusine.prepare_email.metadata_engineering import Dummifier
df_emails = load_email_data()
df_emails['attachment'] = df_emails['attachment'].apply(ast.literal_eval)
ManageTransferReplyTransformer = TransformerScheduler(
functions_scheduler=[
(check_mail_begin_by_transfer, None, ['is_begin_by_transfer']),
(update_info_for_transfer_mail, None, None),
(add_boolean_answer, None, ['is_answer']),
(add_boolean_transfer, None, ['is_transfer'])
]
)
df_emails = ManageTransferReplyTransformer.fit_transform(df_emails)
SegmentingTransformer = TransformerScheduler(
functions_scheduler=[
(build_historic, None, ['structured_historic']),
(structure_email, None, ['structured_body'])
]
)
df_emails = SegmentingTransformer.fit_transform(df_emails)
LastBodyHeaderCleaningTransformer = TransformerScheduler(
functions_scheduler=[
(extract_last_body, None, ['last_body']),
(clean_body, None, ['clean_body'])
]
)
df_emails = LastBodyHeaderCleaningTransformer.fit_transform(df_emails)
phraser = Phraser()
phraser.train(df_emails)
PhraserTransformer = TransformerScheduler(
functions_scheduler=[
(phraser_on_body, (phraser,), ['clean_body'])
]
)
df_emails = PhraserTransformer.fit_transform(df_emails)
tokenizer = Tokenizer(input_column="clean_body")
df_emails = tokenizer.fit_transform(df_emails)
# Pipeline to extract dummified metadata
MetadataPipeline = Pipeline([
('MetaExtension', MetaExtension()),
('MetaDate', MetaDate()),
('MetaAttachmentType', MetaAttachmentType()),
('Dummifier', Dummifier())
])
df_meta = MetadataPipeline.fit_transform(df_emails)
keywords_generator = KeywordsGenerator(n_max_keywords=4)
df_emails = keywords_generator.fit_transform(df_emails)
pretrained_embedding = Embedding(input_column='clean_body',
workers=1,
min_count=5)
print(type(pretrained_embedding))
import pandas as pd
from sklearn.preprocessing import LabelEncoder
X = pd.concat([df_emails['clean_body'],df_meta],axis=1)
y = df_emails['label']
le = LabelEncoder()
y = le.fit_transform(y)
from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel
nn_model = NeuralModel(architecture_function=cnn_model,
pretrained_embedding=pretrained_embedding,
text_input_column="clean_body",
meta_input_list=['extension', 'dayofweek','hour', 'min', 'attachment_type'],
n_epochs=10)
nn_model.fit(X,y)
y_res = nn_model.predict(X)
y_res = le.inverse_transform(y_res)
y_res
print(y_res)
print(type(pretrained_embedding)) # This should be melusine.nlp_tools.embedding.Embedding
print(type(pretrained_embedding.embedding)) # This should be gensim.models.keyedvectors.KeyedVectors
print(type(pretrained_embedding.embedding.index_to_key)) # This should be list
import numpy as np
import pandas as pd
prediction = pd.DataFrame(y_res, columns=['predictions']).to_csv('prediction.csv')
Hello @kmezhoud, I think I found where your problem comes from. Have you trained your embedding object?
When you instantiate a (Melusine) Embedding object, the attribute Embedding.embedding is initialized to None. (This seems to be your problem.)
When you run the train method, the Embedding.embedding attribute is set to a gensim.models.keyedvectors.KeyedVectors object.
Hope it helps
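The lifecycle described above can be sketched in a few lines of pure Python. EmbeddingSketch is a hypothetical stand-in for melusine.nlp_tools.embedding.Embedding, used only to illustrate the None-until-trained behaviour:

```python
class EmbeddingSketch:
    """Hypothetical stand-in mimicking melusine.nlp_tools.embedding.Embedding."""
    def __init__(self, input_column):
        self.input_column = input_column
        self.embedding = None  # stays None until train() is called

    def train(self, corpus):
        # In Melusine this would build a gensim KeyedVectors object;
        # here we just populate a toy word -> vector mapping.
        self.embedding = {word: [0.0] for doc in corpus for word in doc.split()}

emb = EmbeddingSketch(input_column="clean_body")
assert emb.embedding is None      # untrained: the state behind the AttributeError
emb.train(["bonjour client", "merci"])
assert emb.embedding is not None  # trained: safe to pass to a NeuralModel
```

In the pasted script, pretrained_embedding is instantiated but its train method is never called before nn_model.fit(X, y), which matches this untrained state.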
Hi, I guess training the embedding before instantiating a NeuralModel should fix the bug. I'll close the issue; feel free to re-open if needed.
Dear all, I cannot run the following training from an R environment.
When I import the cnn_model class, I got:
Any idea is welcome. Thanks.
I have libcudart 9.1 and not libcudart.so.10.1.
I do not have a GPU set up, nor an NVIDIA card.
Python version : 3.6
Melusine version : 2.3.1
Operating System : Ubuntu 18.04.5 LTS on a MacBook Pro Retina
Python Module versions