MilaNLProc / contextualized-topic-models

A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License
1.21k stars 147 forks

customize torch/transformers cache directory path #105

Closed ShuzhouYuan closed 2 years ago

ShuzhouYuan commented 2 years ago

Hello! Since I'm working on a server where I don't have write permission for the default cache directory, I always get a Permission Denied error. Is there a way to customize the cache directory, as with other transformer models via cache_dir='your/cache/path'? I tried this, but it doesn't seem to be a parameter of your model. Thank you very much!

vinid commented 2 years ago

Hi!

Could you tell me where this issue arises?

ShuzhouYuan commented 2 years ago

training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

vinid commented 2 years ago

Thanks, could you also share the stack trace?


ShuzhouYuan commented 2 years ago
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-8-e866ba0b7c0c> in <module>
----> 1 training_dataset = qt.fit(text_for_contextual=unpreprocessed_documents, text_for_bow=preprocessed_documents)

~/.local/lib/python3.6/site-packages/contextualized_topic_models/utils/data_preparation.py in fit(self, text_for_contextual, text_for_bow, labels)
     67 
     68         train_bow_embeddings = self.vectorizer.fit_transform(text_for_bow)
---> 69         train_contextualized_embeddings = bert_embeddings_from_list(text_for_contextual, self.contextualized_model)
     70         self.vocab = self.vectorizer.get_feature_names()
     71         self.id2token = {k: v for k, v in zip(range(0, len(self.vocab)), self.vocab)}

~/.local/lib/python3.6/site-packages/contextualized_topic_models/utils/data_preparation.py in bert_embeddings_from_list(texts, sbert_model_to_load, batch_size)
     33     Creates SBERT Embeddings from a list
     34     """
---> 35     model = SentenceTransformer(sbert_model_to_load)
     36     return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))
     37 

~/.local/lib/python3.6/site-packages/sentence_transformers/SentenceTransformer.py in __init__(self, model_name_or_path, modules, device, cache_folder)
     82                                     library_name='sentence-transformers',
     83                                     library_version=__version__,
---> 84                                     ignore_files=['flax_model.msgpack', 'rust_model.ot', 'tf_model.h5'])
     85 
     86             if os.path.exists(os.path.join(model_path, 'modules.json')):    #Load as SentenceTransformer model

~/.local/lib/python3.6/site-packages/sentence_transformers/util.py in snapshot_download(repo_id, revision, cache_dir, library_name, library_version, user_agent, ignore_files)
    450             os.path.join(storage_folder, relative_filepath)
    451         )
--> 452         os.makedirs(nested_dirname, exist_ok=True)
    453 
    454         path = cached_download(

/usr/lib/python3.6/os.py in makedirs(name, mode, exist_ok)
    208     if head and tail and not path.exists(head):
    209         try:
--> 210             makedirs(head, mode, exist_ok)
    211         except FileExistsError:
    212             # Defeats race condition when another thread created the path

/usr/lib/python3.6/os.py in makedirs(name, mode, exist_ok)
    208     if head and tail and not path.exists(head):
    209         try:
--> 210             makedirs(head, mode, exist_ok)
    211         except FileExistsError:
    212             # Defeats race condition when another thread created the path

/usr/lib/python3.6/os.py in makedirs(name, mode, exist_ok)
    208     if head and tail and not path.exists(head):
    209         try:
--> 210             makedirs(head, mode, exist_ok)
    211         except FileExistsError:
    212             # Defeats race condition when another thread created the path

/usr/lib/python3.6/os.py in makedirs(name, mode, exist_ok)
    218             return
    219     try:
--> 220         mkdir(name, mode)
    221     except OSError:
    222         # Cannot rely on checking for EEXIST, since the operating system

PermissionError: [Errno 13] Permission denied: '/.cache'

I think the problem is that I don't have permission to write to the default cache directory on the server. I had the same error before with transformer models; what I did was customize the cache directory:

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', cache_dir='your/cache/directory')

Is there somewhere I can change the cache directory path here as well? Thanks!
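As a quick way to confirm this diagnosis, one can check whether the process can actually write to the default cache location before loading any model. A minimal sketch (is_writable is a hypothetical helper, not part of the package; '/.cache' is the path from the traceback above):

```python
import os

def is_writable(path: str) -> bool:
    """Walk up to the nearest existing ancestor of `path`, then test write access."""
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return os.access(path, os.W_OK)

# On the server from the traceback this would print False for '/.cache'
print(is_writable('/.cache'))
```

If this returns False for the default cache path, pointing the cache at a writable directory (as below) is the fix.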

ShuzhouYuan commented 2 years ago

I've found a solution:

import os
os.environ['TORCH_HOME'] = 'your/cache/path'
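For reference, setting the cache-related environment variables before any library imports is the most portable form of this workaround. A sketch, assuming 'your/cache/path' is a placeholder for a directory you can write to; SENTENCE_TRANSFORMERS_HOME and HF_HOME are additional variables read by sentence-transformers and the Hugging Face libraries, respectively:

```python
import os

# These must be set *before* importing torch / sentence-transformers /
# transformers, since the cache locations are resolved at import or first load.
cache_path = 'your/cache/path'  # placeholder: any directory you can write to

os.environ['TORCH_HOME'] = cache_path                  # torch.hub cache
os.environ['SENTENCE_TRANSFORMERS_HOME'] = cache_path  # SBERT model cache
os.environ['HF_HOME'] = cache_path                     # Hugging Face cache root
```

After this, the original call (training_dataset = tp.fit(...)) should download the SBERT model into the custom path instead of '/.cache'.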

vinid commented 2 years ago

Wow, nice! :)

Happy you solved the problem :)