MilaNLProc / contextualized-topic-models

A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License
1.21k stars · 147 forks

How to work with Large dataset? #129

Closed · josepius-clemson closed this 1 year ago

josepius-clemson commented 1 year ago

Description

I am working with a 7.3 GB text dataset. I can run CTM successfully on subsets of 2-3 GB with a single GPU, and I like the topics the model generates, but it fails on anything larger than 3 GB. Since the program does not allow me to run on multiple GPUs, I tried running on RAM only; even with 200 GB of RAM, I get a memory allocation error. Could you please advise me on how to overcome this issue?

What I Did

UserWarning: This DataLoader will create 40 worker processes in total. Our suggested max number of worker in current system is 24, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(

--------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-14-777c581685e6> in <module>
      4 training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)
      5 ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12,num_epochs=25) # 50 topics
----> 6 ctm.fit(training_dataset) # run the model
      7 ctm.get_topic_lists(15)

~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in fit(self, train_dataset, validation_dataset, save_dir, verbose, patience, delta, n_samples)
    272             # train epoch
    273             s = datetime.datetime.now()
--> 274             sp, train_loss = self._train_epoch(train_loader)
    275             samples_processed += sp
    276             e = datetime.datetime.now()

~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in _train_epoch(self, loader)
    171         samples_processed = 0
    172 
--> 173         for batch_samples in loader:
    174             # batch_size x vocab_size
    175             X_bow = batch_samples['X_bow']

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __iter__(self)
    433             return self._iterator
    434         else:
--> 435             return self._get_iterator()
    436 
    437     @property

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in _get_iterator(self)
    379         else:
    380             self.check_worker_number_rationality()
--> 381             return _MultiProcessingDataLoaderIter(self)
    382 
    383     @property

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
   1032             #     before it starts, and __del__ tries to join but will get:
   1033             #     AssertionError: can only join a started process.
-> 1034             w.start()
   1035             self._index_queues.append(index_queue)
   1036             self._workers.append(w)

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/process.py in start(self)
    119                'daemonic processes are not allowed to have children'
    120         _cleanup()
--> 121         self._popen = self._Popen(self)
    122         self._sentinel = self._popen.sentinel
    123         # Avoid a refcycle if the target function holds an indirect

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
    222     @staticmethod
    223     def _Popen(process_obj):
--> 224         return _default_context.get_context().Process._Popen(process_obj)
    225 
    226 class DefaultContext(BaseContext):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
    275         def _Popen(process_obj):
    276             from .popen_fork import Popen
--> 277             return Popen(process_obj)
    278 
    279     class SpawnProcess(process.BaseProcess):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in __init__(self, process_obj)
     17         self.returncode = None
     18         self.finalizer = None
---> 19         self._launch(process_obj)
     20 
     21     def duplicate_for_child(self, fd):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in _launch(self, process_obj)
     64         parent_r, child_w = os.pipe()
     65         child_r, parent_w = os.pipe()
---> 66         self.pid = os.fork()
     67         if self.pid == 0:
     68             try:
OSError: [Errno 12] Cannot allocate memory
vinid commented 1 year ago

Dataset size should not be a problem by itself; how large is your vocab?

josepius-clemson commented 1 year ago

Vocab size: 1976192


vinid commented 1 year ago

I think that's the issue: the vocab is probably too large.

Also note that CTM works better with very small vocab sizes, like 2k

josepius-clemson commented 1 year ago

I suppose a large dataset will have a large vocab too. Please correct me if I am wrong.


vinid commented 1 year ago

Yes, you are right. To fix this you can keep only the most frequent words/bigrams (you often do not need to keep the entire vocab). You can also lemmatize to restrict the vocabulary even further; see the sketch below.
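
For example, a minimal sketch of both steps, assuming scikit-learn and NLTK are available (raw_docs is a placeholder for your corpus, and max_features=2000 is illustrative):

from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")
from sklearn.feature_extraction.text import CountVectorizer

raw_docs = ["your documents go here"]  # placeholder corpus

# lemmatize each token so inflected forms collapse into one vocab entry
lemmatizer = WordNetLemmatizer()
docs = [" ".join(lemmatizer.lemmatize(w) for w in doc.lower().split()) for doc in raw_docs]

# keep only the 2,000 most frequent terms after lemmatization
vectorizer = CountVectorizer(max_features=2000)
bow = vectorizer.fit_transform(docs)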

josepius-clemson commented 1 year ago

I did the text pre-processing below and got len(vocab) down to 1057, but I am still getting the memory error:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(2, 2), min_df=900, max_df=0.50)

Error:


0it [00:00, ?it/s]

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-18-777c581685e6> in <module>
      4 training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)
      5 ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12,num_epochs=25) # 50 topics
----> 6 ctm.fit(training_dataset) # run the model
      7 ctm.get_topic_lists(15)

[... identical stack to the traceback above: ctm.py fit/_train_epoch, torch's dataloader.py, and multiprocessing process.py/context.py/popen_fork.py ...]

OSError: [Errno 12] Cannot allocate memory


vinid commented 1 year ago

Could you try setting num_data_loader_workers=1 in CombinedTM or ZeroShotTM?
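
Something like this (a minimal sketch; the other hyperparameters just mirror the ones you used above):

from contextualized_topic_models.models.ctm import CombinedTM

ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12,
                 num_epochs=25, num_data_loader_workers=1)  # single data-loader worker
ctm.fit(training_dataset)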

josepius-clemson commented 1 year ago

I just did, and I am getting a similar error:

0it [12:44, ?it/s] 0it [00:00, ?it/s]


OSError                                   Traceback (most recent call last)
in <module>
      4 training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)
      5 ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12,num_epochs=25,num_data_loader_workers=1) # 50 topics
----> 6 ctm.fit(training_dataset) # run the model
      7 ctm.get_topic_lists(15)

[... identical stack to the first traceback above ...]

OSError: [Errno 12] Cannot allocate memory
vinid commented 1 year ago

And this happens only with the 7GB dataset, am I right?

josepius-clemson commented 1 year ago

Yes, you are right


vinid commented 1 year ago

I am currently not sure what could be causing the problem, but I'll look into it.

josepius-clemson commented 1 year ago

Hi, it worked on the large dataset when I used SentenceTransformer("bert-base-nli-mean-tokens") to create the contextual embeddings. I hope it's fine to use it for building the training dataset. Please confirm.
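
For reference, a minimal sketch of how the embeddings can be created this way (documents is a placeholder for the corpus, and batch_size is illustrative):

from sentence_transformers import SentenceTransformer

documents = ["your documents go here"]  # placeholder corpus
model = SentenceTransformer("bert-base-nli-mean-tokens")
# encode in batches; returns a numpy array of shape (n_docs, 768)
train_contextualized_embeddings = model.encode(documents, batch_size=64, show_progress_bar=True)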

vinid commented 1 year ago

Yea, it should work (even if it's not the best one)
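
For example, swapping the model name is the only change needed to try a stronger general-purpose encoder (all-mpnet-base-v2 is one commonly used alternative; treat it as an illustration, not an endorsement from this thread):

model = SentenceTransformer("all-mpnet-base-v2")  # drop-in replacement for the model above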


josepius-clemson commented 1 year ago

OK. Kindly let me know if you find out which embedding model works best.
