Closed josepius-clemson closed 1 year ago
dataset size should not be a problem by itself, how large is your vocab?
Vocab size: 1976192
-- Best regards, Jose Pius Nedumkallel PhD Candidate, Department of Management, Wilbur O. and Ann Powers College of Business, Clemson University, SC, USA
I think that's the issue, the vocab is probably too large.
Also note that CTM works better with very small vocab sizes, like 2k
I suppose a large dataset will have a large vocab too. Please correct me if I am wrong.
Yes, you are right. To fix this, you can keep only the most frequent words/bigrams (you often do not need to keep the entire vocab). You can also lemmatize to restrict the vocabulary even further.
I did text pre-processing as below and got len(vocab) as 1057, but I am still getting the memory error:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(2, 2), min_df=900, max_df=0.50)
Error:

0it [00:00, ?it/s]
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-18-777c581685e6> in <module>
      4 training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)
      5 ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12, num_epochs=25)  # 50 topics
----> 6 ctm.fit(training_dataset)  # run the model
      7 ctm.get_topic_lists(15)

~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in fit(self, train_dataset, validation_dataset, save_dir, verbose, patience, delta, n_samples)
    272             # train epoch
    273             s = datetime.datetime.now()
--> 274             sp, train_loss = self._train_epoch(train_loader)
    275             samples_processed += sp
    276             e = datetime.datetime.now()

~/.local/lib/python3.9/site-packages/contextualized_topic_models/models/ctm.py in _train_epoch(self, loader)
    171         samples_processed = 0
    172 
--> 173         for batch_samples in loader:
    174             # batch_size x vocab_size
    175             X_bow = batch_samples['X_bow']

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __iter__(self)
    433             return self._iterator
    434         else:
--> 435             return self._get_iterator()
    436 
    437     @property

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in _get_iterator(self)
    379         else:
    380             self.check_worker_number_rationality()
--> 381             return _MultiProcessingDataLoaderIter(self)
    382 
    383     @property

~/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py in __init__(self, loader)
   1032             #     before it starts, and __del__ tries to join but will get:
   1033             #     AssertionError: can only join a started process.
-> 1034             w.start()
   1035             self._index_queues.append(index_queue)
   1036             self._workers.append(w)

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/process.py in start(self)
    119                'daemonic processes are not allowed to have children'
    120         _cleanup()
--> 121         self._popen = self._Popen(self)
    122         self._sentinel = self._popen.sentinel
    123         # Avoid a refcycle if the target function holds an indirect

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
    222     @staticmethod
    223     def _Popen(process_obj):
--> 224         return _default_context.get_context().Process._Popen(process_obj)
    225 
    226 class DefaultContext(BaseContext):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/context.py in _Popen(process_obj)
    275         def _Popen(process_obj):
    276             from .popen_fork import Popen
--> 277             return Popen(process_obj)
    278 
    279 class SpawnProcess(process.BaseProcess):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in __init__(self, process_obj)
     17         self.returncode = None
     18         self.finalizer = None
---> 19         self._launch(process_obj)
     20 
     21     def duplicate_for_child(self, fd):

/software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/multiprocessing/popen_fork.py in _launch(self, process_obj)
     64         parent_r, child_w = os.pipe()
     65         child_r, parent_w = os.pipe()
---> 66         self.pid = os.fork()
     67         if self.pid == 0:
     68             try:

OSError: [Errno 12] Cannot allocate memory
Could you try setting num_data_loader_workers=1 in CombinedTM or ZeroShotTM?
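For anyone hitting the same failure: the traceback ends in os.fork(), which the DataLoader calls once per worker process, and each fork needs (at least transiently) a copy of the already-large parent process. Keeping data loading in the main process avoids that entirely. A minimal stand-alone sketch with a plain PyTorch DataLoader (the tensor dataset is a made-up stand-in; in this library the equivalent knob is the num_data_loader_workers argument mentioned above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset (assumption: the real CTMDataset behaves like an
# ordinary PyTorch dataset; 8 rows of 4 features here).
data = TensorDataset(torch.randn(8, 4))

# num_workers=0 iterates in the main process: no os.fork(), so no extra
# per-worker copy of the parent process's memory.
loader = DataLoader(data, batch_size=4, num_workers=0)

batches = [x for (x,) in loader]
print(len(batches))  # → 2 batches of shape (4, 4)
```

The trade-off is slower data loading, but for a bag-of-words dataset that already fits in RAM the difference is usually negligible.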
I just did, and I am getting a similar error:
0it [12:44, ?it/s] 0it [00:00, ?it/s]
OSError Traceback (most recent call last)
And this happens only with the 7GB dataset, am I right?
Yes, you are right
I am currently not sure about what could be causing the problem, but I'll look into this
Hi, it worked on the large dataset when I tried SentenceTransformer("bert-base-nli-mean-tokens") for creating the contextual embeddings. I hope it's fine to use it for building the training dataset. Please confirm.
Yeah, it should work (even if it's not the best one).
OK. Kindly let me know if you figure out which one works best.
Description
I am working on a text dataset of size 7.3 GB. I could run CTM successfully on up to 2 to 3 GB with 1 GPU, and I liked the topics generated by the model, but it fails to run with a dataset larger than 3 GB. I tried to run it with RAM only, as the program does not allow me to run on multiple GPUs. Although I have 200 GB of RAM, I am getting a memory allocation error. Could you please advise me how to overcome this issue?
What I Did