biolab / orange3-text

🍊 Text Mining add-on for Orange3

Topic Modeling Crash #820

Closed layudhi closed 10 months ago

layudhi commented 2 years ago

What's wrong?

Topic Modeling crashes when used with the Twitter widget.

How can we reproduce the problem?

Please see the screenshots. I cannot save the workflow with Topic Modeling connected, because the app freezes and then shows "Not Responding" after I connect the widget from Corpus.

What's your environment?

  1. Installed by clicking the standalone installation file.
  2. Orange installation folder: C:\Program Files\Orange
  3. Miniconda3 installation folder: C:\Users\ASUS\miniconda3

[Screenshots: 2022-04-19 165813, 2022-04-19 170110]

ajdapretnar commented 2 years ago

First of all, don't use another Corpus between Preprocess Text and Topic Modelling because you will override all the preprocessing results.

Second, I believe this is the reason for the crash. You did the preprocessing, but overrode it, so the default preprocessing is now run, resulting in a large number of tokens that don't fit into your RAM. Decrease the number of tokens and try again.

ajdapretnar commented 2 years ago

I've collated all the reports of the same kind. Two other users say Topic Modelling crashes on Windows (it doesn't happen on Mac). One user doesn't even have a large corpus (174 documents and 909 tokens).

We need to research why this happens. @noahnovsak @PrimozGodec

PrimozGodec commented 2 years ago

@djukicn has already identified the reason for the crashes. I think the crash described in this issue has the same source.

The problem is that corpus._ngrams_corpus stays defined when tokens are reset (by the Corpus widget). Something similar happens when subsampling bag-of-words (bow) features: _ngrams_corpus is subsampled row-wise but stays the same column-wise, while the dictionary is different. In those cases we would also need to reset or subsample _ngrams_corpus column-wise, or remember the dictionary of columns.

Anyway, we came up with a different solution that can solve all these problems and minimize the probability of errors: we would deprecate _ngrams_corpus and take the bag-of-words counts from the Table if they exist (columns with the bow attribute flag). If they do not exist, Topic Modelling would compute the bow features as it already does (when _ngrams_corpus is not defined).

I think this solution would minimize the probability that something does not work, and it would even give users the option to manipulate the bag-of-words features before topic modelling.
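The stale-cache problem described in this comment can be illustrated with a toy, pure-Python sketch. It does not use Orange or gensim: `ToyCorpus`, `cached_bow`, `reset_tokens` and the helper functions are hypothetical names made up for illustration, with `cached_bow` standing in for `_ngrams_corpus`. It only shows how a cached count matrix can disagree with a rebuilt dictionary, and how clearing the cache avoids that.

```python
from collections import Counter


def build_dictionary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in token_lists for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}


def bow_from_tokens(token_lists, dictionary):
    """Return one row of counts per document, one column per dictionary entry."""
    rows = []
    for doc in token_lists:
        counts = Counter(doc)
        rows.append([counts.get(tok, 0) for tok in dictionary])
    return rows


class ToyCorpus:
    """Hypothetical stand-in for a corpus that caches its bag-of-words matrix."""

    def __init__(self, token_lists):
        self.tokens = token_lists
        self.dictionary = build_dictionary(token_lists)
        self.cached_bow = None  # plays the role of _ngrams_corpus

    def reset_tokens(self, token_lists, clear_cache):
        """Replace the tokens; optionally drop the cached matrix as well."""
        self.tokens = token_lists
        self.dictionary = build_dictionary(token_lists)
        if clear_cache:
            self.cached_bow = None

    def bow(self):
        """Reuse the cache if it is defined, otherwise recompute it."""
        if self.cached_bow is None:
            self.cached_bow = bow_from_tokens(self.tokens, self.dictionary)
        return self.cached_bow

    def is_consistent(self):
        """Check that every row has one column per dictionary entry."""
        return all(len(row) == len(self.dictionary) for row in self.bow())


new_tokens = [["topic", "model", "crash", "windows"], ["model"]]

corpus = ToyCorpus([["topic", "model"], ["model"]])
corpus.bow()  # fills the cache with a 2-column matrix

# Tokens are reset but the cache survives: 2 cached columns, 4 dictionary entries.
corpus.reset_tokens(new_tokens, clear_cache=False)
stale_ok = corpus.is_consistent()

# Clearing the cache forces a recomputation that matches the new dictionary.
corpus.reset_tokens(new_tokens, clear_cache=True)
fresh_ok = corpus.is_consistent()

print("consistent with stale cache:", stale_ok)
print("consistent after clearing cache:", fresh_ok)
```

In this toy setup the proposal from the comment corresponds to never keeping a separate cache at all and always deriving the counts from the current data.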

nadiaelen commented 2 years ago

Hi, I am sorry, perhaps I am in the wrong place, but is there a solution for this? I can't recover my work, orange crashes after the latest update when running topic modeling. I use LDA...

NAsic123 commented 2 years ago

I have the same problem: Topic Modelling crashes when I try to run it.

PrabodhaCha commented 2 years ago

I have the same issue, on Windows. It crashes even with 100 tweets. [image]

nadiaelen commented 2 years ago

Hi, any news on this? Thanks...

ajdapretnar commented 2 years ago

We just released Orange3-Text v. 1.10.0. Please update the add-on and let us know if it works.

If not, we would appreciate it if you could provide a workflow, a data sample (if possible), and the pip freeze output if you installed Orange via the terminal.

NAsic123 commented 2 years ago

@ajdapretnar I updated the add-on and it still crashes. I am sending additional information below.

What's your environment/workflow?

How you installed Orange:

My data sample is here. kurent_neprecisceno.xlsx

Pip freeze output:

When I run Topic Modelling it is like this for a few minutes: [image]

And then it is like: [image]

And then it says: [image]

ajdapretnar commented 2 years ago

@NAsic123 What happens if you select to wait?

ajdapretnar commented 2 years ago

Also, I tried it on Mac with your data. I am assuming you are using the default preprocessing and LDA? It works normally for me. @djukicn Any ideas? Could it be a Windows issue?

NAsic123 commented 2 years ago

@ajdapretnar thank you for your answer and help. I will run LDA and I will leave it running and see what will happen. Then I will report what happens. Maybe it needs extra time.

No, I am not using the default preprocessing. I am using these preprocessors: [image]

ajdapretnar commented 2 years ago

Two comments, unrelated to the crashing widget. In Preprocess Text, you don't need Regexp, because the tokenization you've set already omits all punctuation. Also, your POS tag filter doesn't do anything, because your data is not tagged, so filtering cannot work.

NAsic123 commented 2 years ago

@ajdapretnar thank you. I will correct it.

I let LDA run for one hour; it was still at 0 % and then it crashed.

djukicn commented 2 years ago

@NAsic123 Is your gensim version currently 4.2.0? If so, could you please install 4.1.2 and see whether the same problem occurs?

NAsic123 commented 2 years ago

@djukicn I am sorry for the very basic question, but where do I install gensim? I am mostly an R user, not a Python user, so I'm not that familiar with Python.

Do I have to type this in the Orange Command Prompt: C:\Users\amit_>pip install gensim ?

ajdapretnar commented 2 years ago

@NAsic123 Sorry, this was a bit technical. I would be wary of tampering with the set up environment. However, there is a special program called Orange Command Prompt, which you can find from the start menu. Open it and first run pip freeze and post the output here. This will help us identify which version of the gensim library you currently have. Then we can proceed with carefully downgrading and then upgrading again.

NAsic123 commented 2 years ago

@ajdapretnar thank you for your instruction. I ran pip freeze and got this information below (I copied and saved it in .txt). I hope it is useful and thank you again.

pip freeze.txt

ajdapretnar commented 2 years ago

Ok, it does indeed seem like you have gensim==4.2.0. Now please try running pip install gensim==4.1.2. Then open Orange and see if Text works. Please let me know.
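For readers who would rather check the installed version from Python than read through the full pip freeze output, here is a minimal sketch using only the standard library. It assumes it is run with the same Python that Orange uses (for example from the Orange Command Prompt); the function names are made up for illustration, and the "reported" list contains only the version named in this thread.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_reported_problem_version(version, reported=("4.2.0",)):
    """True if `version` is one of the versions reported as problematic here."""
    return version in reported


version = installed_version("gensim")
if version is None:
    print("gensim is not installed in this Python environment")
elif is_reported_problem_version(version):
    print(f"gensim {version} is the version reported in this thread; "
          "the suggested workaround was: pip install gensim==4.1.2")
else:
    print(f"gensim {version} is installed")
```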

NAsic123 commented 2 years ago

Thank you, I installed it and now it works; Topic Modelling does not crash. But I get this notification: __init__() got an unexpected keyword argument 'random_seed'. Could it be that I entered the wrong Preprocessing settings? [image]

My Preprocessing settings: [image]

ajdapretnar commented 2 years ago

@NAsic123 No, your preprocessing is fine. This one needs to be solved by the core team.

@djukicn It seems like gensim==4.1.2 works. But the LSI model has the new random_seed parameter, added in version 4.2.0. Do you perhaps have an idea what causes gensim 4.2.0 to not work on Windows?
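The "unexpected keyword argument" message is what Python raises when a constructor is called with a keyword that the installed version does not define. One general way to stay compatible with both versions, shown here as a sketch and not as what Orange3-Text actually does, is to inspect the constructor's signature before passing the keyword. `OldStyleModel` and `NewStyleModel` are hypothetical stand-ins for a model class before and after the `random_seed` parameter was added; no gensim code is involved.

```python
import inspect


def accepts_keyword(callable_obj, name):
    """Return True if `callable_obj` can be called with the keyword `name`."""
    params = inspect.signature(callable_obj).parameters
    if name in params:
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def build_model(model_cls, random_seed=None, **kwargs):
    """Pass `random_seed` only when the constructor declares it."""
    if random_seed is not None and accepts_keyword(model_cls, "random_seed"):
        kwargs["random_seed"] = random_seed
    return model_cls(**kwargs)


class OldStyleModel:
    """Hypothetical model whose constructor has no random_seed parameter."""

    def __init__(self, num_topics=10):
        self.num_topics = num_topics
        self.random_seed = None


class NewStyleModel:
    """Hypothetical model whose constructor accepts random_seed."""

    def __init__(self, num_topics=10, random_seed=None):
        self.num_topics = num_topics
        self.random_seed = random_seed


old_model = build_model(OldStyleModel, random_seed=42, num_topics=5)
new_model = build_model(NewStyleModel, random_seed=42, num_topics=5)
print(old_model.random_seed, new_model.random_seed)
```

The trade-off is that the seed is silently dropped on the older signature, so results there are not reproducible; pinning a minimum library version avoids that ambiguity.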

NAsic123 commented 2 years ago

@ajdapretnar thank you. I ran LDA and I get the results, so now it works. Thank you so much for your help. But with LDA I also get the same message about random_seed. [image]

djukicn commented 2 years ago

@ajdapretnar I was actually able to reproduce the error on Ubuntu (although for me the results were produced after clicking "Wait" a few times), so it's not just a Windows issue. Somewhere in the background gensim raises an exception. I'll look into it today and see what can be done.

ajdapretnar commented 2 years ago

@djukicn Fantastic! Thanks!

nadiaelen commented 2 years ago

It would be highly appreciated if you could also provide info on updating that gensim library. I understand that's where the issue might be, but I don't know how to update it. Thank you.

ajdapretnar commented 2 years ago

@nadiaelen We haven't been able to identify the real culprit. Once we do, we will prepare the fix and link it to this thread.

nadiaelen commented 2 years ago

Any news? I have been stuck for two months and would really appreciate some help. Thank you.

ajdapretnar commented 2 years ago

@nadiaelen Unfortunately, no progress so far. We simply cannot consistently reproduce the issue on any machine. Once the update comes, it will be posted in the thread.

nadiaelen commented 2 years ago

any news?

PrimozGodec commented 2 years ago

We have reported the issue to Gensim (the library that computes the topics) and hope they will consider it soon: https://github.com/RaRe-Technologies/gensim/issues/3368

ajdapretnar commented 2 years ago

@nadiaelen If you are on Windows, you could try opening Orange Command Prompt (a separate program available from Start menu). Then enter pip install gensim==4.1.2. Hopefully, this will solve the issue. It's the best I can give at the moment.

Katzengurke commented 2 years ago

Hey, I tried a source installation with the merged files from #885 (newest biolab repository), but it didn't work. Orange still crashes. I also tried different versions of orange3 with older add-on versions; that didn't work either. I'm using Windows 11. Any idea how to fix it?

ajdapretnar commented 2 years ago

@Katzengurke When you say Orange crashes, do you mean the software or the topic modelling widget? Could you perhaps open Orange Command Prompt, run python -m Orange.canvas and try the workflow that results in a crash? Then copy and paste the log here, please.
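When the console closes or the process dies before the log can be copied, one option is to send the output to a file instead. The sketch below is a suggestion, not something proposed in the thread: `run_and_log` and `orange_canvas_command` are made-up helper names, and the `-X faulthandler` option (a standard CPython flag that prints a traceback on a hard crash) is added on the assumption that it might help. Whether it captures anything useful for this particular crash is not known.

```python
import subprocess
import sys


def run_and_log(command, log_path):
    """Run `command`, writing stdout and stderr to `log_path`; return the exit code."""
    with open(log_path, "wb") as log:
        completed = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode


def orange_canvas_command():
    """Command line that starts Orange with the current Python interpreter."""
    return [sys.executable, "-X", "faulthandler", "-m", "Orange.canvas"]


# Example (not executed here): start Orange and keep everything it prints.
# exit_code = run_and_log(orange_canvas_command(), "orange_log.txt")
# print("Orange exited with code", exit_code)
```

Run from the Orange Command Prompt so that `sys.executable` points at the Python that Orange uses; the resulting text file can then be attached to the issue.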

Katzengurke commented 2 years ago

The software crashes. I tried running it from the command prompt, but I only get the message "Python stopped working".

Edit: Tried the same on a Windows 10 laptop where I didn't tamper with any files whatsoever, and it works there. Is it maybe Windows 11?

ajdapretnar commented 2 years ago

It might be. Does it happen even if you uninstall orange3-text add-on?

ajdapretnar commented 2 years ago

Can you try running python in Orange Command Prompt and let me know the Python version it reports? I'll check if any known Python+Win11 bugs exist.

Katzengurke commented 2 years ago

ajdapretnar wrote: "It might be. Does it happen even if you uninstall orange3-text add-on?"

Yes, sadly. I'll uninstall my Python, my Anaconda and Orange, and then try again with a new installation, and let you know in a couple of minutes.

ajdapretnar wrote: "Can you try running python in Orange Command Prompt and let me know the Python version it reports? I'll check if any known Python+Win11 bugs exist."

Same with that.

ajdapretnar commented 2 years ago

Ok, so it is a Python bug not an Orange bug.

Does running python --version work?

Katzengurke commented 2 years ago

Alright, I got a bit farther, but the bug got stranger too. I uninstalled everything like I said, and reinstalled Orange 3.32 and text from source. Afterwards I tried connecting a corpus widget to topic modelling, but Python stopped working again. I went for python -m Orange.canvas in order to get the log (which works now), but once I tried getting the log it stopped working again, and changed my resolution settings.

Python is 3.8.8

Edit: I actually got a log in the shell, although only for the topic modeler without the connection to the corpus I think

C:\Users\xxx\AppData\Local\Programs\Orange\lib\site-packages\xgboost\compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index
C:\Users\xxx\AppData\Local\Programs\Orange\lib\site-packages\orangecanvas\scheme\link.py:74: RuntimeWarning: Failed to resolve name 'orangecontrib.network.Network' to a type: ModuleNotFoundError: No module named 'orangecontrib.network'
  return tuple(filter(None, resolve_types(types)))

Edit 2: Yep, retried it a couple of times, no chance to get the log of the crash. As soon as I connect corpus to topic modeler, "Python stopped working"

nadiaelen commented 2 years ago

So, the same thing remains: even with the new gensim library, it runs into the same problem, it just hangs forever, after all install, deinstall, etc...:(

Katzengurke commented 2 years ago

The fix #885 definitely works under Windows 10. I was able to install it and work with it, without crashes so far. You have to use the source installation with git and the Orange Command Prompt. It doesn't work on my Windows 11 laptop, though.

Edit: Never mind, it works now. I have to open Orange via python -m Orange.canvas, though. I also uninstalled a bunch of programs such as node.js.

JosieVor commented 1 year ago

Hey,

I am using Orange Data Mining for my master's thesis. However, when I try topic modelling, it crashes. I have already read this thread, but it was a bit too technical for me to understand, since I really do not have any experience with Orange or Python. Can someone help me out?

My environment: Windows 11 Home 64Bit

nadiaelen commented 1 year ago

I really love Orange and appreciate you, your work and everything, but, truly, when it comes to topic modelling, which is the hottest topic right now, one week it works, the next week it crashes and stays like that for a month...

PrimozGodec commented 1 year ago

@JosieVor and @nadiaelen, sorry for the late response. I tried to reproduce the error on MacOS and Windows, and it works for me. Can you please give me more information so I can dig deeper into the problem?

Thank you in advance.

JosieVor commented 1 year ago

I am using a Twitter dataset, which I scraped directly in Orange. My dataset is quite big (about 30,000 tweets), but I have also tried topic modeling on smaller datasets (about 100 tweets), and it still did not work.

I am using LDA, which crashes every time. LSI sometimes works, but more often than not it does not work, either.

Text Add-On Version: 1.12.0

Orange Version: 3.34.0

Gensim Version: the pip freeze command does not work for me, but when I use pip list, it says my version is 4.1.2

calliope212 commented 1 year ago

Hi, I'm using Windows 10 and have the same issue. I already tried pip freeze in the Orange Command Prompt, but it still crashed. I also tried reinstalling, with the same result. Is there any other solution?

Orange version: 3.34.0
Text Add-On version: 1.12.1
Gensim: 4.1.2

PrimozGodec commented 1 year ago

Thank you @JosieVor and @calliope212, for the additional information. We noticed that the newest release didn't support gensim>=4.3.0 (on the master branch we had already switched to >=4.3.0).

We fixed the release. Can you please update the Text addon to version 1.12.2 and try again? Please let us know if it helps.

calliope212 commented 1 year ago

Thank you @PrimozGodec for the suggestion. I tried it but, unfortunately, it still won't work. When I check pip list, it says my gensim version is 4.3.1. Does that affect the result?

sohbl commented 1 year ago

Hi, I also have similar issues when I run Topic Modelling: it hangs for more than 2 minutes and I ultimately have to kill it. I'm running on Windows 11.

Orange version = 3.34.0
Text Add-on version = 1.13.1
Gensim version = 4.3.0

Another laptop has the same problem. It is running on Windows 10.

Orange version = 3.34.0
Text Add-on version = 1.13.1
Gensim version = 4.3.1

Is there any solution for this problem?

IzaClaro commented 1 year ago

Hi! I am also having this problem. Whenever I run the Topic Modelling widget, the program crashes and ultimately I have to force a shutdown of my computer, as everything seems to fail afterwards! I am running Windows 11 as well.

Orange version = 3.35
Gensim version = 4.3.0

Do you already have any solution for this problem? Thanks.