alicia-ziying-yang / conTEXT-explorer

ConTEXT Explorer is an open web-based system for exploring and visualizing concepts (combinations of co-occurring words and phrases) over time in text documents.
Apache License 2.0

Problem uploading custom dataset - (RuntimeError: you must first build vocabulary before training the model) #19

Closed: baileythegreen closed this issue 2 years ago

baileythegreen commented 2 years ago

I am getting the following error when I try to upload a custom dataset (attached, pride_chapters.csv) to the dashboard. Everything appears to work: I hit 'upload', it shows the loading spinner, and then instead of a new tab opening it just returns to the upload page as if nothing happened.

In case this was caused by something on my computer, I moved to a different one and started from scratch in a new clone of the repo. I get the same thing there.

JOSS Reference: openjournals/joss-reviews#3347

127.0.0.1 - - [15/Nov/2021 11:38:55] "POST /_dash-update-component HTTP/1.1" 500 -
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/baileythegreen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
1. Converting document to words for /selected_content_pride_chapters.pkl ... 2021-11-15 11:39:08
Converting doc to word time: 9.298324584960938e-06
2. Building the bigram model for /selected_content_pride_chapters.pkl ... 2021-11-15 11:39:08
Building Bigram: 0.00036025047302246094
3. Building the bigram model for /selected_content_pride_chapters.pkl ... 2021-11-15 11:39:08
Building Bigram Model: 2.09808349609375e-05
4. Removing stop words for /selected_content_pride_chapters.pkl ... 2021-11-15 11:39:08
Time spent on removing stopwords: 3.814697265625e-06
5. Forming bigrams for /selected_content_pride_chapters.pkl ... 2021-11-15 11:39:08
Time spent on forming bigrams: 2.1457672119140625e-06
6. Lemmatizing /selected_content_pride_chapters.pkl ... 2021-11-15 11:39:08
Time spent on lemmatizing: 3.0994415283203125e-06
7. Writing into pickle... 2021-11-15 11:39:08
Total process time for one document 0.1347498893737793 2021-11-15 11:39:08
Start Reading: 2021-11-15 11:39:08
0
Read time: 0.11266112327575684
Start Reading: 2021-11-15 11:39:08
length of data: 0 ; length of corpus 0
Shape of the corpus in this iteration: (1, 1)
Total time: 0.0015909671783447266
Exception on /_dash-update-component [POST]
Traceback (most recent call last):
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/dash/dash.py", line 1050, in dispatch
    response.set_data(func(*args, outputs_list=outputs_list))
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/dash/dash.py", line 985, in add_context
    output_value = func(*args, **kwargs)  # %% callback invoked %%
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/app.py", line 2346, in uploading
    get3 = word2vec.train_model(corpus_name)
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/topic_model/word2vec.py", line 22, in train_model
    iter=20  # number of iterations over the corpus
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/gensim/models/word2vec.py", line 783, in __init__
    fast_version=FAST_VERSION)
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 763, in __init__
    end_alpha=self.min_alpha, compute_loss=compute_loss)
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/gensim/models/word2vec.py", line 910, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 1081, in train
    **kwargs)
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 536, in train
    total_words=total_words, **kwargs)
  File "/Users/baileythegreen/Software/miniconda3/envs/ce-env/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 1187, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model
127.0.0.1 - - [15/Nov/2021 11:39:08] "POST /_dash-update-component HTTP/1.1" 500 -
alicia-ziying-yang commented 2 years ago

Hi @baileythegreen ,

I used your .csv, and it worked well for me without any error. I am trying to figure out why you got those errors. Could you please check:

I saw you got length of data: 0 ; length of corpus 0 in your log, which means the app read nothing. In my log, using your dataset, I got length of data: 5 ; length of corpus 5. The runtime error says the same thing: there is nothing from which to build the vocabulary.
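For reference, the failure mode can be sketched without gensim: if the selected "Sentences" column yields no documents, the vocabulary comes out empty and training cannot proceed. This is a minimal illustration with hypothetical names, not the app's actual code:

```python
def train_model(corpus):
    """Sketch of a pre-training sanity check (hypothetical; not the app's code).

    `corpus` is a list of tokenized documents, e.g. [["pride", "prejudice"], ...].
    """
    # Build the vocabulary from the corpus.
    vocab = {word for doc in corpus for word in doc}
    if not vocab:
        # Mirrors gensim's check: with an empty corpus there is no
        # vocabulary, so training cannot start.
        raise RuntimeError("you must first build vocabulary before training the model")
    return vocab

# An empty corpus ("length of data: 0") reproduces the error:
try:
    train_model([])
except RuntimeError as e:
    print(e)  # you must first build vocabulary before training the model
```

So the RuntimeError is a downstream symptom; the real question is why the upload step produced zero documents.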

May I ask what you set in Step 2 when uploading? What did you choose for the "Sentences" field? This is my setting:

[Screenshot: upload settings — Screen Shot 2021-11-16 at 1 21 31 pm]

Could you please try again using this setting?

alicia-ziying-yang commented 2 years ago

Your dataset uploaded fine with the settings above. In Step 3, the name of the corpus is "pride_chapters". @baileythegreen

[Screenshot: Step 3 corpus name — Screen Shot 2021-11-16 at 1 24 11 pm]
baileythegreen commented 2 years ago

Okay, I think I must not have been paying enough attention yesterday. I registered that the Year field was pointing at the Year column, and that the ID field was not pointing at any column; I set that one to Chapter, but apparently the Sentences field escaped my notice.

Some kind of visible warning, error message, pop-up, et cetera in the dashboard might be helpful for users, for this or other issues. I know to look at the terminal window, but your users might not.
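One way to do this, sketched here as a plain function with hypothetical names since the app's actual callback isn't shown, is to catch the exception in the upload callback and return a user-visible message instead of letting the request fail with a 500:

```python
def handle_upload(corpus, train_model):
    """Hypothetical sketch of an upload handler that surfaces errors in the UI.

    In a real Dash app this would live inside the upload callback, returning
    the message to a component in the layout instead of letting the POST
    fail silently with a 500.
    """
    try:
        train_model(corpus)
    except RuntimeError as e:
        # Return a visible error string rather than failing silently.
        return f"Upload failed: {e}. Check the 'Sentences' field in Step 2."
    return "Upload succeeded."

def fake_train(corpus):
    # Stand-in for the real training call, for illustration only.
    if not corpus:
        raise RuntimeError("you must first build vocabulary before training the model")

print(handle_upload([], fake_train))
```

The key point is that the exception is converted into something the dashboard can display, so users don't need to watch the terminal.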

Anyway, everything is working now and I feel a bit silly. The new installation process is very nice, though!