bigartm / bigartm

Fast topic modeling platform
http://bigartm.org/

ASCII encoding problem when running .fit_offline #664

Closed ValeraSarapas closed 8 years ago

ValeraSarapas commented 8 years ago
  1. Try to run "Демострация BigARTM (версия 0.8.0).ipynb" from https://www.coursera.org/learn/unsupervised-learning/supplement/suSWG/noutbuk-iz-diemonstratsii-ispol-zovaniia-bigartm
  2. Use artm.version() 0.8.1
  3. Run model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=40)
  4. Get the following error:

C:\Coursera\Anaconda2\lib\site-packages\protobuf-2.5.1rc0-py2.7.egg\google\protobuf\internal\type_checkers.pyc in CheckValue(self, proposed_value)
    126         'encoding. Non-ASCII strings must be converted to '
    127         'unicode objects before being added.' %
--> 128         (proposed_value))

ValueError: 'c:\Coursera\week4\school_batches\aaaaaa.batch' has type str, but isn't in 7-bit ASCII encoding. Non-ASCII strings must be converted to unicode objects before being added.

What could be the problem?

ofrei commented 8 years ago

@ValeraSarapas Overall, BigARTM requires all input to be UTF-8 encoded. Sometimes it is sufficient to add a # -*- coding: utf-8 -*- line at the top of your Python script. In some cases you need to use str.decode('utf-8') to decode strings before they are passed to BigARTM. At the moment I have no access to the IPython notebook from Coursera that you mentioned, so I can't give more precise instructions on how to fix this...
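
For illustration, a minimal sketch of that decoding advice; the helper name and the path are hypothetical, not part of the BigARTM API:

```python
# -*- coding: utf-8 -*-
# Hypothetical helper (not part of BigARTM): coerce byte strings to unicode
# so that protobuf string fields accept non-ASCII paths.
def to_unicode(value, encoding='utf-8'):
    """Decode byte strings; pass unicode strings through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# A path containing Cyrillic characters survives the conversion intact.
batch_path = to_unicode(u'c:\\Курсера\\week4\\batches'.encode('utf-8'))
print(batch_path)
```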

@nadiinchi are you familiar with this problem to help @ValeraSarapas ?

ValeraSarapas commented 8 years ago

@ofrei Thank you for your response. Could you please check whether the problem can be reproduced with bigartm/bigartm-book/blob/master/applications/multiple_social_networks/FRUCT_workshop.ipynb? That notebook uses the same data, and I get the same error with it.

ofrei commented 8 years ago

@ValeraSarapas I've tried FRUCT_workshop.ipynb but the error didn't reproduce on my Windows 10 machine. May I ask you to

Also, I'm a bit suspicious about the lowercase c:\ in c:\Coursera\week4\school_batches\aaaaaa.batch -- are you sure the c is a Latin letter (not Cyrillic)?
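
A quick way to check for that is a snippet like the following (hypothetical, not from the notebook), which lists any non-ASCII characters hiding in a path:

```python
# Hypothetical check: list any non-ASCII characters lurking in a string,
# e.g. a Cyrillic 'с' that looks identical to the Latin 'c'.
def non_ascii_chars(s):
    return [(i, ch) for i, ch in enumerate(s) if ord(ch) > 127]

print(non_ascii_chars(u'c:\\Coursera'))       # pure ASCII -> empty list
print(non_ascii_chars(u'\u0441:\\Coursera'))  # Cyrillic letter at index 0
```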

ValeraSarapas commented 8 years ago

@ofrei Hi, I tried starting the FRUCT_workshop.ipynb notebook in Chrome (I used IE11 on Win7 before). I no longer get the ASCII encoding error, but model_artm stops computing the perplexity score:

model_artm.score_tracker["PerplexityScore"].value
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

It seems something is wrong with my installed libraries.

nadiinchi commented 8 years ago

Valera, what is your version of python?

ValeraSarapas commented 8 years ago

I did not change anything in the notebook code. The code for batch_vectorizer:

source_file = os.path.join("data", "lectures.txt")
batches_folder = "lectures_batches"
if not glob.glob(os.path.join(batches_folder, "*")):
    batch_vectorizer = artm.BatchVectorizer(data_path=source_file,
                                            data_format="vowpal_wabbit",
                                            target_folder=batches_folder,
                                            batch_size=100)
else:
    batch_vectorizer = artm.BatchVectorizer(data_path=batches_folder,
                                            data_format='batches')

Code for the model fit:

model_artm.initialize(dictionary)
%time model_artm.fit_offline(batch_vectorizer=batch_vectorizer, \
                             num_collection_passes=30)

The last few lines from my log:

I1026 19:16:39.892626 12024 cuckoo_watch.h:44] 102ms in ProcessBatch(C:\Coursera\week4\lectures_batches\aaaaak.batch) [including 7ms in LoadMessage; 11ms in InitializeSparseNdw; 68ms in InferThetaAndUpdateNwtSparse; 10ms in CalculateScore(PerplexityScore); 4ms in CalculateScore(SparsityThetaScore); ]
I1026 19:16:39.897625 9696 processor.cc:662] No data in processing queue, waiting...
I1026 19:16:39.897625 7300 processor.cc:662] No data in processing queue, waiting...
I1026 19:16:39.897625 11612 processor.cc:842] Processor: complete processing batch 659755e4-a74d-470a-8af8-3df4b6716909 into model nwt
I1026 19:16:39.898625 11612 cuckoo_watch.h:44] 105ms in ProcessBatch(C:\Coursera\week4\lectures_batches\aaaaam.batch) [including 7ms in LoadMessage; 18ms in InitializeSparseNdw; 66ms in InferThetaAndUpdateNwtSparse; 7ms in CalculateScore(PerplexityScore); 4ms in CalculateScore(SparsityThetaScore); ]
I1026 19:16:39.899626 12272 master_component.cc:991] NormalizeModelArgs: pwt_target_name=pwt, nwt_source_name=nwt, rwt_source_name=
I1026 19:16:39.899626 12272 master_component.cc:617] MasterComponent: start normalizing model nwt
I1026 19:16:39.904626 3892 processor.cc:662] No data in processing queue, waiting...
I1026 19:16:39.911628 12024 processor.cc:662] No data in processing queue, waiting...
I1026 19:16:39.917628 11612 processor.cc:662] No data in processing queue, waiting...
I1026 19:16:39.930629 12272 master_component.cc:639] MasterComponent: complete normalizing model nwt
I1026 19:16:39.988634 12272 master_component.cc:1015] DisposeModel rwt
I1026 19:16:46.801316 12272 c_interface.cc:180] ArtmCopyRequestedMessage is copying 1560 bytes...
I1026 19:16:54.089045 12272 c_interface.cc:180] ArtmCopyRequestedMessage is copying 1560 bytes...
I1026 19:24:10.426674 12272 c_interface.cc:180] ArtmCopyRequestedMessage is copying 1560 bytes...

ofrei commented 8 years ago

@ValeraSarapas Great that Chrome fixed the encoding issue. I didn't know it could be caused by the IE browser.

So let's focus on the nan-perplexity issue. Have you tried outputting the SparsityThetaScore, the phi matrix, and the diagnostics info?

print "Theta sparsity:", model_artm.score_tracker["SparsityThetaScore"].last_value
print model_artm.get_phi()
print model_artm.info
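
As a side note, an all-zero phi matrix can also be detected programmatically. A self-contained sketch of the check, where plain nested lists stand in for the pandas DataFrame that model_artm.get_phi() would return:

```python
# Hypothetical diagnostic: an all-zero phi matrix makes perplexity nan,
# so flag it before digging into the scores. Nested lists stand in for
# the token-by-topic DataFrame returned by model_artm.get_phi().
def phi_is_degenerate(rows):
    """Return True if every probability in the matrix is zero."""
    return all(v == 0.0 for row in rows for v in row)

zero_phi = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
healthy_phi = [[0.2, 0.5, 0.3], [0.1, 0.0, 0.9]]
print(phi_is_degenerate(zero_phi))     # True
print(phi_is_degenerate(healthy_phi))  # False
```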
ValeraSarapas commented 8 years ago

@nadiinchi I use Python 2.7.12 |Anaconda custom (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)] with IPython 5.1.0.

ValeraSarapas commented 8 years ago

@ofrei The output of the code is the following:

Theta sparsity: 1.0

The phi matrix -- every printed value is 0.0 (abbreviated; pandas prints the 30 topic columns in four blocks of identical all-zero rows):

                  topic_0  topic_1  topic_2  ...  topic_27  topic_28  topic_29
мифичность            0.0      0.0      0.0  ...       0.0       0.0       0.0
консолидировать       0.0      0.0      0.0  ...       0.0       0.0       0.0
расчет                0.0      0.0      0.0  ...       0.0       0.0       0.0
...                   ...      ...      ...  ...       ...       ...       ...
обледенение           0.0      0.0      0.0  ...       0.0       0.0       0.0
настораживаться       0.0      0.0      0.0  ...       0.0       0.0       0.0

[26866 rows x 30 columns]

The diagnostics info:

config {
  topic_name: "topic_0"
  topic_name: "topic_1"
  ... (topic_2 through topic_29)
  class_id: "@text"
  class_id: "@author"
  class_weight: 1.0
  class_weight: 5.0
  score_config { name: "PerplexityScore" type: ScoreType_Perplexity config: "" model_name: "pwt" }
  score_config { name: "SparsityPhiScore" type: ScoreType_SparsityPhi config: "\022\005@text" model_name: "pwt" }
  score_config { name: "SparsityThetaScore" type: ScoreType_SparsityTheta config: "" model_name: "pwt" }
  score_config { name: "top_words" type: ScoreType_TopTokens config: "\010\017\022\005@text" model_name: "pwt" }
  pwt_name: "pwt"
  nwt_name: "nwt"
  num_document_passes: 10
  reuse_theta: false
  cache_theta: true
}
score { name: "PerplexityScore" type: "class artm::score::Perplexity" }
score { name: "SparsityPhiScore" type: "class artm::score::SparsityPhi" }
score { name: "SparsityThetaScore" type: "class artm::score::SparsityTheta" }
score { name: "top_words" type: "class artm::score::TopTokens" }
dictionary { name: "4c27a97b-194f-4978-9140-cbe357781674" num_entries: 26866 }
dictionary { name: "dictionary" num_entries: 26866 }
model { name: "nwt" type: "class artm::core::DensePhiMatrix" num_topics: 30 num_tokens: 26866 }
model { name: "pwt" type: "class artm::core::DensePhiMatrix" num_topics: 30 num_tokens: 26866 }
cache_entry { key: "0276868a-54c4-40c5-93a1-3efc2ad83f4e" byte_size: 16323 }
... (17 more cache_entry records)
processor_queue_size: 0
num_processors: 2

ofrei commented 8 years ago

@ValeraSarapas Thanks, I see -- in this case all values in the phi matrix are zeros. This is not expected, but at least it explains the nan perplexity. Everything else, apart from the zero phi matrix, appears to be correct.

Since you had issues with unicode, my first suspect is these lines from the tutorial:

dict_name = os.path.join(batches_folder, "dict.txt")
dictionary = artm.Dictionary(name="dictionary")
if not os.path.exists(dict_name):
    dictionary.gather(batches_folder)
    dictionary.save_text(dict_name)
else:
    dictionary.load_text(dict_name)

Here the dictionary is gathered from batches and then saved as text. I'm not sure how robust the save_text and load_text methods are -- they are not part of the BigARTM core functionality, but rather implemented directly in the Python API.
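
The suspected failure mode can be sketched in isolation: a text round-trip only preserves non-ASCII tokens when both sides agree on the codec. This standalone example (the token and file name are illustrative, not BigARTM code) shows the clean case:

```python
# -*- coding: utf-8 -*-
# Sketch of the suspected failure mode: a dictionary token written to a
# text file and read back survives only if writer and reader agree on
# the encoding. Here both use UTF-8, so the round-trip is lossless.
import io
import os
import tempfile

token = u'\u0440\u0430\u0437\u0432\u0438\u0442\u0438\u0435'  # 'развитие'
path = os.path.join(tempfile.mkdtemp(), 'dict.txt')

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(token)

with io.open(path, 'r', encoding='utf-8') as f:
    restored = f.read()

print(restored == token)  # True -- matching codecs round-trip cleanly
```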

Could you please try to replace the code above with

dictionary = artm.Dictionary(name="dictionary")
dictionary.gather(batches_folder)

and let me know whether this solves the issue?

ValeraSarapas commented 8 years ago

@ofrei Tried your code. The issue still appears :(

ofrei commented 8 years ago

@ValeraSarapas Thanks for a very useful discussion in this issue! On closer inspection we've identified several important bugs that I'll fix for v0.8.2:

ofrei commented 8 years ago

Issues listed above are fixed by these commits: https://github.com/bigartm/bigartm/commit/0ccf5dbaa9b892cbfe9243dde458b82b78581ac0 https://github.com/bigartm/bigartm/commit/4b654283c04bb468eb15f8474b08ec8003623759 https://github.com/bigartm/bigartm/commit/a19180c93ae524523d9355221f8821b8e47c350a