@ValeraSarapas Overall, BigARTM requires all input to be in UTF-8 encoding. Sometimes it is sufficient to add a `# -*- coding: utf-8 -*-` line at the top of your Python script. In other cases you need to use `str.decode('utf-8')` to decode strings before they are passed to BigARTM. At the moment I have no access to the ipython notebook from Coursera that you mentioned, so I can't give more precise instructions on how to fix this...
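Under Python 2, the two workarounds look roughly like this (a minimal sketch; the path and the BatchVectorizer call are placeholders, not taken from your notebook):

```python
# -*- coding: utf-8 -*-
import artm

# Hypothetical path, modeled on the one in the error message.
batches_path = 'c:\\Coursera\\week4\\school_batches'

# BigARTM expects UTF-8/unicode input: decode byte strings before passing them in.
if isinstance(batches_path, str):
    batches_path = batches_path.decode('utf-8')

batch_vectorizer = artm.BatchVectorizer(data_path=batches_path, data_format='batches')
```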
@nadiinchi are you familiar with this problem? Could you help @ValeraSarapas?
@ofrei Thank you for your response. Could you please check whether the problem can be reproduced with bigartm/bigartm-book/blob/master/applications/multiple_social_networks/FRUCT_workshop.ipynb? That notebook uses the same data, and I get the same error with it.
@ValeraSarapas I've tried FRUCT_workshop.ipynb, but the error didn't reproduce on my Windows 10 machine. May I ask you to ...
Also, I'm a bit suspicious about the lowercase c:\ in c:\Coursera\week4\school_batches\aaaaaa.batch -- are you sure the letter c in c:\ is Latin (not a look-alike Cyrillic character)?
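One quick way to spot a look-alike Cyrillic letter is to scan the path for non-ASCII characters (a small sketch, assuming Python 2 as used in this thread):

```python
# -*- coding: utf-8 -*-
path = u'c:\\Coursera\\week4\\school_batches\\aaaaaa.batch'  # paste your actual path here
for i, ch in enumerate(path):
    if ord(ch) > 127:
        # e.g. Cyrillic 'с' (U+0441) renders exactly like Latin 'c' (U+0063)
        print 'non-ASCII character %r at position %d' % (ch, i)
```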
@ofrei Hi, I tried running the FRUCT_workshop.ipynb notebook in Chrome (I used IE11 on Win7 before). I no longer get the ASCII encoding error, but now model_artm stops calculating the perplexity score:
model_artm.score_tracker["PerplexityScore"].value
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
 nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
 nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]
It seems something is wrong with my installed libraries.
Valera, what is your version of Python?
I did not change anything in the notebook code. Code for batch_vectorizer:

source_file = os.path.join("data", "lectures.txt")
batches_folder = "lectures_batches"
if not glob.glob(os.path.join(batches_folder, "*")):
    batch_vectorizer = artm.BatchVectorizer(data_path=source_file,
                                            data_format="vowpal_wabbit",
                                            target_folder=batches_folder,
                                            batch_size=100)
else:
    batch_vectorizer = artm.BatchVectorizer(data_path=batches_folder,
                                            data_format='batches')
Code for the model fit:

model_artm.initialize(dictionary)
%time model_artm.fit_offline(batch_vectorizer=batch_vectorizer,
                             num_collection_passes=30)
Last few lines from my log:

I1026 19:16:39.892626 12024 cuckoo_watch.h:44] 102ms in ProcessBatch(C:\Coursera\week4\lectures_batches\aaaaak.batch) [including 7ms in LoadMessage; 11ms in InitializeSparseNdw; 68ms in InferThetaAndUpdateNwtSparse; 10ms in CalculateScore(PerplexityScore); 4ms in CalculateScore(SparsityThetaScore); ]
I1026 19:16:39.897625 9696 processor.cc:662] No data in processing queue, waiting...
I1026 19:16:39.897625 7300 processor.cc:662] No data in processing queue, waiting...
I1026 19:16:39.897625 11612 processor.cc:842] Processor: complete processing batch 659755e4-a74d-470a-8af8-3df4b6716909 into model nwt
I1026 19:16:39.898625 11612 cuckoo_watch.h:44] 105ms in ProcessBatch(C:\Coursera\week4\lectures_batches\aaaaam.batch) [including 7ms in LoadMessage; 18ms in InitializeSparseNdw; 66ms in InferThetaAndUpdateNwtSparse; 7ms in CalculateScore(PerplexityScore); 4ms in CalculateScore(SparsityThetaScore); ]
I1026 19:16:39.899626 12272 master_component.cc:991] NormalizeModelArgs: pwt_target_name=pwt, nwt_source_name=nwt, rwt_source_name=
I1026 19:16:39.899626 12272 master_component.cc:617] MasterComponent: start normalizing model nwt
I1026 19:16:39.904626 3892 processor.cc:662] No data in processing queue, waiting...
I1026 19:16:39.911628 12024 processor.cc:662] No data in processing queue, waiting...
I1026 19:16:39.917628 11612 processor.cc:662] No data in processing queue, waiting...
I1026 19:16:39.930629 12272 master_component.cc:639] MasterComponent: complete normalizing model nwt
I1026 19:16:39.988634 12272 master_component.cc:1015] DisposeModel rwt
I1026 19:16:46.801316 12272 c_interface.cc:180] ArtmCopyRequestedMessage is copying 1560 bytes...
I1026 19:16:54.089045 12272 c_interface.cc:180] ArtmCopyRequestedMessage is copying 1560 bytes...
I1026 19:24:10.426674 12272 c_interface.cc:180] ArtmCopyRequestedMessage is copying 1560 bytes...
@ValeraSarapas Great that Chrome fixed the encoding issue; I didn't know it could depend on the browser.
So let's focus on the perplexity nan issue. Have you tried outputting the SparsityThetaScore score, the phi matrix, and the diagnostics info?
print "Theta sparsity:", model_artm.score_tracker["SparsityThetaScore"].last_value
print model_artm.get_phi()
print model_artm.info
@nadiinchi I use Python 2.7.12 |Anaconda custom (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)] with IPython 5.1.0.
@ofrei The output of the code is the following:
Theta sparsity: 1.0

                  topic_0  topic_1  topic_2  ...  topic_27  topic_28  topic_29
мифичность            0.0      0.0      0.0  ...       0.0       0.0       0.0
консолидировать       0.0      0.0      0.0  ...       0.0       0.0       0.0
расчет                0.0      0.0      0.0  ...       0.0       0.0       0.0
насчет                0.0      0.0      0.0  ...       0.0       0.0       0.0
мочь                  0.0      0.0      0.0  ...       0.0       0.0       0.0
...                   ...      ...      ...  ...       ...       ...       ...
веление               0.0      0.0      0.0  ...       0.0       0.0       0.0
заросль               0.0      0.0      0.0  ...       0.0       0.0       0.0
везучий               0.0      0.0      0.0  ...       0.0       0.0       0.0
обледенение           0.0      0.0      0.0  ...       0.0       0.0       0.0
настораживаться       0.0      0.0      0.0  ...       0.0       0.0       0.0

[26866 rows x 30 columns]

config {
  topic_name: "topic_0"
  topic_name: "topic_1"
  ...
  topic_name: "topic_29"
  class_id: "@text"
  class_id: "@author"
  class_weight: 1.0
  class_weight: 5.0
  score_config { name: "PerplexityScore" type: ScoreType_Perplexity config: "" model_name: "pwt" }
  score_config { name: "SparsityPhiScore" type: ScoreType_SparsityPhi config: "\022\005@text" model_name: "pwt" }
  score_config { name: "SparsityThetaScore" type: ScoreType_SparsityTheta config: "" model_name: "pwt" }
  score_config { name: "top_words" type: ScoreType_TopTokens config: "\010\017\022\005@text" model_name: "pwt" }
  pwt_name: "pwt"
  nwt_name: "nwt"
  num_document_passes: 10
  reuse_theta: false
  cache_theta: true
}
score { name: "PerplexityScore" type: "class artm::score::Perplexity" }
score { name: "SparsityPhiScore" type: "class artm::score::SparsityPhi" }
score { name: "SparsityThetaScore" type: "class artm::score::SparsityTheta" }
score { name: "top_words" type: "class artm::score::TopTokens" }
dictionary { name: "4c27a97b-194f-4978-9140-cbe357781674" num_entries: 26866 }
dictionary { name: "dictionary" num_entries: 26866 }
model { name: "nwt" type: "class artm::core::DensePhiMatrix" num_topics: 30 num_tokens: 26866 }
model { name: "pwt" type: "class artm::core::DensePhiMatrix" num_topics: 30 num_tokens: 26866 }
cache_entry { key: "0276868a-54c4-40c5-93a1-3efc2ad83f4e" byte_size: 16323 }
cache_entry { key: "07296a64-6d0a-4854-8a2b-dbd96e09f253" byte_size: 16319 }
cache_entry { key: "0dc6bf52-310f-4ae4-bb85-261094580650" byte_size: 16327 }
cache_entry { key: "322ac5a6-9332-4d8c-bbc0-c3059a8604e5" byte_size: 16311 }
cache_entry { key: "38b49ed7-07d2-4da0-b22b-2c08984a9bd9" byte_size: 16300 }
cache_entry { key: "48fdd43f-28fe-4739-b3e9-e040a19948b4" byte_size: 16230 }
cache_entry { key: "49ae595b-70f2-4d81-ae1b-7be51cd6af2b" byte_size: 16328 }
cache_entry { key: "4bf7c340-2fd4-4302-82c7-6c59ec9b1040" byte_size: 16307 }
cache_entry { key: "4cab0047-1919-49aa-a91a-c60277e65456" byte_size: 16320 }
cache_entry { key: "78915d13-ca7e-4bc7-b71e-ff6f2314c450" byte_size: 16320 }
cache_entry { key: "85fd85c4-767d-4166-8019-3bf0a7bf530b" byte_size: 16319 }
cache_entry { key: "88516cb9-521e-4940-85e6-3eeb03a6745c" byte_size: 4780 }
cache_entry { key: "9a9da5e9-eccc-47d9-b0db-e54c525d7b84" byte_size: 16312 }
cache_entry { key: "9af19668-54fe-4887-bfe4-55c9f17e46a9" byte_size: 16297 }
cache_entry { key: "9f207121-4dea-44e2-b3c0-f9d460f7111c" byte_size: 16307 }
cache_entry { key: "ae0f4dcb-382d-45b4-9e36-c989d307352c" byte_size: 16322 }
cache_entry { key: "b7d628af-9593-4abd-b895-d333e079ff58" byte_size: 16328 }
cache_entry { key: "b8e40c50-3543-4a7e-b4f9-c2b42728ac04" byte_size: 16228 }
processor_queue_size: 0
num_processors: 2
@ValeraSarapas Thanks, I see - in this case all values in the phi matrix are zeros. This is not expected, but at least it explains the nan perplexity. Everything else, apart from the zero phi matrix, appears to be correct.
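To see why an all-zero phi makes perplexity nan: p(w|d) = sum_t phi_wt * theta_td is zero for every token, so the log-likelihood term is undefined. An illustrative toy computation (not BigARTM internals):

```python
import math

p_wd = 0.0  # p(w|d) = sum_t phi_wt * theta_td == 0 when phi is all zeros
try:
    log_p = math.log(p_wd)  # log(0) raises: the likelihood degenerates
except ValueError:
    log_p = float('nan')
print log_p  # nan, hence perplexity exp(-sum(log p)/n) is nan as well
```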
Since you had issues with unicode, my first suspect is these lines from the tutorial:
dict_name = os.path.join(batches_folder, "dict.txt")
dictionary = artm.Dictionary(name="dictionary")
if not os.path.exists(dict_name):
    dictionary.gather(batches_folder)
    dictionary.save_text(dict_name)
else:
    dictionary.load_text(dict_name)
Here the dictionary is gathered from the batches and then saved as text. I'm not sure how robust the save_text and load_text methods are -- they are not part of the BigARTM core functionality, but are implemented directly in the Python API.
Could you please try to replace the code above with
dictionary = artm.Dictionary(name="dictionary")
dictionary.gather(batches_folder)
and let me know if this solves the issue?
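To isolate whether the text round trip is the culprit, a comparison along these lines could also help (a sketch; the dict_a.txt / dict_b.txt file names are hypothetical):

```python
dict_a = artm.Dictionary(name="dict_a")
dict_a.gather(batches_folder)      # gathered directly from batches
dict_a.save_text("dict_a.txt")

dict_b = artm.Dictionary(name="dict_b")
dict_b.load_text("dict_a.txt")     # round trip through the text format
dict_b.save_text("dict_b.txt")

# If the two files differ, save_text/load_text is mangling entries
# (for example by re-encoding the non-ASCII tokens).
print open("dict_a.txt").read() == open("dict_b.txt").read()
```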
@ofrei Tried your code. Issue still appears :(
@ValeraSarapas Thanks for a very useful discussion in this issue! On closer look we've identified several important bugs that I'll fix for v0.8.2:

- os.path.abspath from here
- In initialize(), when BigARTM creates a model from a dictionary, it must use only the tokens of the specified modalities. Currently there is an issue: all tokens from the dictionary are included in the model, even if they are not part of model.class_id. If the resulting set of tokens is empty, initialize() must raise an exception (a user-side check for this failure mode is sketched below).
- In fit_offline() we need to raise an error if the data was effectively empty, e.g. if there are no batches, all batches have 0 items, all items have 0 tokens, or all tokens belong to irrelevant modalities. This is already tracked by ticket #500, but it hasn't been fixed yet.
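Until those fixes land, a cheap user-side guard against silently training an all-zero model could look like this (a sketch using the model_artm object from this thread, run right after initialize()):

```python
import numpy

phi = model_artm.get_phi()  # pandas DataFrame: tokens x topics
if not numpy.any(phi.values):
    raise RuntimeError('phi matrix is all zeros: the model was effectively '
                       'initialized with no tokens; check modalities and encoding')
```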
    126                 'encoding. Non-ASCII strings must be converted to '
    127                 'unicode objects before being added.' %
--> 128                 (proposed_value))

ValueError: 'c:\Coursera\week4\school_batches\aaaaaa.batch' has type str, but isn't in 7-bit ASCII encoding. Non-ASCII strings must be converted to unicode objects before being added.

What could be the problem?