Closed haoransh closed 6 years ago
Strange, how are you executing the code? Could you share a small trace so that I can reproduce?
This may be a bug I introduced with recent changes.
Please check the WWW branch https://github.com/flaviovdf/tribeflow/releases/tag/v-paper-www if you want to execute things quickly. I'll get back to you asap
I find it strange too; I just followed all the steps described in the readme. I also tried the WWW branch but hit a similar problem. I use conda as my Python environment manager, and my pipeline after getting the source code (WWW branch) is as follows:
conda create -n tribeflow python=2.7
source activate tribeflow
pip install numpy
pip install scipy
pip install cython
pip install pandas
pip install mpi4py
pip install plac
pip install enum34
conda install pytables #if pytables is not installed, pd.HDFStore cannot run correctly.
make
python setup.py install
python scripts/trace_converter.py scripts/test_parser.dat 1 0 2 -d$'\t' -f'%Y-%m-%dT%H:%M:%SZ' > trace.dat
mpiexec -np 20 python main.py trace.dat 100 output.h5 --kernel eccdf --residency_priors 1 99 --dynamic True --leaveout 0.3 --num_iter 2000 --num_batches 20
The error output is
/home/shr/RS/tribeflow-v-paper-www/tribeflow/dynamic.py:164: RuntimeWarning: invalid value encountered in true_divide
Theta_hz = Theta_hz / Theta_hz.sum(axis=0)
/home/shr/RS/tribeflow-v-paper-www/tribeflow/dynamic.py:168: RuntimeWarning: Degrees of freedom <= 0 for slice
C = np.cov(Theta_hz.T) + np.cov(Psi_sz.T)
/home/shr/anaconda2/envs/tribeflow/lib/python2.7/site-packages/numpy/lib/function_base.py:2929: RuntimeWarning: divide by zero encountered in double_scalars
c *= 1. / np.float64(fact)
/home/shr/anaconda2/envs/tribeflow/lib/python2.7/site-packages/numpy/lib/function_base.py:2929: RuntimeWarning: invalid value encountered in multiply
c *= 1. / np.float64(fact)
Traceback (most recent call last):
File "main.py", line 135, in <module>
main()
File "main.py", line 123, in main
args.num_batches, True, from_=from_, to=to)
File "/home/shr/RS/tribeflow-v-paper-www/tribeflow/dynamic.py", line 400, in fit
kernel.update_state(P)
File "tribeflow/kernels/eccdf.pyx", line 71, in tribeflow.kernels.eccdf.ECCDFKernel.update_state (tribeflow/kernels/eccdf.c:2733)
assert P.shape[0] == self.P.shape[0]
AssertionError
I wonder what is wrong with my steps; maybe it is due to a version mismatch between packages. Here are all the packages installed in the conda tribeflow environment:
(tribeflow) shr@dlibgpu:~/RS/tribeflow-v-paper-www$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
Cython (0.25.2)
enum34 (1.1.6)
jsonpickle (0.9.4)
mpi4py (2.0.0)
numexpr (2.6.2)
numpy (1.12.1)
pandas (0.19.2)
pip (9.0.1)
plac (0.9.6)
python-dateutil (2.6.0)
pytz (2017.2)
scipy (0.19.0)
setuptools (27.2.0)
six (1.10.0)
tables (3.4.2)
tqdm (4.11.2)
tribeflow (0.0.0)
wheel (0.29.0)
I hope you can reproduce the problem by following this procedure. Many thanks for your time!
The problem is with the trace.dat you are generating. When I created the example, I did not notice that the file contains only 1 user. Please try one of the files in the example folder.
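The "Degrees of freedom <= 0 for slice" warning in the log above is what np.cov emits when it receives a single observation. Below is a minimal sketch of that behaviour. It assumes (this is not confirmed by the code shown in this thread) that a one-user trace leaves the covariance call in dynamic.py with only one observation; the array is made up for illustration.

```python
import warnings

import numpy as np

# Hypothetical data: three variables (rows) but only one observation (column).
single_observation = np.array([[0.2], [0.3], [0.5]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cov = np.cov(single_observation)

# The unbiased estimate divides by (observations - 1) = 0,
# so every entry of the covariance matrix comes out as NaN.
all_nan = bool(np.isnan(cov).all())
messages = [str(w.message) for w in caught]
print(all_nan)
print(messages)
```

If the printed messages include the same warning as in the log, the input trace most likely has too few users for the covariance step.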
I have tried several of the example dat files but hit similar problems.
Here is one example:
mpiexec -np 20 python main.py example/lastfm_our.dat 100 lastfm_ocelma.output.h5 --kernel eccdf --residency_priors 1 99 --dynamic True --leaveout 0.3 --num_iter 2000 --num_batches 20
The output is as follows:
Worker 14 has finished it's iterations!
Worker 13 has finished it's iterations!
Split
Merge
/home/shr/RS/tribeflow-v-paper-www/tribeflow/dynamic.py:164: RuntimeWarning: invalid value encountered in true_divide
Theta_hz = Theta_hz / Theta_hz.sum(axis=0)
Traceback (most recent call last):
File "main.py", line 135, in <module>
main()
File "main.py", line 123, in main
args.num_batches, True, from_=from_, to=to)
File "/home/shr/RS/tribeflow-v-paper-www/tribeflow/dynamic.py", line 400, in fit
kernel.update_state(P)
File "tribeflow/kernels/eccdf.pyx", line 71, in tribeflow.kernels.eccdf.ECCDFKernel.update_state (tribeflow/kernels/eccdf.c:2733)
assert P.shape[0] == self.P.shape[0]
AssertionError
Just did
mpiexec -np 3 python main.py ~/example/lastfm_our.dat 10 output.h5 --kernel eccdf --residency_priors 1 99 --dynamic True --leaveout 0.3 --num_iter 2000 --num_batches 20
On both the master and www branches. Not sure what may be your issue.
With 20 workers as in your example
Worker 14 is working!
Worker 15 is working!
Worker 17 is working!
Worker 18 is working!
Worker 19 is working!
Worker 1 is working!
Worker 2 is working!
Worker 3 is working!
Worker 5 is working!
Worker 6 is working!
Worker 8 is working!
Worker 10 is working!
Worker 11 is working!
Worker 12 is working!
Worker 13 is working!
Worker 4 is working!
Worker 7 is working!
Worker 9 is working!
Worker 16 is working!
Worker 17 has finished it's iterations!
Worker 18 has finished it's iterations!
Worker 11 has finished it's iterations!
Worker 10 has finished it's iterations!
Worker 8 has finished it's iterations!
Worker 12 has finished it's iterations!
Worker 19 has finished it's iterations!
Worker 9 has finished it's iterations!
Worker 5 has finished it's iterations!
Worker 6 has finished it's iterations!
Worker 13 has finished it's iterations!
Worker 7 has finished it's iterations!
Worker 1 has finished it's iterations!
Worker 15 has finished it's iterations!
Worker 16 has finished it's iterations!
Worker 3 has finished it's iterations!
Worker 14 has finished it's iterations!
Worker 2 has finished it's iterations!
Worker 4 has finished it's iterations!
Split
Merge
Computing probs
New nz 10
Learning took 13.0 seconds
The difference lies in the parameter num_topics. In your command, it is set to 10, but in the readme file and my command, it's 100. Could you please set the num_topics to 10 on WWW branch?
Strangely, after I reset the whole environment it now works on the master branch, but I still get an AssertionError on the WWW branch. You can give it a try. I can now run the whole pipeline on the master branch. Thank you all the same.
The problem is most likely that some environments become empty during sampling, which I do not guard against. With 100 topics on a very small trace, many envs end up empty. I'll check the code so that the exception does not happen and sampling finishes.
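To illustrate the kind of guard meant here, below is a minimal sketch of a column normalization that tolerates all-zero columns. The "invalid value encountered in true_divide" warning on the line Theta_hz = Theta_hz / Theta_hz.sum(axis=0) is what NumPy reports for 0/0. The helper name normalize_columns and the choice to fall back to a uniform distribution are assumptions made for this example; this is not tribeflow's code, and the actual fix may handle empty environments differently (for instance by dropping them).

```python
import numpy as np


def normalize_columns(counts):
    """Divide each column by its sum; all-zero columns become uniform.

    Hypothetical helper for illustration, not part of tribeflow.
    """
    counts = np.asarray(counts, dtype=float)
    col_sums = counts.sum(axis=0)
    empty = col_sums == 0
    # Replace zero sums by 1 so the division never computes 0/0.
    safe_sums = np.where(empty, 1.0, col_sums)
    normalized = counts / safe_sums
    # Assumed fallback: spread the mass evenly over the rows of empty columns.
    normalized[:, empty] = 1.0 / counts.shape[0]
    return normalized


# The second column stands in for an empty environment.
theta = np.array([[2.0, 0.0],
                  [6.0, 0.0]])
result = normalize_columns(theta)
print(result)
```

With this guard, every column of the result sums to 1 and no NaN values are passed on to later steps.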
Many thanks for your time!
I followed all the procedures but encountered an error:
However, it runs smoothly if I remove the option --dynamic True when running main.py. By the way, it does not run until I install pytables manually with conda install pytables. I am using Python 2.7. How can I solve this correctly? I would appreciate any help. Thanks!