GraphChi / graphchi-cpp

GraphChi's C++ version. Big Data - small machine.
https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf

Unconventional MatrixMarket format #25

Open fenekku opened 10 years ago

fenekku commented 10 years ago

Several output files produced by the collaborative filtering algorithms in toolkits/collaborative_filtering advertise themselves as MatrixMarket files, either through a .mm extension or a %%MatrixMarket matrix array real general header, but do not seem to follow the MatrixMarket format as defined by NIST.

For example, running ./toolkits/collaborative_filtering/rating --training=smallnetflix_mm --num_ratings=5 --quiet=1 --algorithm=als produces two output files.

Their headers look like this (only one is shown here):

$ head -n 10 smallnetflix_mm.ids
%%MatrixMarket matrix array real general 
%This file contains item ids matching the ratings. In each row i, num_ratings top item ids for user i. (First column: user id, next columns, top K ratings). Note: 0 item id means there are no more items to recommend for this user.
95526 6 
1 1243 424 2641 2109 1557
2 2641 1548 1227 548 76 
3 1243 2548 1227 2641 76 
4 1449 2641 2109 3172 1227 
5 1449 1227 2298 735 1382 
6 2109 2669 1227 3112 2583
7 3516 2016 2647 1548 1243 

'array' here tells the parser to expect one value per line, in column-major order, yet that is not the case: each line holds a full row of values. Other files with the same problem include those ending in _U.mm or _V.mm.
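To make the mismatch concrete, here is a minimal sketch of what I understand a conforming 'array' file to look like: a size line followed by one value per line, walking down each column in turn. The helper name to_mm_array is made up for illustration and is not part of GraphChi or scipy.

```python
def to_mm_array(rows):
    """Serialize a dense matrix (list of equal-length rows) as MatrixMarket
    'array real general' text: size line, then one value per line in
    column-major order. Hypothetical helper, for illustration only."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    lines = ["%%MatrixMarket matrix array real general", f"{n_rows} {n_cols}"]
    for j in range(n_cols):          # outer loop over columns ...
        for i in range(n_rows):      # ... inner loop over rows: column-major
            lines.append(repr(float(rows[i][j])))
    return "\n".join(lines) + "\n"


# Two rows, two columns: the first column (1, 2) is written before the second.
text = to_mm_array([[1, 1243], [2, 2641]])
print(text)
```

With this layout a 2 x 2 matrix takes four data lines after the size line, whereas the .ids file above writes one line per row.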

This problem is especially apparent when reading these files with mmread from scipy.io (a common third-party way to read MatrixMarket files in Python): the format is treated as invalid and the file can't be read. (The --R_output_format option does not change any of that for me.)

I might be missing something here though. Thanks for the tool :).

zachmayer commented 10 years ago

There's a similar issue with inputs, particularly for the gensgd program.

meteotester commented 9 years ago

More about this unconventional MatrixMarket format: https://github.com/GraphChi/graphchi-cpp/issues/9