alugowski / fast_matrix_market

Fast and full-featured Matrix Market I/O library for C++, Python, and R
BSD 2-Clause "Simplified" License
75 stars 7 forks source link

Is this a valid Matrix Market format? #10

Closed eriknw closed 1 year ago

eriknw commented 1 year ago

See: https://github.com/scipy/scipy/blob/69e0c474f886990a85a88521cffb8b3978bdfd66/scipy/io/tests/test_mmio.py#L517-L533

scipy.io.mmread can read this file, but fmm.mmread raises an error. I don't know which library is "correct", but I thought you would like to know about differences between fmm.mmread and scipy.io.mmread.

Reproducer:

import fast_matrix_market as fmm
import scipy
from io import StringIO
from scipy.io.tests import test_mmio

A1 = scipy.io.mmread(StringIO(test_mmio._empty_lines_example))  # works
A2 = fmm.mmread(StringIO(test_mmio._empty_lines_example))  # ValueError: Line 3: Too many lines in file (file too long)
alugowski commented 1 year ago

Context:

This example comes from the draft Matrix Market format document https://math.nist.gov/MatrixMarket/reports/MMformat.ps. The draft suggests parsing tolerance like case insensitivity and that empty lines should be allowed anywhere in the file. The linked matrix is the example they give: image

as text:

%%MatrixMarket  MATRIX    Coordinate    Real General

   5  5         8

1 1  1.0
2 2       10.5
3 3             1.5e-2
4 4                     -2.8E2
5 5                              12.
     1      4      6
     4      2      250.5
     4      5      33.32

I'd strongly argue that allowing empty lines is a terrible idea. It's completely unnecessary and complicates the parser for no real gain. It's cheap enough in a sequential parser, but really throws a spanner in the works for a parallel one. I've also yet to see a real .mtx file that has random empty lines (apart from at the end), and frankly nearly all MM parsers out there will choke on that file.

That said, that document is as close as we have to a standard, though that is still a draft and MM in some ways is like CSV with various flavors out there. So for completeness sake I did take a stab at supporting this.

alugowski commented 1 year ago

The fix requires more care in handling line numbers. The basic stuff like treating multiple newlines as one is easy.

The tricky part is actually counting lines. The way fast_matrix_market parallelism works is that the file is split into chunks on newline boundaries. Then each chunk gets a line count, that way we know what line number each chunk starts at, and line numbers determine the element number which determines where the value should be written to.

This line count determines the critical path length. The parsing of blocks is then embarrassingly parallel.

Previously we could just use std::count, which is very fast. But that does not work anymore, since with random empty lines the line number no longer equals element number. Naive methods are about 10x slower than std::count, but I found methods that are only 2x slower. That does mean the method is fast enough that overall performance does not appear to be affected.

eriknw commented 1 year ago

Impressive! I agree this seems strange for a file format, but I'm glad you were able to support it without a noticeable performance hit.