Closed eriknw closed 1 year ago
Context:
This example comes from the draft Matrix Market format document https://math.nist.gov/MatrixMarket/reports/MMformat.ps. The draft suggests parsing tolerance like case insensitivity and that empty lines should be allowed anywhere in the file. The linked matrix is the example they give:
as text:
%%MatrixMarket MATRIX Coordinate Real General
5 5 8
1 1 1.0
2 2 10.5
3 3 1.5e-2
4 4 -2.8E2
5 5 12.
1 4 6
4 2 250.5
4 5 33.32
I'd strongly argue that allowing empty lines is a terrible idea. It's completely unnecessary and complicates the parser for no real gain. It's cheap enough in a sequential parser, but really throws a spanner in the works for a parallel one. I've also yet to see a real .mtx file that has random empty lines (apart from at the end), and frankly nearly all MM parsers out there will choke on that file.
That said, that document is as close as we have to a standard, though that is still a draft and MM in some ways is like CSV with various flavors out there. So for completeness sake I did take a stab at supporting this.
The fix requires more care in handling line numbers. The basic stuff like treating multiple newlines as one is easy.
The tricky part is actually counting lines. The way fast_matrix_market parallelism works is that the file is split into chunks on newline boundaries. Then each chunk gets a line count, that way we know what line number each chunk starts at, and line numbers determine the element number which determines where the value should be written to.
This line count determines the critical path length. The parsing of blocks is then embarrassingly parallel.
Previously we could just use std::count
, which is very fast. But that does not work anymore, since with random empty lines the line number no longer equals element number. Naive methods are about 10x slower than std::count
, but I found methods that are only 2x slower. That does mean the method is fast enough that overall performance does not appear to be affected.
Impressive! I agree this seems strange for a file format, but I'm glad you were able to support it without a noticeable performance hit.
See: https://github.com/scipy/scipy/blob/69e0c474f886990a85a88521cffb8b3978bdfd66/scipy/io/tests/test_mmio.py#L517-L533
scipy.io.mmread
can read this file, butfmm.mmread
raises an error. I don't know which library is "correct", but I thought you would like to know about differences betweenfmm.mmread
andscipy.io.mmread
.Reproducer: