dwcaress / MB-System

MB-System is an open source software package for the processing and display of bathymetry and backscatter imagery data derived from multibeam, interferometry, and sidescan sonars.
https://www.mbari.org/products/research-software/mb-system/

Request for mbio/mbinfo test data #319

Open schwehr opened 5 years ago

schwehr commented 5 years ago

The best test data will already be in a public archive. The more metadata that accompanies a data sample, the better.

https://www3.mbari.org/data/mbsystem/html/mbio.html
http://www3.mbari.org/products/mbsystem/formatdoc/index.html
https://github.com/dwcaress/MB-System/blob/master/src/mbio/mb_format.h

This is a call for sample data files to help with testing of MB-System. Having coverage of as many formats, with as many of the possible packet/message types as we can get, will let MB-System grow over time without regressing its existing ability to read so many formats. Any file donated will be assumed to be contributed under the MB-System license, so that the files or portions of those files can ship along with MB-System. These will become part of MB-System's unittests and fuzzing infrastructure. Initially, I've just set up https://github.com/dwcaress/MB-System/blob/master/test/utilities/mbinfo_test.py with two formats covered: mb21/MBF_HSATLRAW and mb173/MBF_MGD77TXT. I've used mbcopy to create test files for other formats, but those are definitely suboptimal. Eventually, I'd like to have C++ tests that exercise each packet type on its own, but that is for down the road.
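To make the shape of these checks concrete, here is a minimal sketch of the kind of per-format smoke test I mean. This is not the actual contents of mbinfo_test.py; the file paths and expected output substrings are placeholders, and only the mbinfo -F/-I flags come from the real tool.

```python
# Hypothetical sketch of a per-format smoke test; the real checks live in
# test/utilities/mbinfo_test.py.  File names and expected substrings below
# are placeholders, not actual test data.
import subprocess
import unittest


class MbinfoSmokeTest(unittest.TestCase):
    def run_mbinfo(self, filename, format_id):
        # mbinfo -F <format> -I <file> prints a summary of the swath file.
        result = subprocess.run(
            ["mbinfo", "-F", str(format_id), "-I", filename],
            capture_output=True, text=True, check=True)
        return result.stdout

    def test_mb21_hsatlraw(self):
        out = self.run_mbinfo("testdata/mb21/sample.mb21", 21)
        self.assertIn("Number of Records:", out)

    def test_mb173_mgd77txt(self):
        out = self.run_mbinfo("testdata/mb173/sample.mb173", 173)
        self.assertIn("Number of Records:", out)


if __name__ == "__main__":
    unittest.main()
```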

Many of these formats are in public archives. I’m happy to use those and could use help from the community finding good examples.

What is most useful initially:

Not all of those files will be used in the unittest. For those that are used we might need to cut down files for size and time reasons before they can be good test inputs.

For later:

Known sources that people can look into:

https://www.ngdc.noaa.gov/multibeam-survey-search/

schwehr commented 5 years ago

I need to set up https://github.com/schwehr/mbreadsimrad/ to filter em### files down to their smallest size while preserving at least one of each datagram/packet type.
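For what it's worth, here is a rough sketch of the kind of filter I have in mind. It assumes the usual Kongsberg .all layout (a 4-byte little-endian length field that does not count itself, then an STX byte of 0x02 and a one-byte datagram type); treat the details as illustrative rather than a vetted implementation.

```python
# Rough sketch: trim a Kongsberg EM .all file down to one datagram of each
# type.  Assumes a 4-byte little-endian length prefix (not counting itself),
# then STX (0x02), then a one-byte datagram type.  Illustrative only.
import struct
import sys


def trim_all_file(src_path, dst_path):
    seen_types = set()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            header = src.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack("<I", header)
            body = src.read(length)
            if len(body) < length or len(body) < 2:
                break  # truncated datagram at end of file
            stx, dgm_type = body[0], body[1]
            if stx != 0x02:
                break  # lost sync; stop rather than guess
            if dgm_type not in seen_types:
                seen_types.add(dgm_type)
                dst.write(header)
                dst.write(body)
    return seen_types


if __name__ == "__main__":
    kept = trim_all_file(sys.argv[1], sys.argv[2])
    print("kept datagram types:", sorted(hex(t) for t in kept))
```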

dwcaress commented 5 years ago

Kurt,

A lot of MB-System failures are only exhibited when working with large amounts of data. I think we will ultimately need to construct a separate test repository with full size data samples - many problems result from the complexities of real datasets in which data records from different pings get mixed, or some records are corrupted, or some pings produce zero data, etc.

Since that is true, we don't have to attempt to achieve comprehensive testing with small files used for unit testing embedded in the primary code archive. Just getting a representative small sample of most formats and checking if each i/o module works at all will achieve a first order goal.

Thanks,
Dave

schwehr commented 5 years ago

@dwcaress Thanks for the comments. Some clarifications on what I am thinking about for this particular issue: running large datasets through as a less frequent test is a great idea, but I would typically call those integration tests.

Warning: rambling thoughts, written while bouncing along HWY 17 in the mountains, follow...

Here I'm aiming for fast and light "unit tests" that can be run for each commit. I think a large fraction of what you are talking about (but definitely not all) can be caught with these simpler small tests combined with fuzzing using pretty small corpus files. Once these tests are in place, we can set up ASAN and MSAN runners, along with working through all of the cppcheck complaints, perhaps with a side of Coverity (free for open source code) and the clang static analyzer. On top of that we can then add fuzzing + ASAN to really beat up the code.

With fuzzers, the corpus files are usually pretty small. With GDAL, I typically keep them under 4 KB, but we can try larger files. Using a coverage check, I can generate a corpus of files that covers as many of the code paths inside MB-System as possible. With GDAL (my own copy of GDAL with < 5% of the drivers active), that is about 100K files. It might sound like a lot, but running them all through ASAN- and MSAN-built binaries to find regressions goes pretty quickly, and I can generate them fairly easily on a 30-core dev desktop over a couple of months of mostly hands-off running.
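As a sketch of what such a corpus runner could look like (the binary path, corpus directory, and mbinfo invocation below are assumptions for illustration, not the current state of the repo):

```python
# Sketch of a corpus runner that pushes every sample through an ASAN-built
# mbinfo and records nonzero exits.  The binary path, corpus directory, and
# timeout are assumptions, not existing repo conventions.
import pathlib
import subprocess

ASAN_MBINFO = "./build-asan/src/utilities/mbinfo"  # assumed build location
CORPUS_DIR = pathlib.Path("corpus")                # assumed corpus layout


def run_corpus():
    failures = []
    for sample in sorted(CORPUS_DIR.iterdir()):
        try:
            proc = subprocess.run(
                [ASAN_MBINFO, "-I", str(sample)],
                capture_output=True, text=True, timeout=30)
        except subprocess.TimeoutExpired:
            failures.append((sample, "timeout", ""))
            continue
        # ASAN aborts with a nonzero exit status on memory errors.
        if proc.returncode != 0:
            failures.append((sample, proc.returncode, proc.stderr[-2000:]))
    return failures


if __name__ == "__main__":
    for sample, status, stderr in run_corpus():
        print(f"{sample}: {status}\n{stderr}\n")
```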

This strategy has worked really well for GDAL and all of its dependencies; it has caught >7K bugs in GDAL. About 90% of those weren't very interesting, with the biggest impact typically being poor error reporting or hindering code analyzers (both in compilers and static analyzers). The result is that using GDAL has gotten drastically better for the users I support.

I expect this strategy to take quite a while to work through on the existing code, and it really isn't ever done as long as people continue contributing to the code base. It just becomes part of the process and is mostly automated. e.g., I just got a Coverity email about GDAL with another 200 things that it doesn't like.

Only after that would I worry about automated runs of large files. But if someone really wants to set up a periodic runner for large batches of data, they should feel free to go for it.

schwehr commented 5 years ago

Format 92, MBF_ELMK2UNB, appears to have issues that surfaced while working on #365. Getting a sample would be really helpful for debugging. From mbio:

           MBIO Data Format ID:  92
           Format name:          MBF_ELMK2UNB
           Informal Description: Elac BottomChart MkII shallow
                                 water multibeam
           Attributes:           126 beam bathymetry and
                                 amplitude, binary, University
                                 of New Brunswick.