dwcaress / MB-System

MB-System is an open source software package for the processing and display of bathymetry and backscatter imagery data derived from multibeam, interferometry, and sidescan sonars.
https://www.mbari.org/products/research-software/mb-system/

Request for mbio/mbinfo test data #319

Open schwehr opened 5 years ago

schwehr commented 5 years ago

The best test data will already be in a public archive. The more metadata that accompanies a data sample, the better.

https://www3.mbari.org/data/mbsystem/html/mbio.html
http://www3.mbari.org/products/mbsystem/formatdoc/index.html
https://github.com/dwcaress/MB-System/blob/master/src/mbio/mb_format.h

This is a call for sample data files to help with testing of MB-System. Having coverage of as many formats, with as many of the possible packet/message types as we can get, will let MB-System grow over time without regressing its existing ability to read so many formats. Any file donated will be assumed to be contributed under the MB-System license, so that the files or portions of those files can ship along with MB-System. These will become part of MB-System's unittests and fuzzing infrastructure. Initially, I've just set up https://github.com/dwcaress/MB-System/blob/master/test/utilities/mbinfo_test.py with two formats covered: mb21/MBF_HSATLRAW and mb173/MBF_MGD77TXT. I've used mbcopy to create test files for other formats, but those are definitely suboptimal. Eventually, I'd like to have C++ tests that exercise each packet type on its own, but that is for down the road.
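To make the shape of these checks concrete, here is a minimal sketch of the kind of per-format smoke test I mean. This is not the actual contents of mbinfo_test.py; the file paths and expected output substrings are placeholders, and only the mbinfo -F/-I flags come from the real tool.

```python
# Hypothetical sketch of a per-format smoke test; the real checks live in
# test/utilities/mbinfo_test.py.  File names and expected substrings below
# are placeholders, not actual test data.
import subprocess
import unittest


class MbinfoSmokeTest(unittest.TestCase):
    def run_mbinfo(self, filename, format_id):
        # mbinfo -F <format> -I <file> prints a summary of the swath file.
        result = subprocess.run(
            ["mbinfo", "-F", str(format_id), "-I", filename],
            capture_output=True, text=True, check=True)
        return result.stdout

    def test_mb21_hsatlraw(self):
        out = self.run_mbinfo("testdata/mb21/sample.mb21", 21)
        self.assertIn("Number of Records:", out)

    def test_mb173_mgd77txt(self):
        out = self.run_mbinfo("testdata/mb173/sample.mb173", 173)
        self.assertIn("Number of Records:", out)


if __name__ == "__main__":
    unittest.main()
```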

Many of these formats are in public archives. I’m happy to use those and could use help from the community finding good examples.

What is most useful initially:

Not all of those files will be used in the unittest. For those that are used we might need to cut down files for size and time reasons before they can be good test inputs.

For later:

Known sources that people can look into:

https://www.ngdc.noaa.gov/multibeam-survey-search/

schwehr commented 5 years ago

I need to set up https://github.com/schwehr/mbreadsimrad/ to filter em### files down to their smallest size while preserving at least one of each datagram/packet type.
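For what it's worth, here is a rough sketch of the kind of filter I have in mind. It assumes the usual Kongsberg .all layout (a 4-byte little-endian length field that does not count itself, then an STX byte of 0x02 and a one-byte datagram type); treat the details as illustrative rather than a vetted implementation.

```python
# Rough sketch: trim a Kongsberg EM .all file down to one datagram of each
# type.  Assumes a 4-byte little-endian length prefix (not counting itself),
# then STX (0x02), then a one-byte datagram type.  Illustrative only.
import struct
import sys


def trim_all_file(src_path, dst_path):
    seen_types = set()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            header = src.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack("<I", header)
            body = src.read(length)
            if len(body) < length or len(body) < 2:
                break  # truncated datagram at end of file
            stx, dgm_type = body[0], body[1]
            if stx != 0x02:
                break  # lost sync; stop rather than guess
            if dgm_type not in seen_types:
                seen_types.add(dgm_type)
                dst.write(header)
                dst.write(body)
    return seen_types


if __name__ == "__main__":
    kept = trim_all_file(sys.argv[1], sys.argv[2])
    print("kept datagram types:", sorted(hex(t) for t in kept))
```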

dwcaress commented 5 years ago

Kurt,

A lot of MB-System failures are only exhibited when working with large amounts of data. I think we will ultimately need to construct a separate test repository with full size data samples - many problems result from the complexities of real datasets in which data records from different pings get mixed, or some records are corrupted, or some pings produce zero data, etc.

Since that is true, we don't have to attempt to achieve comprehensive testing with small files used for unit testing embedded in the primary code archive. Just getting a representative small sample of most formats and checking if each i/o module works at all will achieve a first order goal.

Thanks,
Dave

schwehr commented 5 years ago

@dwcaress Thanks for the comments. Some clarifications on what I am thinking about for this particular issue: running large datasets through as a less frequent test is a great idea, but I would typically call those integration tests.

Warning: rambling thoughts, written while bouncing along HWY 17 in the mountains, follow...

Here I'm aiming for fast and light "unit tests" that can be run for each commit. I think a large fraction of what you are talking about (but definitely not all) can be caught with these simpler small tests combined with fuzzing using pretty small corpus files. Once these tests are in place, we can set up ASAN and MSAN runners, along with working through all of the cppcheck complaints, perhaps with a side of Coverity (free for open source code) and the clang static analyzer. On top of that we can then add fuzzing + ASAN to really beat up the code.

With fuzzers, the corpus files are usually pretty small. With GDAL, I typically keep them under 4 KB, but we can try larger files. Using a coverage check, I can generate a corpus of files that covers as many of the code paths inside MB-System as possible. With GDAL (my own copy of GDAL with < 5% of the drivers active), that is about 100K files. It might sound like a lot, but running them all through ASAN- and MSAN-built binaries to find regressions goes pretty quickly, and I can generate them fairly easily on a 30-core dev desktop over a couple of months of mostly hands-off running.
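As a sketch of what such a corpus runner could look like (the binary path, corpus directory, and mbinfo invocation below are assumptions for illustration, not the current state of the repo):

```python
# Sketch of a corpus runner that pushes every sample through an ASAN-built
# mbinfo and records nonzero exits.  The binary path, corpus directory, and
# timeout are assumptions, not existing repo conventions.
import pathlib
import subprocess

ASAN_MBINFO = "./build-asan/src/utilities/mbinfo"  # assumed build location
CORPUS_DIR = pathlib.Path("corpus")                # assumed corpus layout


def run_corpus():
    failures = []
    for sample in sorted(CORPUS_DIR.iterdir()):
        try:
            proc = subprocess.run(
                [ASAN_MBINFO, "-I", str(sample)],
                capture_output=True, text=True, timeout=30)
        except subprocess.TimeoutExpired:
            failures.append((sample, "timeout", ""))
            continue
        # ASAN aborts with a nonzero exit status on memory errors.
        if proc.returncode != 0:
            failures.append((sample, proc.returncode, proc.stderr[-2000:]))
    return failures


if __name__ == "__main__":
    for sample, status, stderr in run_corpus():
        print(f"{sample}: {status}\n{stderr}\n")
```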

This strategy has worked really well for GDAL and all of its dependencies; it has caught >7K bugs in GDAL. About 90% of those weren't very interesting, with the biggest impact typically being poor error reporting or hindering code analyzers (both in compilers and static analyzers). The result is that using GDAL has gotten drastically better for the users I support.

I expect this strategy to take quite a while to work through on the existing code, and it really isn't ever done as long as people continue contributing to the code base. It just becomes part of the process and is mostly automated. e.g., I just got a Coverity email about GDAL with another 200 things that it doesn't like.

Only after that would I worry about automated runs of large files. But if someone really wants to set up a periodic runner for large batches of data, they should feel free to go for it.

schwehr commented 5 years ago

Format 92, MBF_ELMK2UNB, appears to have issues that surfaced while working on #365. Getting a sample would be really helpful for debugging. From mbio:

           MBIO Data Format ID:  92
           Format name:          MBF_ELMK2UNB
           Informal Description: Elac BottomChart MkII shallow
                                 water multibeam
           Attributes:           126 beam bathymetry and
                                 amplitude, binary, University
                                 of New Brunswick.