MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

Test file generation #484

Open kain88-de opened 9 years ago

kain88-de commented 9 years ago

Original discussion at #474

tl;dr

I think it is OK to create synthetic test-files with MDAnalysis when we use different tools to check their content. @orbeckst thinks we should create the files already with different tools.


For the new coordinate reader tests I would like to generate a set of artificial test-files so that, if possible, we have the same trajectory saved in each data format.

There are two questions here: what is the best way to create this new set of test data, and how can we be sure about the content of the files?

@orbeckst noted that if we create the files with MDAnalysis itself, we might have a bug in the writer module that cancels out with a bug in the reader module. This is a valid concern, but I think this might also happen if we use any other tool (there is no guarantee that the other program has no bugs).

I would instead suggest that we do create the files with MDAnalysis, for convenience, and later validate their content with different tools (e.g. gmxcheck, VMD). In this approach we make sure that the files actually contain the information that we want them to, and double-check that, instead of relying on other tools to produce the right files.
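
Roughly, I am thinking of something along these lines (the Universe.empty call, the coordinates and the box values are only placeholders to illustrate the idea against a recent MDAnalysis API; the binary outputs would then be cross-checked with gmxcheck/VMD):

import numpy as np
import MDAnalysis as mda

# build a tiny synthetic system with known, easily predictable coordinates
u = mda.Universe.empty(n_atoms=10, trajectory=True)
u.atoms.positions = np.arange(30, dtype=np.float32).reshape(10, 3)
u.dimensions = [50, 50, 50, 90, 90, 90]

# write the same frame in several formats; the binary ones get validated externally
for ext in ('pdb', 'gro', 'xtc', 'trr'):
    with mda.Writer('reference.' + ext, n_atoms=u.atoms.n_atoms) as w:
        w.write(u.atoms)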

I think this is mostly a concern for the binary formats. The ASCII files can be checked by hand/eye since the trajectory we would use is very short. I did a quick check of which category each format we support falls into:

Clear text formats: pdb, psf, xyz, pqr, crd, gro, prmtop, pdbqt, dlpoly, gms
Homegrown binary formats: xtc, trr, dcd, tpr
Binary formats based on an external standard: netcdf, dms (the file utility tells me this is basically an SQLite3 file)

The problematic file formats are the homegrown binary formats (xtc and trr are not that much of a problem; I've gotten fairly comfortable with the xdr library by now). Those we have to check with the tools provided by the projects that introduced the format. I'm not that concerned with the sqlite and netcdf formats; those are multipurpose libraries that are 'self-documenting', and their content is rather easy to check with several different tools.
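
For example, checking what a netcdf trajectory actually contains takes only a few lines with scipy (the file name and the 'coordinates' variable below just follow the Amber netcdf convention; ncdump or VMD would work just as well):

from scipy.io import netcdf_file

# dump the variables of an Amber-style netcdf trajectory, independent of MDAnalysis
nc = netcdf_file('reference.ncdf', mmap=False)
print(list(nc.variables.keys()))
print(nc.variables['coordinates'][0])  # coordinates of the first frame
nc.close()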

@orbeckst raised the concern that some of the formats are not really well documented; PDB is a good example. There are some standards about what a valid pdb-file is, but a lot of tools ignore them and still read the file even if some columns are missing. I don't think this is currently a problem. As far as I have seen in the coordinate tests, we only check for malformed-but-still-OK files in the case of the PDB reader; we don't do that for other file formats. And if we decide that some format should also still be valid with 'missing data', we would just add a new test case (with the bonus that it is then explicitly documented as a test).

Additionally we should have a document that describes how the test-files have been created. That document should also contain some information about how to validate the content of each test-file. This information might come in handy in the future.

richardjgowers commented 9 years ago

I think the thing that's been missed is that the 'other tools' to generate the trajectories are the MD programs themselves. So these are pretty much gospel to us; if there's a bug in MDProgram V1.0, it's not really a bug to us, and we have to read the coordinates it writes anyway.

Having the same coordinates for all the test files might make them a bit neater, but it doesn't actually make them "better" tests. As long as there's a known set of coordinates that they represent then that's enough.

If you really want to use the same coordinates, you could create the trajectory using a utility from the MD package (like what @jbarnoud did for the tpr tests). Then it has passed through the program, so it's a "real" trajectory.

richardjgowers commented 9 years ago

I just had a thought re: the splitting of tests and package. If we do use synthetic test files, they're probably going to have small file sizes, which means putting them inside the package is less painful. So maybe we could have skinny tests and the full tests. I'm not sure if that would just get messy for both users and devs though.

kain88-de commented 9 years ago

Having the tests in the source directory seems pretty standard to me. Most Python modules I know do that. But they don't ship the tests.

Having the same coordinates for all the test files might make them a bit neater, but it doesn't actually make them "better" tests. As long as there's a known set of coordinates that they represent then that's enough.

Yeah, that is right. With all files containing the same frames in an easily predictable manner, it is just really neat to write the tests and check that whole frames are read correctly. But I have found another solution which would give us neat code and arbitrary frame content at the same time: MDTraj stores the frames for each test-file separately in an hdf5 file. Then the code to create a Reference for a new Reader test would be:

import h5py

class XTCReference(BaseReference):
    def __init__(self):
        # load the expected frames from the HDF5 file shipped with the tests
        with h5py.File(get_file('XTC-test.h5'), 'r') as h5:
            self.first_frame = h5['first_frame'][:]
            self.second_frame = h5['second_frame'][:]

Still quite neat to me, and we would check the whole frame. It will make the repository and the tests module a bit bigger in size, and we would depend on PyTables/h5py for the tests, though. I actually like this approach, and I think it is better to check complete frames for correctness than just the center of mass like we do now.
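
For completeness, producing such a reference file is itself only a few lines (file name, dataset names and array values here are placeholders matching the snippet above):

import h5py
import numpy as np

# store the expected coordinates of each frame as separate datasets
with h5py.File('XTC-test.h5', 'w') as h5:
    h5.create_dataset('first_frame', data=np.zeros((10, 3), dtype=np.float32))
    h5.create_dataset('second_frame', data=np.ones((10, 3), dtype=np.float32))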

dotsdl commented 9 years ago

@kain88-de I like the idea of having a reference set of coordinates in a flexible format, and I think mdtraj's HDF5 format spec fits the bill. This will require a new reader and writer, but I think the format is worth supporting.

orbeckst commented 9 years ago

Installing hdf5 just to run the tests, while we are also thinking about dropping the hdf5 requirement for the netcdf reader, is a step backwards in terms of ease of installation.

I'd like to be able to tell any user to run the tests without requiring them to install more than what's required for the actual package.


hainm commented 9 years ago

I'd like to be able to tell any user to run the tests without requiring them to install more than what's required for the actual package.

I agree with @orbeckst about this point.

@dotsdl: isn't Amber's netcdf format flexible enough? :P (j/k). As far as I know, VMD does not support mdtraj's format yet, so I would stick with supported ones. The more formats you introduce, the more time you and other developers need to spend on maintaining them. (Actually @swails has a lot of experience with this stuff, so I'm just taking some points from discussions with him.)

kain88-de commented 9 years ago

Installing hdf5 just to run the tests, while we are also thinking about dropping the hdf5 requirement for the netcdf reader, is a step backwards in terms of ease of installation.

ONLY for the tests. It is nice that normal users can run the test-suite by installing it, but I don't see any special value in it. I would think the number of users actually running the tests is very low; they are mainly a tool for devs, and I don't feel particularly bad about them having to install another dependency.

orbeckst commented 9 years ago

On 26 Oct, 2015, at 13:58, kain88-de wrote:

Installing hdf5 just to run the tests, while we are also thinking about dropping the hdf5 requirement for the netcdf reader, is a step backwards in terms of ease of installation.

ONLY for the tests. It is nice that normal users can run the test-suite by installing it, but I don't see any special value in it. I would think the number of users actually running the tests is very low; they are mainly a tool for devs, and I don't feel particularly bad about them having to install another dependency.

Having a user run the tests to confirm that their installation is working properly (or not) is quite valuable for debugging. My impression is that having low barriers is important when dealing with actual users.

Why not just pickle numpy arrays and/or dicts? Admittedly, we lose on prettiness and elegance but win on portability. Or netcdf 3, if we decide to drop netCDF4? Picking up @hainm's suggestion: I think the Amber format does not cover all bases (e.g. TRZ can contain huge amounts of additional data, IIRC), but one could probably easily bolt it onto something resembling an Amber nc format.
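
Something along these lines (file name, keys and shapes are placeholders), needing nothing beyond numpy and the standard library:

import pickle
import numpy as np

# reference frames as plain numpy arrays in a dict
reference = {
    'first_frame': np.zeros((10, 3), dtype=np.float32),
    'second_frame': np.ones((10, 3), dtype=np.float32),
}
with open('XTC-reference.pkl', 'wb') as f:
    pickle.dump(reference, f)

# in the test, load it back and compare whole frames
with open('XTC-reference.pkl', 'rb') as f:
    frames = pickle.load(f)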

richardjgowers commented 9 years ago

@orbeckst I was thinking the same re: valuable for end users for debugging, but the fact that the tests are a separate package negates this somewhat. It would be a good argument for trying to slim down the tests a little and packaging them together.

orbeckst commented 9 years ago

On 26 Oct, 2015, at 17:45, Richard Gowers wrote:

@orbeckst I was thinking the same re: valuable for end users for debugging, but the fact that the tests are a separate package negates this somewhat.

As long as

pip install MDAnalysisTests

works, I don't see this as a huge problem.

It would be a good argument for trying to slim down the tests a little and packaging them together.

Sure, that would be even more convenient, if one could do it in a sensible fashion.