emorice / gemz

Gene expression model zoo

Store a regression reference in git #21

Open emorice opened 1 year ago

emorice commented 1 year ago

We want to be able to run regression tests without dumping huge files in the repo. But for reproducibility and automation, it is crucial to be able to regenerate a reference set of regression files at any point. So all the information needed to reproduce a set of files that was manually flagged as correct has to be stored in the repo. That means at the very least a commit id, and probably also a pip freeze dump, since we do not freeze dependencies elsewhere. Some system info too, at least the Python version. Checksums could be nice too.
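To make the list above concrete, here is a minimal sketch of what such a reference record could look like. Everything in it is an assumption for illustration: the function names (`sha256_of`, `build_manifest`, `current_commit`, `installed_packages`) and the JSON-style layout are hypothetical and not part of gemz.

```python
import hashlib
import platform
import subprocess
import sys


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit():
    """Ask git for the commit id of HEAD (needs to run inside a checkout)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def installed_packages():
    """Return the lines printed by `pip freeze` for the running interpreter."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def build_manifest(commit, files, packages):
    """Assemble a small, JSON-serializable description of a reference run.

    Hypothetical layout: commit id, Python version, machine type,
    package pins, and one checksum per regression file.
    """
    return {
        "commit": commit,
        "python": platform.python_version(),
        "machine": platform.machine(),
        "packages": sorted(packages),
        "checksums": {str(path): sha256_of(path) for path in files},
    }
```

A call such as `build_manifest(current_commit(), files, installed_packages())` would produce a small dictionary that can be written with `json.dump` and committed in place of the large files themselves.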

One tricky point is the workflow to add or change regression files. With this proposal you would need two commits: one that introduces the change and fails the regression test, and one that references the first and flags the changes as accepted. That sounds reasonable (it's not that different from, say, bumping submodules) but it is a bit of a process burden (like... bumping submodules).

emorice commented 1 year ago

Alternatively, one could commit checksums along with the change. However, I'm not sure how easy these would be to reproduce: some numerical results may change with system libraries or CPU architecture (I'm especially worried about different systems having different BLAS implementations or versions), so even freezing the Python packages may not be enough. When this happens, newly generated regression files could mismatch the checksums the developer included in the commit, yet still pass when the reference files are regenerated from the very same commit. Most likely, numerical regression tests only make sense as a comparison of two versions of the code on the same system.
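The fragility of checksums on numerical output can be illustrated without BLAS at all. The sketch below is only an analogy: it changes the order of a floating-point sum, which is one of the things a different BLAS build may do internally, and shows that the checksum changes while the value is numerically the same. The `checksum` helper and the tolerance are assumptions chosen for the example.

```python
import hashlib
import math
import struct


def checksum(values):
    """SHA-256 over the raw little-endian IEEE-754 bytes of a list of floats."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two results differ in their last bit, so the checksums differ...
same_bits = checksum([left_to_right]) == checksum([right_to_left])

# ...even though the results agree to well within rounding error.
same_value = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)

print(same_bits, same_value)
```

A committed checksum would flag this as a regression even though nothing meaningful changed, which is why comparing against files regenerated on the same system looks safer.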

Maybe the repo could include file stubs with a comment describing the reason for the last change. Then one could git blame the stub to obtain the correct commit and generate the actual file.
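A rough sketch of the stub lookup, assuming the stub is a small text file whose first line carries the comment. The helper names are hypothetical; the git call relies on `git blame --porcelain`, whose output starts each line group with a header of the form `<commit> <original line> <final line> <count>`.

```python
import subprocess


def parse_blame_porcelain(text):
    """Extract the commit id from the first header line of porcelain blame output."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty blame output (is the stub empty or untracked?)")
    commit = lines[0].split()[0]
    if len(commit) < 40 or any(c not in "0123456789abcdef" for c in commit):
        raise ValueError(f"unexpected blame header: {lines[0]!r}")
    return commit


def reference_commit(stub_path):
    """Return the commit that last changed the first line of a stub file.

    Hypothetical helper: needs to run inside the git checkout that
    tracks the stub.
    """
    result = subprocess.run(
        ["git", "blame", "--porcelain", "-L", "1,1", "--", stub_path],
        check=True, capture_output=True, text=True,
    )
    return parse_blame_porcelain(result.stdout)
```

The returned commit id could then be checked out in a scratch worktree to regenerate the actual reference file.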