biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org

JSON and HDF5 output data not reproducible due to timestamp #895

Closed peterjc closed 1 year ago

peterjc commented 1 year ago

Quoting table.py, both methods to_json and to_hdf5 use the following:

date = '"date": "%s",' % datetime.now().isoformat()

Using a live date means that an otherwise reproducible analysis will fail a simple diff because of the timestamp.

Quoting https://biom-format.org/documentation/format_versions/biom-1.0.html

date : <datetime> Date the table was built (ISO 8601 format)

Quoting https://biom-format.org/documentation/format_versions/biom-2.0.html and https://biom-format.org/documentation/format_versions/biom-2.1.html

creation-date : <datetime> Date the table was built (ISO 8601 format)

In both cases, this is clearly a required field, so I think the best solution is to allow the date to be passed as an optional argument (defaulting to the current behaviour of using now). The user could then explicitly use (for example) the last modified date of their input data and metadata. It would also make it possible to use diff in continuous integration testing.
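To illustrate the proposal, here is a minimal sketch of an optional date argument. It is not the biom-format code: `table_to_json` and its `creation_date` parameter are hypothetical names used only for this example, and the output is a toy document rather than a valid BIOM table.

```python
import json
from datetime import datetime


def table_to_json(data, creation_date=None):
    """Serialize a toy table with a "date" field.

    Hypothetical helper, not part of biom-format. If creation_date is
    None, fall back to the current time, which is what happens today.
    """
    if creation_date is None:
        creation_date = datetime.now()
    doc = {"date": creation_date.isoformat(), "data": data}
    # sort_keys gives a stable key order, so only the date can vary.
    return json.dumps(doc, sort_keys=True)


# With an explicit date, two runs give identical output.
fixed = datetime(2020, 1, 1, 12, 0, 0)
first = table_to_json([[0, 0, 1.0]], creation_date=fixed)
second = table_to_json([[0, 0, 1.0]], creation_date=fixed)
```

A caller wanting to tie the date to the input data could compute it with something like `datetime.fromtimestamp(os.path.getmtime(path))` and pass that in.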

In comparison, although the BAM format for sequencing data uses the GZIP header, most implementations deliberately do not fill in the MTIME field, ensuring full reproducibility.
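The same idea can be shown with Python's standard `gzip` module, which accepts an `mtime` argument. This is only a sketch of the general technique and not how any particular BAM implementation works; `gzip_bytes` is a hypothetical helper name.

```python
import gzip
import io


def gzip_bytes(payload, mtime=0):
    """Compress payload in memory with a fixed MTIME header field.

    Hypothetical helper for illustration. Passing mtime=0 stops the
    gzip module from writing the current time into the header.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as handle:
        handle.write(payload)
    return buf.getvalue()


# Two independent compressions of the same payload are byte-identical.
a = gzip_bytes(b"ACGT" * 10)
b = gzip_bytes(b"ACGT" * 10)
```

Bytes 4 to 7 of a gzip stream hold the MTIME field, so with `mtime=0` they are all zero and the output no longer depends on when it was written.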

wasade commented 1 year ago

Thanks, @peterjc! I completely agree with this proposition. For additional context, the exact lines impacted are here and here.

These should be pretty minor changes to make. I'll add them in the next release, and I think cutting a minor one relatively quickly to support this is valuable.