centraldedados / datacentral

Tools for generating portable data portals

Keep track of the versions hosted locally #3

Open rlafuente opened 10 years ago

rlafuente commented 10 years ago

The files in _output are regenerated on every run. There are checks to see whether the Git repository of a data package has changed, but we currently have no way to know whether the files in _output are stale.

This is a problem for big datasets, where copying the CSV files to the download dir takes some time.

The solution would be a cache file that records the last commit from which a data package was generated, plus the checksum of each CSV data file, so we can determine whether the files are identical (and overwrite them if they are not).

rlafuente commented 9 years ago

The cache file could be a simple JSON file mapping file name -> md5. (The last commit is not ideal, since the file may be dirty, i.e. have uncommitted changes.)
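
A minimal sketch of what such a cache could look like, for discussion only. The function names (`md5sum`, `load_cache`, `sync_csv_files`), the cache file location, and the directory layout are all hypothetical and not taken from the datacentral code. It keeps a JSON object of file name -> md5 and copies a CSV only when its checksum differs from the cached one or the destination file is missing.

```python
import hashlib
import json
import shutil
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to handle big CSVs."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_cache(cache_path):
    """Load the file name -> md5 mapping; a missing or corrupt cache is treated as empty."""
    try:
        with open(cache_path) as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def sync_csv_files(source_dir, output_dir, cache_path):
    """Copy only new or changed CSV files and return the names that were copied.

    Hypothetical helper: assumes the CSV files sit directly inside source_dir.
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    cache = load_cache(cache_path)
    copied = []
    for src in sorted(source_dir.glob("*.csv")):
        checksum = md5sum(src)
        dest = output_dir / src.name
        # Skip the copy when the checksum matches and the output file is still there.
        if cache.get(src.name) == checksum and dest.exists():
            continue
        shutil.copyfile(src, dest)
        cache[src.name] = checksum
        copied.append(src.name)
    # Persist the updated mapping for the next run.
    with open(cache_path, "w") as fh:
        json.dump(cache, fh, indent=2, sort_keys=True)
    return copied
```

Because the checksum is computed from the file contents on disk, this works for a dirty working copy too, which the last-commit approach would miss. The checksum still requires reading each source file once per run, so what it saves is the copy (and any later processing), not the read.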