HERA-Team / librarian

The HERA Librarian.
BSD 2-Clause "Simplified" License

add file heritage tracking #6

Closed by dannyjacobs 8 years ago

dannyjacobs commented 8 years ago

Add a "parent_md5" column to the files table so that files can indicate the md5 sum of the most immediate predecessor file.

example:

| filename | md5 | parent_md5 |
| --- | --- | --- |
| /data1/2457311/zen.2457311.25783.xx.uv | afadfdfru98er23uhr98faewfe98h4rafdjfh2398thdaf9e8y | null |
| /data1/2457311/zen.2457311.25783.xx.uvA | dfeoijfanqer3498jgw409w34jg4509hrg09wj0459gjg099 | afadfdfru98er23uhr98faewfe98h4rafdjfh2398thdaf9e8y |
| /data1/2457311/zen.2457311.25783.xx.uvM | 54ywinwrv09jewfa098jq490jvw3409gjger0t9j430f9jw34 | afadfdfru98er23uhr98faewfe98h4rafdjfh2398thdaf9e8y |
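For illustration, here is a minimal sketch of how the proposed column could be queried, using an in-memory SQLite database. Only the column names `filename`, `md5`, and `parent_md5` and the three example rows come from the proposal above; the table definition, the helper names `children_of` and `parent_of`, and the use of SQLite are assumptions, not the Librarian's actual schema.

```python
import sqlite3

# Placeholder checksums copied from the example rows above.
UV_MD5 = "afadfdfru98er23uhr98faewfe98h4rafdjfh2398thdaf9e8y"
UVA_MD5 = "dfeoijfanqer3498jgw409w34jg4509hrg09wj0459gjg099"
UVM_MD5 = "54ywinwrv09jewfa098jq490jvw3409gjger0t9j430f9jw34"
BASE = "/data1/2457311/zen.2457311.25783.xx"

# Hypothetical table layout; the real files table has other columns too.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    "filename TEXT PRIMARY KEY, md5 TEXT NOT NULL, parent_md5 TEXT)"
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        (BASE + ".uv", UV_MD5, None),       # raw file: no predecessor
        (BASE + ".uvA", UVA_MD5, UV_MD5),   # derived from the .uv file
        (BASE + ".uvM", UVM_MD5, UV_MD5),   # derived from the .uv file
    ],
)


def children_of(conn, md5):
    """Return the filenames whose immediate predecessor has this md5."""
    cur = conn.execute(
        "SELECT filename FROM files WHERE parent_md5 = ? ORDER BY filename",
        (md5,),
    )
    return [row[0] for row in cur]


def parent_of(conn, filename):
    """Return the filename of the immediate predecessor, or None."""
    row = conn.execute(
        "SELECT p.filename FROM files AS c "
        "JOIN files AS p ON p.md5 = c.parent_md5 "
        "WHERE c.filename = ?",
        (filename,),
    ).fetchone()
    return row[0] if row else None
```

Calling `children_of(conn, UV_MD5)` lists the `.uvA` and `.uvM` files, and `parent_of` walks one step back up the chain; repeated calls would trace the full heritage of a file.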

david-macmahon commented 8 years ago

Sorry if this email reply is out-of-band, but is md5 going to be an indexed column or would this be a form of de-normalization (in DB lingo)? From someone who’s not been keeping up, it seems like the parent foreign key would be sufficient to refer (indirectly) to the parent’s md5 value.

Dave
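To make the normalized alternative concrete, here is a rough sketch in which each row stores a foreign key to its parent row and the parent's md5 is obtained through a join instead of being copied. The thread does not say which column such a key would reference; this sketch assumes `filename`, and the column name `parent`, the helper `parent_md5`, and the short file names and checksums are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute(
    """
    CREATE TABLE files (
        filename TEXT PRIMARY KEY,
        md5      TEXT NOT NULL,
        parent   TEXT REFERENCES files(filename)  -- hypothetical parent key
    )
    """
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("zen.uv", "md5-of-uv", None),         # placeholder checksums
        ("zen.uvA", "md5-of-uvA", "zen.uv"),
    ],
)


def parent_md5(conn, filename):
    """Look up the parent's md5 indirectly, through the parent key."""
    row = conn.execute(
        "SELECT p.md5 FROM files AS c "
        "JOIN files AS p ON p.filename = c.parent "
        "WHERE c.filename = ?",
        (filename,),
    ).fetchone()
    return row[0] if row else None
```

With this layout the md5 lives in exactly one row, so a corrected checksum never has to be propagated to child rows.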

(Sent by email in reply to Danny Jacobs's comment of Feb 18, 2016, 4:26 PM.)

pkgw commented 8 years ago

We'd need MD5 to be an indexed column to be able to look up files without having to scan every single file known to the Librarian!

This does raise an issue, in that this is a bit denormalize-y: we're acting as if the MD5 is a unique identifier for each file ... which it almost is, but do we want to allow a case where two "different" files have the same contents? If so, this MD5-based scheme is busted. I really wanted to avoid having arbitrary global "file ID numbers" but I'm not sure what the best approach is here.
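A small sketch of both points, assuming a hypothetical two-column `files` table in SQLite: the index lets a lookup by md5 avoid a full table scan, and two files with identical contents show why an md5 cannot stand in for a unique file identifier. The file names, index name, and helper functions are made up for illustration.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (filename TEXT PRIMARY KEY, md5 TEXT NOT NULL)")
# Non-unique index: speeds up lookups by md5 but still permits duplicates.
conn.execute("CREATE INDEX files_md5_idx ON files (md5)")


def add_file(conn, filename, contents):
    """Store a file record and return the md5 hex digest of its contents."""
    digest = hashlib.md5(contents).hexdigest()
    conn.execute("INSERT INTO files VALUES (?, ?)", (filename, digest))
    return digest


def files_with_md5(conn, md5):
    """Return every filename whose contents hash to this md5."""
    cur = conn.execute(
        "SELECT filename FROM files WHERE md5 = ? ORDER BY filename", (md5,)
    )
    return [row[0] for row in cur]


# Two "different" files with the same bytes share one md5.
md5_a = add_file(conn, "copy_a.uv", b"same bytes")
md5_b = add_file(conn, "copy_b.uv", b"same bytes")
matches = files_with_md5(conn, md5_a)
```

Here `matches` contains both file names, so a child row pointing at that md5 could not say which of the two was its actual parent.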

dannyjacobs commented 8 years ago

We're now thinking that it will be simpler to have a HISTORY table that provides a generic place to note actions taken against files. This will enable external services to select files that have not been processed and just generically be able to track status of things.

columns: id (pk, auto inc), status (short consistent string that can be selected against), payload (long string), datetime (record creation timestamp)
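As a rough sketch of how such a table might be used, the following builds it in an in-memory SQLite database with the four columns listed above. Everything else is an assumption for illustration: the column types, the status strings `"uploaded"` and `"processed"`, the convention that `payload` holds a file name, and the helper names `record` and `payloads_without`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE history (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        status     TEXT NOT NULL,   -- short, consistent, selectable string
        payload    TEXT,            -- long free-form string
        "datetime" TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def record(conn, status, payload):
    """Append one history row and return its auto-incremented id."""
    cur = conn.execute(
        "INSERT INTO history (status, payload) VALUES (?, ?)", (status, payload)
    )
    return cur.lastrowid


def payloads_without(conn, have, lacking):
    """Payloads that have a `have` record but no `lacking` record."""
    cur = conn.execute(
        "SELECT DISTINCT h.payload FROM history AS h "
        "WHERE h.status = ? AND NOT EXISTS ("
        "  SELECT 1 FROM history AS d "
        "  WHERE d.payload = h.payload AND d.status = ?) "
        "ORDER BY h.payload",
        (have, lacking),
    )
    return [row[0] for row in cur]


# Hypothetical usage: payload is assumed to carry the file name.
first_id = record(conn, "uploaded", "zen.2457311.25783.xx.uv")
record(conn, "uploaded", "zen.2457311.25783.xx.uvA")
record(conn, "processed", "zen.2457311.25783.xx.uv")
pending = payloads_without(conn, "uploaded", "processed")
```

An external service could poll with a query like `payloads_without` to find work and then write its own status row when done, which is the generic status tracking described above.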

pkgw commented 8 years ago

History table added to DB so this feature is tabled for now.