[not-a-bug] third-party hash database

vitzli commented 8 years ago

I've been told that your hash-archive project may have some use for my little database: I hashed a bunch of semi-random sources and put them into purplehash-datasource. My main interest is the mapping between (crc32,md5,sha1,sha256,…). The file format has been pretty stable over time, but the checksums included may vary (some files have whirlpool+ripe-md and some use tth+aich; currently I'm using the tth+aich option).

Hashing is done with rhash, because it is pretty fast and supports custom output templates.
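
For readers without rhash at hand, here is a minimal sketch of computing the same kind of (crc32, md5, sha1, sha256) tuple in one pass. It uses Python's standard hashlib and zlib rather than rhash, so it says nothing about rhash's own templates; the function name hash_stream and the dictionary layout are assumptions made for illustration.

```python
import hashlib
import zlib


def hash_stream(stream, chunk_size=1 << 20):
    """Read a binary stream once and return crc32/md5/sha1/sha256 as hex strings.

    hash_stream is a hypothetical helper, not part of rhash or purplehash.
    """
    crc = 0
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # zlib.crc32 accepts the running value so the CRC can be updated per chunk
        crc = zlib.crc32(chunk, crc)
        for digest in digests.values():
            digest.update(chunk)
    result = {"crc32": format(crc & 0xFFFFFFFF, "08x")}
    result.update({name: d.hexdigest() for name, d in digests.items()})
    return result
```

Usage would look like `with open(path, "rb") as f: hash_stream(f)`; whirlpool, tth and aich are not covered here because hashlib does not provide them by default.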

btrask commented 8 years ago

Hey @vitzli, thanks for the suggestion!

Right now, hash-archive only maps hashes to URLs. Is it possible to determine URLs for the data in purplehash-datasource? That would be useful for allowing independent verification too.

I want to expand the database to allow mappings to other sources (like BitTorrent), but mapping to plain file names seems less useful to me. But it could work for files that are well-known by a given name (e.g. Debian ISOs).

Currently the hash-archive database schema is poorly designed for importing other datasets. I will work on making it more flexible and try to figure out storing provenance info (https://github.com/btrask/hash-archive/issues/5).

Do you have any suggestions for how to make importing data as easy as possible?

Thanks again!

vitzli commented 8 years ago

There is no information about the source or origin in purplehash-datasource right now. I can write a manifest for each of them, but here is a problem: almost all files (except Debian and Tails, I think) store hashes that were derived from the original source. For example, gutenberg-isos.7z stores hashes of the files that were on the Project Gutenberg .isos, and I don't know their proper URLs; I could only find the URL for the original iso file. Same for HackedTeam: it was a torrent file and there is no http source for any of those files, although there are at least two mirrors of it: archive.org and ht.transparencytoolkit.org.

I can't suggest anything since your use case is different, but I can describe mine. I use a (source, item, file)→hash_tuple mapping, where source is the entity which holds the object (and it could also provide third-party hashes), item is the storage unit (HDD, CD, network path) and file is the path and filename within the item. The hash is (crc32, md5, sha1, sha256, …), with md5 being the minimal requirement, although md5 alone may not be unique. The primary key is an integer index, with (md5, sha1_is_null, sha256_is_null) and (md5, sha1, sha256_is_null) as unique indexes; sha256 also has to be unique within the hash table if it is present. This makes it possible to keep independent lists of files that only have md5, files with an (md5, sha1) pair, and full hash records (sha256, crc32 and everything else). While this system may look insane, it can store records from NSRL and similar repositories, which don't do sha256 checksums and don't allow access to their file storage.
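
One way to read the uniqueness rules above is as partial unique indexes. The sketch below is an assumption about how that could look, not the actual purplehash schema: it uses an in-memory SQLite database as a stand-in for postgres, keeps only three hash columns, and the index names and the add_hash helper are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE hbase (
    hash_id INTEGER PRIMARY KEY,   -- integer index used as primary key
    md5     TEXT NOT NULL,         -- minimal requirement
    sha1    TEXT,
    sha256  TEXT
);
-- md5-only records: md5 must be unique among rows with no sha1 and no sha256
CREATE UNIQUE INDEX hbase_md5_only ON hbase (md5)
    WHERE sha1 IS NULL AND sha256 IS NULL;
-- (md5, sha1) records: the pair must be unique among rows with no sha256
CREATE UNIQUE INDEX hbase_md5_sha1 ON hbase (md5, sha1)
    WHERE sha1 IS NOT NULL AND sha256 IS NULL;
-- full records: sha256 must be unique whenever it is present
CREATE UNIQUE INDEX hbase_sha256 ON hbase (sha256)
    WHERE sha256 IS NOT NULL;
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_hash(conn, md5, sha1=None, sha256=None):
    """Insert a hash record; return False if a unique index rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO hbase (md5, sha1, sha256) VALUES (?, ?, ?)",
                (md5, sha1, sha256),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this layout an md5-only row (as from NSRL-style lists) and a fuller row with the same md5 can coexist, while exact duplicates within each tier are rejected.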

I have a schema for it, but there is no pretty chart, just a very WIP sql dump from postgres. To solve the derived object problem from above it maps, or at least should map, (source, item, file)→object, but it makes no attempt to account for mirrors; they are just different sources/items that happen to have the same file on them. Source, for that matter, is the place where the files are stored and the party that controls them; it could be Internet Archive, me or somebody else. Item would be an HDD, USB stick, DVD disk or path in a filesystem. File uses the path within the item: if the item is an HDD mounted at /mnt/externalhdd/ then filepath should be anything below it. For example, for /mnt/externalhdd/linux_isos/linux-dvd-1.iso the filepath would be linux_isos/linux-dvd-1.iso and the filename would be linux-dvd-1.iso.
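
The item/filepath/filename split from the example can be sketched with pathlib. The function name split_file_location is hypothetical; only the paths come from the example above.

```python
from pathlib import PurePosixPath


def split_file_location(item_root, absolute_path):
    """Return (filepath, filename) for a file stored below an item's root.

    Raises ValueError if absolute_path is not located under item_root.
    """
    root = PurePosixPath(item_root)
    path = PurePosixPath(absolute_path)
    filepath = path.relative_to(root)  # path of the file within the item
    return str(filepath), filepath.name


filepath, filename = split_file_location(
    "/mnt/externalhdd/",
    "/mnt/externalhdd/linux_isos/linux-dvd-1.iso",
)
# filepath is "linux_isos/linux-dvd-1.iso", filename is "linux-dvd-1.iso"
```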

As for URIs, I planned to keep them inside jdata jsonb fields in the items and files tables. This way I can map a BitTorrent infohash and an .iso image URL to the items record as a list and assume that the path in files corresponds to the path inside the item.
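
A rough sketch of what such a jdata value might hold, using plain JSON text as a stand-in for a jsonb column. The "uris" key, the placeholder infohash of forty zeros, the example.com URL and the locate helper are all assumptions for illustration, not the actual layout.

```python
import json

# Hypothetical jdata for one items record: every URI that refers to the item.
item_jdata = json.dumps({
    "uris": [
        "magnet:?xt=urn:btih:" + "0" * 40,           # placeholder infohash
        "http://example.com/images/linux-dvd-1.iso",  # placeholder image URL
    ]
})


def locate(item_jdata_json, filepath):
    """Pair each URI of the item with the file's path inside that item."""
    jdata = json.loads(item_jdata_json)
    return [(uri, filepath) for uri in jdata.get("uris", [])]
```

Each returned pair says "this file lives at filepath inside whatever that URI resolves to", which matches the assumption that the path in files mirrors the path inside the item.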

My idea for dealing with third-party hashes was to use a trust system similar to gpg, with the following codes in mind:

0 - do not trust: cannot update the hash table hbase, but can map hashes to the (source, item, file) records and maybe add new records to source/item tables;
1 - don't know: can add new (md5,sha1,sha256) records, but can't update any records in the hash table; can add tags/URIs to files/items, but can't update them;
2 - low: sources that have some reputation, e.g. hashes from Internet Archive or National Software Reference Library (check out NSRL, they have md5/sha1 and sha1→sha256 mappings!);
3 - medium: direct import of GPG-signed images/files without hashing (CentOS, Debian, Tails); can update fields set by level-2 sources; can update self;
4 - high: files I hashed myself; can update level-3 records; can update self.

With such a system I can update records based on the trust I have in the source (i.e. a high-trust source can update the fields that a low-trust or medium-trust source wrote to the database). Under this system my purplehashdb should be either a level-0 or a level-1 source for anyone else.
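
The trust codes can be sketched as a small permission check. This is one possible reading, with assumptions stated loudly: the thread does not say whether level 2 may update anything, so it is treated as "cannot update", and "can update lower levels" is generalized to every strictly lower level. The names Trust, can_add_hashes and can_update are hypothetical.

```python
from enum import IntEnum


class Trust(IntEnum):
    DO_NOT_TRUST = 0
    UNKNOWN = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4


def can_add_hashes(actor):
    """Level 0 may only map hashes to records; level 1 and up may add hash records."""
    return actor >= Trust.UNKNOWN


def can_update(actor, owner, same_source=False):
    """Decide whether a source at level `actor` may update a record written at level `owner`.

    Assumption: levels 0-2 never update; levels 3-4 may update their own
    records and records written by strictly lower-trust sources.
    """
    if actor < Trust.MEDIUM:
        return False
    return same_source or owner < actor
```

For example, `can_update(Trust.HIGH, Trust.MEDIUM)` is allowed while `can_update(Trust.MEDIUM, Trust.HIGH)` is not.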

The hashedby_id field in the hbase table is a foreign key to source_id: it is the source that performed the hashing and is the point where the trust exists. The trust level is stored in the source_trust field in the sources table. Mapping to source_id could be very wrong and I have a feeling that I'm missing something important.

As I'm writing this I think it will be wrong at least in a multi-user environment where different people can update or add hashes. It also does not account for priority, as some sources may be preferable to others; 0…10 ratings (0-1-2…7-8,9-10) could be better.

In my schema/database I made several mistakes I know of:

user_id is really an agent who performed the import rather than an unrestricted, potentially rogue user from the Internet who could insert bad data, so it should be 'agent_id' in my case (i.e. 'cli_user@host1', 'webgui@host2', etc.). It should have the full trust of the system, but it may import from a source that could be wrong. When adding a new item the agent has to either explicitly state the trust level for the new source or implicitly pull the trust level from the existing source.
The table source and related fields should probably use the term authority, as it better describes the behavior.

btrask commented 8 years ago

Thanks for the info!

Coming up with addresses for files within archives/containers (like 7z or ISO) is a constant problem I've been running into almost as long as I've been programming. @JesseWeinstein and I have a nice solution for WARCs, since resources inside a web archive have their own associated URLs. However, for other types of archives, the most you have is file paths which are inherently context-sensitive.

It seems like in at least some cases, the file name alone is "well known" (widely understood to represent the same file). However, even for those names, there is no standard way (ideally, a URL protocol) to represent them. Perhaps file:///name or file:///archive.zip/name is good enough.

There are also URL fragment identifiers (http://example.com/archive.zip#file.txt), but that use is non-standard and could conflict with existing uses.
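
To make the fragment idea concrete, here is a small sketch of building and splitting such addresses with Python's urllib.parse. It illustrates the non-standard convention discussed above and is not something hash-archive is known to implement; member_url and split_member_url are hypothetical helpers.

```python
from urllib.parse import quote, unquote, urldefrag


def member_url(archive_url, member_path):
    """Address a file inside an archive as archive_url#member_path."""
    # Percent-encode the member path so spaces or '#' in it cannot break the URL
    return archive_url + "#" + quote(member_path, safe="/")


def split_member_url(url):
    """Split archive_url#member_path back into (archive_url, member_path)."""
    archive_url, fragment = urldefrag(url)
    return archive_url, unquote(fragment)
```

The conflict mentioned above is visible here: if the archive URL already carries a fragment of its own, the member path cannot be told apart from it.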

btrask / hash-archive-js
