RichardBruskiewich closed this 3 years ago
The one hash function currently in use by the Unsecret Agent team is sha1sum. E.g.:
sha1sum ~/Downloads/semmed.data.zip
e7276d5afac1d13b2909a05618ca14fc07f88c95 semmed.data.zip
What's entailed here is computing the sha1sum of the file in the client browser before the upload, uploading the hash along with the file, and then having the server recompute the hash on the uploaded data and check that it matches. The resulting sha1sum "file" should be added to the KGE file archive.
Any archive created on the server side for downloading would also have a sha1sum computed and made available for separate download by the UI (and/or CLI and/or program library).
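To make the proposal concrete, here is a rough Python sketch of the server-side half: recompute the SHA-1 of an uploaded file, compare it with the hash the client sent, and write a sidecar file in the same layout that sha1sum prints. The function names and the `.sha1` suffix are assumptions for illustration only, not anything that exists in the KGE code base.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_upload(path, claimed_sha1):
    """Recompute the hash server-side and compare it with the client's claim."""
    return sha1_of_file(path) == claimed_sha1.strip().lower()


def write_sha1_file(path):
    """Write a sidecar '<name>.sha1' file (hypothetical naming convention)."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".sha1")
    # Same "<digest>  <filename>" layout that the sha1sum command prints.
    sidecar.write_text(f"{sha1_of_file(path)}  {path.name}\n")
    return sidecar
```

A sidecar written this way could then be checked on the downloading side with `sha1sum -c`, run from the directory that holds the archive.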
Hi @jeffhhk, I'd just like to clarify something. In your mind, does semmed.data.zip include both the nodes and the edges you use in your reasoner? In other words, with the hash, are you tracking the uniqueness of the knowledge graph as a whole?
@RichardBruskiewich
compute the sha1sum on the file in the client browser before the upload
Hash before the upload? What would be the benefit? Hashing before upload would compound the significant performance problems in the upload implementation. It would also close off the possibility of labeling the upload with extra information.
@kbruskiewicz Great question. The purpose of the sha1 hash is to track the identity of a particular incarnation of a particular knowledge graph. Thus, if we observe an artifact with a certain sha1 in our system, and we see the same sha1 in KGE, then we know (with high probability) that the two are identical and that we do not have to download or reprocess said artifact.
The only thing our system knows how to process is a whole knowledge graph. We do not have a use case for processing one file of a File Set.
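As a rough illustration of that use case, the check could look something like the Python sketch below. The function name and the dictionary of advertised hashes are hypothetical; how KGE would actually expose the sha1 values is not settled in this thread.

```python
def select_new_artifacts(advertised, already_processed):
    """Return the names of artifacts whose SHA-1 has not been seen locally.

    advertised: dict mapping artifact name to hex SHA-1 (hypothetical shape).
    already_processed: iterable of hex SHA-1 strings for artifacts on hand.
    """
    # Normalize case so the comparison does not depend on hex formatting.
    seen = {sha1.lower() for sha1 in already_processed}
    return [name for name, sha1 in advertised.items() if sha1.lower() not in seen]


# The first hash is the semmed.data.zip example from this issue; the second
# entry is a made-up placeholder, not a real artifact.
advertised = {
    "semmed.data.zip": "e7276d5afac1d13b2909a05618ca14fc07f88c95",
    "other.data.zip": "0000000000000000000000000000000000000000",
}
local = ["E7276D5AFAC1D13B2909A05618CA14FC07F88C95"]
to_download = select_new_artifacts(advertised, local)
```

Only artifacts whose hash is not already known locally would then need to be downloaded and reprocessed.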
Richard is referring to some spit-balling we did when we were first thinking through the issue.
I asked about handling the archive versus handling the files in the archive because it affects my implementation strategy: our current plan is to hash server-side, at the point where an archive is generated. I will continue any broader thoughts in #45.
Done!
... generated in the post-processing step after data set uploads