NCATSTranslator / Knowledge_Graph_Exchange_Registry

The Biomedical Data Translator Consortium site for development of Knowledge Graph Exchange Standards and Registry
MIT License

KGE files / archives should have md5 and/or sha256 hashes generated and available for download #35

Closed RichardBruskiewich closed 3 years ago

RichardBruskiewich commented 3 years ago

... generated in the post-processing step after data set uploads
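The issue title asks for md5 and/or sha256; both have coreutils counterparts to the `sha1sum` command used elsewhere in this thread. A minimal sketch of generating the downloadable hash files in a post-processing step (the filename is a hypothetical stand-in):

```shell
# Stand-in for an uploaded KGE file; the filename is hypothetical.
printf 'example payload' > semmed.data.zip

# Generate the hash files to be made available for download.
md5sum    semmed.data.zip > semmed.data.zip.md5
sha256sum semmed.data.zip > semmed.data.zip.sha256

# A downloader can later verify the file against either digest.
md5sum -c semmed.data.zip.md5        # prints "semmed.data.zip: OK"
sha256sum -c semmed.data.zip.sha256  # prints "semmed.data.zip: OK"
```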

jeffhhk commented 3 years ago

The one hash function currently in use by the Unsecret Agent team is sha1sum. E.g.:

sha1sum ~/Downloads/semmed.data.zip
e7276d5afac1d13b2909a05618ca14fc07f88c95  semmed.data.zip
RichardBruskiewich commented 3 years ago

What's entailed here is to compute the sha1sum on the file in the client browser before the upload, upload the hash along with the file, then have the server recheck the uploaded data against it. The sha1sum "file" should be added to the KGE file archive.

Any archive created on the server side, for downloading, would also have a sha1sum computed and available for independent downloading by the UI (and/or CLI and/or program library).
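The recheck flow described above can be sketched with coreutils alone; the client-side half would in practice run in the browser, and the filenames here are hypothetical:

```shell
# Stand-in for the file being uploaded; filename is hypothetical.
printf 'example payload' > upload.zip

# Client side: hash before upload and send the digest with the file.
client_hash=$(sha1sum upload.zip | cut -d' ' -f1)

# Server side: recompute on the received bytes and compare.
server_hash=$(sha1sum upload.zip | cut -d' ' -f1)
if [ "$client_hash" = "$server_hash" ]; then
  echo "upload verified"
  # The digest is then stored as a sha1sum "file" in the KGE archive
  # and exposed for independent download.
  sha1sum upload.zip > upload.zip.sha1
fi
```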

kennethbruskiewicz commented 3 years ago

The one hash function currently in use by the Unsecret Agent team is sha1sum. E.g.:

sha1sum ~/Downloads/semmed.data.zip
e7276d5afac1d13b2909a05618ca14fc07f88c95  semmed.data.zip

Hi @jeffhhk, I'd just like to clarify something. In your mind, does semmed.data.zip include both the nodes and the edges you use in your reasoner? In other words, with the hash, are you tracking the uniqueness of the knowledge graph as a whole?

jeffhhk commented 3 years ago

@RichardBruskiewich

compute the sha1sum on the file in the client browser before the upload

Hash before the upload? What would be the benefit? Hashing before upload would compound the significant performance problems in the upload implementation. It would also close off the possibility of labeling the upload with extra information.

jeffhhk commented 3 years ago

@kbruskiewicz Great question. The purpose of the sha1 hash is to track the identity of a particular incarnation of a particular knowledge graph. Thus, if we observe an artifact with a certain sha1 in our system, and we see the same sha1 in KGE, then we can know (with high probability) that we do not have to download or reprocess said artifact.

The only thing our system knows how to process is a whole knowledge graph. We do not have a use case for processing one file of a File Set.
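The identity check described above amounts to comparing a locally computed digest against the one published by KGE; a minimal sketch, with hypothetical filenames standing in for the local artifact and the KGE-published hash:

```shell
# Stand-in for a locally held knowledge graph artifact.
printf 'example payload' > semmed.data.zip

# Stand-in for the hash published by KGE for the same artifact.
sha1sum semmed.data.zip | cut -d' ' -f1 > kge_published.sha1

# If the digests match, the artifact need not be downloaded or reprocessed.
local_hash=$(sha1sum semmed.data.zip | cut -d' ' -f1)
remote_hash=$(cat kge_published.sha1)
if [ "$local_hash" = "$remote_hash" ]; then
  echo "same artifact; skipping reprocess"
fi
```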

kennethbruskiewicz commented 3 years ago

@RichardBruskiewich

compute the sha1sum on the file in the client browser before the upload

Hash before the upload? What would be the benefit? Hashing before upload would compound the significant performance problems in the upload implementation. It would also close off the possibility of labeling the upload with extra information.

Richard is referring to some spit-balling we did when we were first thinking through the issue.

I asked this question about handling the archive versus handling individual files in the archive because it affects my implementation strategy: our current plan is to hash server-side, at the point where an archive is generated. I will continue any broader thoughts in #45.
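The server-side approach can be sketched as follows: when the archive is generated, emit a companion .sha1 file next to it for independent download. Filenames are hypothetical:

```shell
# Stand-in node and edge files for a KGE File Set.
printf 'id\tname\n' > nodes.tsv
printf 'subject\tobject\n' > edges.tsv

# Generate the archive, then hash it at the same point in the pipeline.
tar czf kge_fileset.tar.gz nodes.tsv edges.tsv
sha1sum kge_fileset.tar.gz > kge_fileset.tar.gz.sha1

# A downloader verifies the archive against the companion file.
sha1sum -c kge_fileset.tar.gz.sha1   # prints "kge_fileset.tar.gz: OK"
```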

RichardBruskiewich commented 3 years ago

Done!