Closed scolapasta closed 4 years ago
We should be able to use the code from the original PSI branch as a starting point; the key is to generalize it while keeping it working for the OpenDP project.
Architecture-wise, we will almost certainly need another table to track these files (we already track original file info in the datatable table, but maybe that can eventually be moved to this new table).
Considerations:
• Access - does this file have the same access as the original or different access, and how should that be handled?
Edit: The retrieval API needs extra access authorization logic. It should default to the access rules defined for the main datafile (extra metadata/summary stats associated with sensitive/restricted datafiles should be restricted/private by default!), but provide a mechanism for marking some of these fragments as safe. The whole point of diff. privacy is to make some summary stats available even if the contents of the datafile are completely private. Since we can have multiple versions of diff. private summary stats associated with the same Datafile, potentially only some of them may need to be public. (L.A.)
• UI - does there need to be a UI to download this? (Some of the auxiliary files are meant for the end user; others are not.)
• Provenance - where did this auxiliary file come from? We may also need to know this functionally, to disable re-config in some cases (such as OpenDP).
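The access rule described in the Edit above (inherit the main datafile's rule by default, with an explicit override for fragments marked safe) could be sketched roughly as follows. This is only an illustration: the class, field, and method names are hypothetical and do not come from the Dataverse code base.

```java
// Hypothetical sketch: none of these names exist in Dataverse.
final class AuxFileAccess {

    /** Minimal stand-in for an auxiliary file attached to a datafile. */
    static final class AuxFile {
        final boolean parentRestricted; // is the main datafile restricted?
        final boolean markedPublic;     // explicitly flagged as safe to release

        AuxFile(boolean parentRestricted, boolean markedPublic) {
            this.parentRestricted = parentRestricted;
            this.markedPublic = markedPublic;
        }
    }

    /**
     * Returns true if the requester may download the auxiliary file.
     * Default: inherit the parent's rule. Override: an explicit public flag.
     */
    static boolean canDownload(AuxFile aux, boolean userMayAccessParent) {
        if (aux.markedPublic) {
            return true;            // e.g. a released diff. private summary
        }
        if (!aux.parentRestricted) {
            return true;            // parent is open, so the fragment is too
        }
        return userMayAccessParent; // restricted parent: same check as the file
    }
}
```

Keeping the flag per auxiliary file rather than per datafile would allow only some of several diff. private versions to be public.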
I meant to add some pointers to the old demo implementation of the APIs, which can either be copied and pasted from the old branch as a starting framework or just used as an example.
The API for uploading and saving the diff. private metadata fragment, as used for the PSI demo:
@Path("datafile/{fileId}/metadata/preprocessed")
@POST
public Response saveTabularDataSummary(@PathParam("fileId") Long fileId, @FormDataParam("metadata") String jsonIn, @FormDataParam("diffPrivate") Boolean differentiallyPrivate, @FormDataParam("formatVersion") String formatVersion) {
Once the diff. private metadata fragment was uploaded for a tabular datafile, it could be retrieved using an (existing) GET version of the API above (in the same class file):
@Path("datafile/{fileId}/metadata/preprocessed")
@GET
@Produces({"application/json"})
(to which an extra boolean flag "diffPrivate=" was added).
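For illustration, a client might build the retrieval URL for this endpoint along these lines. This is a sketch under assumptions: the base URL is hypothetical (the root path of the resource class is not shown above), and passing "diffPrivate" as a query parameter is a guess based on the "diffPrivate=" notation.

```java
import java.net.URI;

// Hypothetical helper; the base URL and the query-parameter form are assumptions.
final class PreprocessedMetadataUrls {

    /** Builds the GET URL for the preprocessed metadata of one datafile. */
    static URI retrievalUri(String apiBase, long fileId, boolean diffPrivate) {
        // Normalize the base so exactly one slash separates it from the path.
        String base = apiBase.endsWith("/") ? apiBase : apiBase + "/";
        return URI.create(base + "datafile/" + fileId
                + "/metadata/preprocessed?diffPrivate=" + diffPrivate);
    }
}
```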
FWIW: One useful thing that could be addressed in a redesign (this issue or separately) would be to keep the checksum for aux files. If I understand correctly, the checksum for original files (for ingest-able types) is not kept now (because they are aux files) and it would definitely be useful to be able to verify that Dataverse's copy matches the original source. I'm guessing it would be useful in checking other derived/openDP files as well (does my download match what Dataverse has?).
There is more info about the OpenDP requirements here: #7158
@qqmyers Agree that it makes sense to record and store the checksums of aux files. However:
> If I understand correctly, the checksum for original files (for ingest-able types) is not kept now (because they are aux files)
We do actually keep the checksum for original files. As a somewhat counter-intuitive special case: for an ingested tabular file, the checksumvalue in the DataFile table IS the checksum of the saved original. The generated tabular file itself thus does not have the checksum saved in the database (under the assumption that the UNF makes it redundant). We may want to revisit this arrangement of course. (just noticed this today)
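To make the "does my download match what Dataverse has?" check concrete, here is a minimal sketch of computing and comparing a checksum for an auxiliary file. SHA-256 is an assumption chosen for the example, not necessarily the algorithm a given installation uses, and the class and method names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; the algorithm choice (SHA-256) is an assumption.
final class AuxChecksum {

    /** Returns the lowercase hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Compares a downloaded copy against the checksum recorded at deposit time. */
    static boolean matches(byte[] downloaded, String storedChecksum) {
        return sha256Hex(downloaded).equalsIgnoreCase(storedChecksum);
    }
}
```

Recording the value at deposit time and returning it with the file would let a client run the same comparison locally.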
The use case, as relevant to the OpenDP implementation: Summary stats with added differential privacy "noise" will be generated by the OpenDP software for a specific Datafile, outside of Dataverse; then deposited into Dataverse, for later retrieval. This issue is for implementing the APIs, for the deposit and later retrieval of these fragments. (There will be multiple physical files in different formats - xml, json; and possibly multiple versions of such diff private metadata sets for a single datafile. On the implementation layer all these individual fragments will be saved as "auxiliary files" using the standard Dataverse StorageIO system).
More generally, this is an API for accepting, storing and serving some metadata that Dataverse cannot produce itself (unlike, for example, image thumbnails or DDI XML describing the data variables that Dataverse knows how to generate from Datafiles). So it must be deposited by an external application before Dataverse can serve it.
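The deposit-then-retrieve contract described above, with several formats and versions per datafile, might look roughly like this in-memory sketch. All names here are hypothetical, and a real implementation would persist the content through StorageIO and track it in a database table rather than a map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for an auxiliary file store.
final class AuxFragmentStore {

    // Key: datafile id + format + format version; value: the deposited content.
    private final Map<String, String> fragments = new HashMap<>();

    private static String key(long fileId, String format, String formatVersion) {
        return fileId + "/" + format + "/" + formatVersion;
    }

    /** Called by the external application; overwrites any earlier deposit with the same key. */
    void deposit(long fileId, String format, String formatVersion, String content) {
        fragments.put(key(fileId, format, formatVersion), content);
    }

    /** Returns the fragment if one was deposited; the store cannot generate it itself. */
    Optional<String> retrieve(long fileId, String format, String formatVersion) {
        return Optional.ofNullable(fragments.get(key(fileId, format, formatVersion)));
    }
}
```

An empty result on retrieval reflects the point made above: nothing can be served until an external application has deposited it.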
There is no direct UI impact. The deposit happens automatically: the remote OpenDP application performs it without the human user being directly involved. (The user does not need to know anything about this API.) But once the diff. private metadata has been deposited for a Datafile, Dataverse will show extra options (on the dataset and/or file pages), most likely an extra explore option. In the old PSI demo implementation the user would also see an option to download the deposited diff. private metadata (in JSON format) in the normal download pulldown menu. (I'm assuming this issue is just for the API, and that any such UI logic will be handled separately.) Dataverse does NOT do anything with the actual diff. private metadata deposited (at least as currently defined)! It only knows how to store it for later retrieval and how to serve it on demand, and it knows that certain extra things can be done by outside applications for datafiles that have diff. private metadata saved. (For example, Dataverse may know to add an extra redirect link to a diff. private data explore viewer for such a datafile, but all the "exploring" etc. will happen outside the Dataverse application.)
(end use case description -- L.A.)
In planning how to handle the auxiliary files for upload and considering other possible future tools (e.g. a mapping tool or a time series graphing tool) that would require auxiliary metadata, it seems clear that designing a flexible system to upload auxiliary files would be useful.
Note that we already have two types of auxiliary files: those that can be recreated solely by Dataverse and those that cannot. An example of the first would be thumbnails for images. An example of the second is the original file on a tabular download.
This issue concerns only the latter, as the files that config tools would deposit are not recreate-able (without the use of the tool).