Citation: Remove MD5s, if you have UNF

sbarbosadataverse commented 9 years ago

Gary's sent the following:

why are there MD5's? these I think should all be removed. we have UNFs instead.

pdurbin commented 9 years ago

MD5s are commonly used to verify that files were not corrupted during download. Every Mac and Linux box has the native ability to calculate an MD5 of a file. For Windows it's a supported addon: https://support.microsoft.com/en-us/kb/841290
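
For anyone who wants to see what that check looks like in practice, here is a minimal sketch in Python using the standard `hashlib` module. The function names `md5_of_file` and `matches_published_md5` are made up for illustration and are not part of Dataverse.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_md5):
    """Compare the local file's MD5 with the value shown by the repository."""
    return md5_of_file(path) == published_md5.strip().lower()
```

A mismatch only says the downloaded bytes differ from the deposited bytes; it does not say which copy is the damaged one.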

mercecrosas commented 9 years ago

This issue is not well defined as it is. @thegaryking does this question apply to subsettable Tabular files? why is it bad to have both UNF and MD5? as @pdurbin comments, MD5s are the checksums most commonly used by preservation groups and libraries, and can be useful in addition to UNF to verify the original deposited file.

thegaryking commented 9 years ago

we're first and foremost trying to communicate with users, almost none of which know about either md5 or unfs. we have taken them another step and told them about unfs; we have a page describing them, and when we put in enough effort they get what they are. there's no reason to introduce another new concept. let's just use another degree of indirection. if the librarians want something for files for which we don't have unfs, then we create a broader notion of what a unf is and we always have a unf for every file. the plan would be quite like the idea of dv to begin with, which is that the more it knows about the files which are uploaded, the more services we provide. for unfs, we can do the same thing: if the file is in a format we know what to do with (R, sas, spss, table, etc.) we can compute a format-independent UNF. If it is anything else we can create a format-dependent UNF, which if you want would be exactly a MD5, but would be displayed as a UNF. we can also add another service if librarians want, somewhere far out of the way of most users, that lets people type in a unf and have dv tell them exactly what it is and how it was calculated, including the full algorithm, an MD5 if it is in there, and anything else.

then from a UI/user understanding point of view, there will be only one thing to understand; they can ignore the details and trust us if they want; they can get the details if they like; and we can continue to innovate what a UNF is since there's a version number embedded in it (and i agree that we should do the latter; people tell me that videos and photos wouldn't be hard and we could clearly expand to more forms of data. this, however, is a separate project that we could perhaps seek funding for and do then).

Gary

Gary King - Albert J. Weatherhead III University Professor - Director, IQSS http://iq.harvard.edu/- Harvard University GaryKing.org - King@Harvard.edu - @KingGary https://twitter.com/kinggary - 617-500-7570 - fax 812-8581 - Assistant king-assist@iq.harvard.edu: 495-9271

mercecrosas commented 9 years ago

Yes, I agree with the general approach, but we need to do some research to implement this well. Here are the three main issues:

1. If we only provide a UNF for tabular files (spss, stata, r, etc), and not an MD5 for the original format, then we don't have a way to verify the original deposited file, which is important in some cases when we need to recalculate the UNF or there are some issues or uncertainties with reformatting, or simply for standard archival verifications that repositories should do. This is a request from some groups that want to make sure we support preservation good practices. There are some preservation certificates that Dataverse could not get without this.
2. UNF is a fantastic concept, but it has some practical limitations and issues in the way it is currently defined. Given that each format treats some data types differently (time, binary and categorical variables, rounding), it could turn out that you convert from one format to another, to another and then back to the first format, and end up with different UNFs (this is similar to the phenomenon of Google translating from one language, to another, to another, back to the first and ending up with a different sentence). This has been improved considerably from the initial UNF version, but it's practically very difficult to include all the exceptions.
3. But one of the main issues with UNF is that it doesn't include the metadata of the file, that is the variable name, for example, or data type. This means that you might have a spss file with column A that corresponds to var1, but this was not correct and for some reason needs to be changed to var2, and this is not reflected in the UNF, while this is a critical change in the data file.
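
To make the variable-name point concrete, here is a toy sketch in Python. It is NOT the UNF algorithm: `toy_content_fingerprint` is a hypothetical stand-in that hashes only the values in each column, which is enough to show why renaming a variable leaves a values-only fingerprint unchanged.

```python
import hashlib


def toy_content_fingerprint(columns):
    """Hash only the values of each column, ignoring the column names.

    `columns` maps variable name -> list of values. This is an illustrative
    stand-in for a values-only fingerprint, not the real UNF calculation.
    """
    digest = hashlib.sha256()
    for values in columns.values():
        for v in values:
            digest.update(repr(v).encode("utf-8"))
            digest.update(b"\x00")  # separator between values
        digest.update(b"\x01")  # separator between columns
    return digest.hexdigest()


original = {"var1": [1, 2, 3]}
renamed = {"var2": [1, 2, 3]}  # same data, corrected variable name

# The rename is invisible to a fingerprint computed from values alone.
assert toy_content_fingerprint(original) == toy_content_fingerprint(renamed)
```

An MD5 of the original deposited file would change in this situation, since the variable name is part of the file's bytes.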

I agree it would be great, as you say, to generalize UNF and make it work well across all cases, so we use and teach only one thing. But we need to take these three issues (and I might be missing others that Leonid, Kevin and others in the team might know) into consideration.

Mercè Crosas, Ph.D. Director of Data Science, IQSS Harvard University http://scholar.harvard.edu/mercecrosas

posixeleni commented 9 years ago

I agree with @mcrosas that the MD5 checksum should exist for every file to ensure bit-level preservation. When I presented a preview of Dataverse 4.0 to the Library of Congress' National Digital Stewardship Alliance in the Fall they were particularly impressed that we included MD5s for all our files. Here's a blog post from them discussing the importance of file fixity/data integrity: http://blogs.loc.gov/digitalpreservation/2014/04/protect-your-data-file-fixity-and-data-integrity/

MD5 is a standard that the digital archival community trusts whereas UNF was unknown to them. I don't think we should replace MD5s for all files with UNFs if the community isn't using them outside of Dataverse.

thegaryking commented 9 years ago

ok, but let's get MD5s out of the file list now. we can stick it in the metadata when we expand the long list, as an unchangeable item if someone wants it.

and separately, let's create a google doc or something with specifications of a UNF that would satisfy everyone. we can even create the specs and either get the grant ourselves or have a call for PIs to take on this task, perhaps with us. we know pretty much everything we want, and all the problems with the current UNF. we just need some bandwidth (or someone else) to implement it all.

Gary

pdurbin commented 9 years ago

Maybe we should remove both UNFs and MD5s from the default listing for files. They add a lot of noise.

I just clicked on a random dataset and saw this for a PowerPoint file:

[screenshot: documentation_and_metadata_-_training_materials_dataverse_-_2015-05-24_10 00 24]

Isn't this a little... noisy... busy... unfriendly?

Who cares that its MD5 is 26a3bb59a1d9a837ea51cc9c160c5b1a? (In addition, who cares that its MIME Type is application/vnd.openxmlformats-officedocument.presentationml.presentation?) It's a PowerPoint file! That's all most people need to know. If normal people download it and can't open it they'll throw it in the trash and try again. If they still can't open it, they'll email the dataset contact and say "Hey, I think you uploaded a corrupted PowerPoint file" (which is more likely than the file being corrupted during download). They're not going to calculate the MD5 locally (let alone the UNF) and compare it to the MD5 on the screen. Only geeks like me would even think of doing that. And I probably wouldn't bother. I'd email the dataset contact.

Sure, show that it's 3 megabytes. Show the date it was uploaded. Stuff like MD5 and UNF could be hidden behind a "details" link, perhaps with some definitions of what MD5 and UNF even are.

mercecrosas commented 9 years ago

@pdurbin we should not just remove these from the file cards without the appropriate research and consideration of preservation good practices - it has been an expectation from users and partners to easily find the fixity even if it's not used all the time. But you bring good points, it's worth reviewing if they can be displayed in another place.

I'm assigning this issue to @mheppler and adding it to the "In Design" milestone, following our process. Once a design is proposed and reviewed (by @thegaryking and partners who requested the MD5), we'll move it to a Release Candidate milestone.

To summarize, based on @thegaryking comments above:

This issue is about removing MD5 from the file card of tabular/subsettable files (but keeping UNF) and finding the appropriate place to display the MD5. Files that don't have UNF will display MD5, as is the case now. (As a note, the metadata tab might not be the right place for the MD5 of tabular files since it has dataset metadata but not file level metadata, although we should still consider it. @mheppler I have some ideas about this, for when you are ready to work on it)
For the larger task of generalizing UNF, I'll create a new issue in GitHub and start a Functional Requirements Document, as we do for new features or components, and invite others to review it.
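
The file card rule summarized above can be sketched as follows. This is only an illustration in Python of the intended behavior, not the Dataverse implementation; `fingerprint_for_file_card` and its parameters are hypothetical names.

```python
def fingerprint_for_file_card(unf, md5):
    """Return the (label, value) pair to show on a file card.

    `unf` is the file's UNF string, or None/empty when the file has no UNF
    (for example a non-tabular file). `md5` is the MD5 of the stored file.
    """
    if unf:
        # Tabular/subsettable file with a UNF: show the UNF only.
        return ("UNF", unf)
    # No UNF available: keep showing the MD5, as is the case now.
    return ("MD5", md5)
```

Because the fallback is keyed on whether a UNF exists rather than on the file type, a file without a UNF still gets its MD5 displayed under this sketch.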

landreev commented 9 years ago

@pdurbin - yes, the long Microsoft mime types are terrible. But we have a mechanism for dealing with this - it's just a matter of adding the "friendly" version of it (such as "PowerPoint") to the list we maintain. (it's a .properties file). The friendly types for Excel and Word are already there. PowerPoint was left out, probably because it's not as common.
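
As a rough picture of how such a lookup behaves, here is a small Python sketch. The real list lives in a Java properties file; the dictionary `FRIENDLY_TYPES` and the function `friendly_type` below are hypothetical and only mirror the idea of falling back to the raw MIME type when no friendly label has been added yet.

```python
# Hypothetical in-memory version of the friendly-label list.
FRIENDLY_TYPES = {
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "PowerPoint",
}


def friendly_type(mime_type):
    """Return the friendly label for a MIME type, or the MIME type itself
    when no friendly label is known."""
    return FRIENDLY_TYPES.get(mime_type, mime_type)
```

With the fallback, a type missing from the list shows up as its long MIME string, which matches what happened with the PowerPoint file.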

mheppler commented 9 years ago

Thank you for commenting on that @landreev. I was going to ask you about these "friendly" file types, since I recall going over these with you for the file icons. We should separate out that task of identifying as many of these file types as we can in our current production data, and giving them friendly labels.

landreev commented 9 years ago

@mheppler Yes, we could use a dedicated ticket for creating these "friendly" labels for as many types as possible. The file in question is ./src/main/java/MimeTypeDisplay.properties

mheppler commented 9 years ago

@landreev @pdurbin -- #2202 -- new issue for MIME Type improvements created. Enjoy.

eaquigley commented 9 years ago

Need to discuss this during a UI/UX team meeting to brainstorm ideas on how to show more file metadata without being overwhelming in the file card on the dataset page. Perhaps having a files metadata section in the metadata tab. @mcrosas @mheppler

mercecrosas commented 9 years ago

After reviewing it with @eaquigley and @mheppler we plan to move this to 4.0.3.

eaquigley commented 9 years ago

Have a section in the metadata tab that is "Files" and displays this extra metadata (MD5 shows here and not on the files card if a UNF is available).

mheppler commented 9 years ago

FRD: https://docs.google.com/document/d/1v-6WuFyClnAAHqyMf1VsWtCdXDTTR-ikuG6Ou8RtDMM/edit

Mockups:

sbarbosadataverse commented 9 years ago

I had one question about this--- Is there a safeguard in place to ensure MD5 gets added when tabular ingest fails for any reason? We have enough failures at the moment to cause me to ask. Thanks

Sonia Barbosa Manager of Data Curation, IQSS Dataverse Network Manager of the Murray Research Archive, IQSS Data Science Harvard University

Dataverse 4.0 is now available for use! http://dataverse.harvard.edu

All test dataverses should be created in 4.0 Demo! http://dataverse-demo.iq.harvard.edu/

Join our Dataverse Community! https://groups.google.com/forum/#!forum/dataverse-community

mheppler commented 9 years ago

[x] Removed MD5 from dataverse browse/search card for tabular files when UNF is displayed.
[x] Removed MD5 from dataset list table for tabular files when UNF is displayed.
[x] Removed MD5 from top section of file landing page for tabular files when UNF is displayed.

Note: With the file landing page being pushed to 4.3, this removes the "Original File MD5" for tabular files completely from the UI.

kcondon commented 9 years ago

OK looks good, closing.
