Closed sbarbosadataverse closed 9 years ago
MD5s are commonly used to verify that files were not corrupted during download. Every Mac and Linux box has the native ability to calculate an MD5 of a file. For Windows it's a supported add-on: https://support.microsoft.com/en-us/kb/841290
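As a rough illustration of the check described above, here is a minimal Python sketch that computes a file's MD5 and compares it with a published value. The function names `md5_of_file` and `matches_published_md5` are made up for this example; only `hashlib` from the standard library is used.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_md5):
    """Compare the local digest with a published one, ignoring case and whitespace."""
    return md5_of_file(path) == published_md5.strip().lower()
```

A mismatch only says the local bytes differ from the ones that were hashed; it does not say where the difference was introduced.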
This issue is not well defined as it is. @thegaryking does this question apply to subsettable Tabular files? Why is it bad to have both UNF and MD5? As @pdurbin comments, MD5s are the checksums most commonly used by preservation groups and libraries, and can be useful in addition to UNF to verify the original deposited file.
we're first and foremost trying to communicate with users, almost none of which know about either md5 or unfs. we have taken them another step and told them about unfs; we have a page describing them, and when we put in enough effort they get what they are. there's no reason to introduce another new concept. let's just use another degree of indirection. if the librarians want something for files for which we don't have unfs, then we create a broader notion of what a unf is and we always have a unf for every file. the plan would be quite like the idea of dv to begin with, which is that the more it knows about the files which are uploaded, the more services we provide. for unfs, we can do the same thing: if the file is in a format we know what to do with (R, sas, spss, table, etc.) we can compute a format-independent UNF. If it is anything else we can create a format-dependent UNF, which if you want would be exactly a MD5, but would be displayed as a UNF. we can also add another service if librarians want, somewhere far out of the way of most users, that lets people type in a unf and have dv tell them exactly what it is and how it was calculated, including the full algorithm, an MD5 if it is in there, and anything else.
then from a UI/user understanding point of view, there will be only one thing to understand; they can ignore the details and trust us if they want; they can get the details if they like; and we can continue to innovate what a UNF is since there's a version number embedded in it (and i agree that we should do the latter; people tell me that videos and photos wouldn't be hard and we could clearly expand to more forms of data. this, however, is a separate project that we could perhaps seek funding for and do then).
Gary King - Albert J. Weatherhead III University Professor - Director, IQSS http://iq.harvard.edu/- Harvard University GaryKing.org - King@Harvard.edu - @KingGary https://twitter.com/kinggary - 617-500-7570 - fax 812-8581 - Assistant king-assist@iq.harvard.edu: 495-9271
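The proposal above (one identifier for every file, format-independent where possible, otherwise an MD5 presented as a UNF) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `UNF:md5:` prefix, `identifier_for`, and `describe` are hypothetical names invented for this sketch and are not part of any existing UNF specification or of Dataverse. The real UNF calculation is not reproduced here; the sketch assumes it is computed elsewhere and passed in.

```python
import hashlib

# Hypothetical marker for a "format-dependent" identifier.
# It is NOT part of any real UNF version; it exists only for this sketch.
FORMAT_DEPENDENT_PREFIX = "UNF:md5:"


def identifier_for(file_bytes, format_independent_unf=None):
    """Return one identifier per file.

    If a format-independent UNF was computed elsewhere (for example after a
    successful tabular ingest), use it. Otherwise fall back to an MD5 of the
    raw bytes, wrapped in the hypothetical prefix.
    """
    if format_independent_unf is not None:
        return format_independent_unf
    return FORMAT_DEPENDENT_PREFIX + hashlib.md5(file_bytes).hexdigest()


def describe(identifier):
    """Sketch of the lookup service: explain what an identifier is."""
    if identifier.startswith(FORMAT_DEPENDENT_PREFIX):
        return {
            "kind": "format-dependent",
            "algorithm": "MD5",
            "md5": identifier[len(FORMAT_DEPENDENT_PREFIX):],
        }
    return {"kind": "format-independent", "algorithm": "UNF", "md5": None}
```

With this shape, users see a single kind of identifier, while anyone who needs the plain MD5 can still recover it through `describe`.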
Yes, I agree with the general approach, but we need to do some research to implement this well. Here are the three main issues:
I agree it would be great, as you say, to generalize UNF and make it work well across all cases, so we use and teach only one thing. But we need to take these three issues into consideration (and I might be missing others that Leonid, Kevin and others on the team might know).
Mercè Crosas, Ph.D. Director of Data Science, IQSS Harvard University http://scholar.harvard.edu/mercecrosas
I agree with @mcrosas that the MD5 checksum should exist for every file to ensure bit-level preservation. When I presented a preview of Dataverse 4.0 to the Library of Congress' National Digital Stewardship Alliance in the Fall they were particularly impressed that we included MD5s for all our files. Here's a blog post from them discussing the importance of file fixity/data integrity: http://blogs.loc.gov/digitalpreservation/2014/04/protect-your-data-file-fixity-and-data-integrity/
MD5 is a standard that the digital archival community trusts whereas UNF was unknown to them. I don't think we should replace MD5s for all files with UNFs if the community isn't using them outside of Dataverse.
ok, but let's get MD5s out of the file list now. we can stick it in the metadata when we expand the long list, as an unchangeable item if someone wants it.
and separately, let's create a google doc or something with specifications of a UNF that would satisfy everyone. we can even create the specs and either get the grant ourselves or have a call for PIs to take on this task, perhaps with us. we know pretty much everything we want, and all the problems with the current UNF. we just need some bandwidth (or someone else) to implement it all.
Maybe we should remove both UNFs and MD5s from the default listing for files. They add a lot of noise.
I just clicked on a random dataset and saw this for a PowerPoint file:
Isn't this a little... noisy... busy... unfriendly?
Who cares that its MD5 is 26a3bb59a1d9a837ea51cc9c160c5b1a? (In addition, who cares that its MIME Type is application/vnd.openxmlformats-officedocument.presentationml.presentation?) It's a PowerPoint file! That's all most people need to know. If normal people download it and can't open it they'll throw it in the trash and try again. If they still can't open it, they'll email the dataset contact and say "Hey, I think you uploaded a corrupted PowerPoint file" (which is more likely than the file being corrupted during download). They're not going to calculate the MD5 locally (let alone the UNF) and compare it to the MD5 on the screen. Only geeks like me would even think of doing that. And I probably wouldn't bother. I'd email the dataset contact.
Sure, show that it's 3 megabytes. Show the date it was uploaded. Stuff like MD5 and UNF could be hidden behind a "details" link, perhaps with some definitions of what MD5 and UNF even are.
@pdurbin we should not just remove these from the file cards without the appropriate research and consideration of good preservation practices. Users and partners expect to find the fixity information easily, even if it's not used all the time. But you raise good points; it's worth reviewing whether they can be displayed in another place.
I'm assigning this issue to @mheppler and adding it to the "In Design" milestone, following our process. Once a design is proposed and reviewed (by @thegaryking and the partners who requested the MD5), we'll move it to a Release Candidate milestone.
To summarize, based on @thegaryking comments above:
@pdurbin - yes, the long Microsoft mime types are terrible. But we have a mechanism for dealing with this - it's just a matter of adding the "friendly" version of it (such as "PowerPoint") to the list we maintain. (it's a .properties file). The friendly types for Excel and Word are already there. PowerPoint was left out, probably because it's not as common.
Thank you for commenting on that @landreev. I was going to ask you about these "friendly" file types, since I recall going over these with you for the file icons. We should separate out that task of identifying as many of these file types as we can in our current production data, and giving them friendly labels.
@mheppler Yes, we could use a dedicated ticket for creating these "friendly" labels for as many types as possible. The file in question is ./src/main/java/MimeTypeDisplay.properties
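To illustrate the "friendly" label mechanism discussed above, here is a minimal Python sketch of a lookup that falls back to the raw MIME type when no label exists. The sample entry and the function names `parse_properties` and `friendly_type` are assumptions for this example; the actual contents and loading code of MimeTypeDisplay.properties are not shown in this thread. The parser handles only plain `key=value` lines and `#` comments, not the full Java properties syntax.

```python
# Hypothetical sample in the style of a .properties file; the real file's
# entries are not quoted in this thread.
SAMPLE_PROPERTIES = """
# friendly display names for MIME types
application/vnd.openxmlformats-officedocument.presentationml.presentation=PowerPoint
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            mapping[key.strip()] = value.strip()
    return mapping


def friendly_type(mime_type, mapping):
    """Return the friendly label, or the raw MIME type if none is listed."""
    return mapping.get(mime_type, mime_type)
```

Falling back to the raw type means an unlisted format still displays something, which is how the PowerPoint case above would have ended up showing the long MIME string.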
@landreev @pdurbin -- #2202 -- new issue for MIME Type improvements created. Enjoy.
Need to discuss this during a UI/UX team meeting to brainstorm ideas on how to show more file metadata without being overwhelming in the file card on the dataset page. Perhaps having a files metadata section in the metadata tab. @mcrosas @mheppler
After reviewing it with @eaquigley and @mheppler we plan to move this to 4.0.3.
Have a section in the metadata tab that is "Files" and displays this extra metadata (MD5 shows here and not on the files card if a UNF is available).
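The display rule proposed here can be sketched as follows, as one possible reading rather than the implemented behavior: the file card shows the UNF when one exists and falls back to the MD5 otherwise, while the "Files" section of the metadata tab lists everything available. The function names are hypothetical.

```python
def file_card_fixity(unf=None, md5=None):
    """Pick the single fixity value to show on the file card.

    Prefer the UNF when the file has one; otherwise fall back to the MD5.
    Returns a (label, value) tuple, or None if neither is available.
    """
    if unf:
        return ("UNF", unf)
    if md5:
        return ("MD5", md5)
    return None


def files_metadata_rows(unf=None, md5=None):
    """List every available fixity value for the metadata tab's Files section."""
    rows = []
    if unf:
        rows.append(("UNF", unf))
    if md5:
        rows.append(("MD5", md5))
    return rows
```

Under this reading, a file with no UNF for any reason would still show its MD5 on the card, provided the MD5 was stored at upload time.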
I had one question about this: is there a safeguard in place to ensure the MD5 gets added when tabular ingest fails for any reason? We have enough failures at the moment to cause me to ask. Thanks
Sonia Barbosa Manager of Data Curation, IQSS Dataverse Network Manager of the Murray Research Archive, IQSS Data Science Harvard University
From: Michael Heppler [notifications@github.com] Sent: Monday, July 06, 2015 10:58 AM To: IQSS/dataverse Cc: Barbosa, Sonia Subject: Re: [dataverse] Citation: Remove MD5s, if you have UNF (#2192)
Mockups:
Note: With the file landing page being pushed to 4.3, this removes the "Original File MD5" for tabular files completely from the UI.
OK looks good, closing.
Gary sent the following:
why are there MD5s? these I think should all be removed. we have UNFs instead.