danielskatz / software-vs-data

understanding and documenting the differences between software and data in the context of citation
Creative Commons Attribution 4.0 International

The lifetime of software versus data #6

Closed npch closed 7 years ago

npch commented 8 years ago

There's a statement that says:

"The lifetime of software is generally not as long as that of data"

I think that's a tricky one to justify, because it appears to be more influenced by the field than by whether the bits are software or data.

For instance, in nuclear safeguards and civil engineering, both software and datasets are long lived, with specific older versions likely to be used for periods of time.

In other areas, data and the software used to store and analyse it are intrinsically linked and evolve together over time, e.g. entomology databases collecting observations, or infrared astronomical observations.

Finally, in the case of areas like gene sequencing, both the data and software are rapidly superseded as higher definition equipment becomes available, leaving relatively short lifetimes for both software and data to be useful.

Some case studies are collected as part of https://www.software.ac.uk/attach/SoftwarePreservationBenefitsFramework.pdf

danielskatz commented 8 years ago

I suppose the reason I wrote this originally is that at least some data (physically collected) may be useful forever, and many research problems are hampered by the lack of data from the past. On the other hand, it's not clear that any current software will be useful (or even work) 50 years from now.

I do agree with you, however, that the distinction is not so clear cut overall.

How can we resolve this? Just drop this as a difference? Explain it in much more detail?

npch commented 8 years ago

I would vote for dropping it as currently phrased, as I think different people will raise objections in different ways.

I do think that there is something that you're getting at - maybe it's that old software is much more rarely used in that state (it generally evolves), whereas old data may be. I'm not sure how much of this is covered by the bit rot example.

Oh, and 50 years is an interesting timescale, since a number of pieces of software that originated in the 1960s are still used today (albeit rarely in their exact 1960s form). However, the idea prevalent about 10 years ago that most old software would not work turns out to be more nuanced: most will not work on modern systems, but works fine on emulations of the historical systems. So software breaks moving forward, but can be preserved remarkably well.

danielskatz commented 8 years ago

@knarrff made an attempt to make this better. Since I think there is something to this point, even if I am not explaining it well, I hope others can bring out the correct part while removing the confusing parts.

danielskatz commented 8 years ago

Adding a comment from @jennielarkin (from #11): I would like quantitative proof of this one: this seems like someone’s assumption. Some data is long-lived and ought to be sustained for a long time. However, that is not necessarily the case for all data. One can easily say the same thing for software. I suspect there is a range for both data and software.

band commented 8 years ago

Following up on @jennielarkin's query about evidence on the lifetime of scientific data: a 1995 NRC report, "Preserving Scientific Data on Our Physical Universe" (http://www.nap.edu/catalog/4871.html), provides the following recommendations regarding retention criteria and the appraisal process (p. 40): "As a general rule, all observational data that are nonredundant, useful, and documented well enough for most primary uses should be permanently maintained. Laboratory data sets are candidates for long-term preservation if there is no realistic chance of repeating the experiment, or if the cost and intellectual effort required to collect and validate the data were so great that the long-term retention is clearly justified. For both observational and experimental data, the following retention criteria should be used to determine whether a data set should be saved: uniqueness, adequacy of documentation (metadata), availability of hardware to read the data records, cost of replacement, and evaluation by peer review. Complete metadata should define the content, format or representation, structure, and context of a data set."

So, perhaps, it is better to avoid generalized statements and provide some context for the claims that software has a shorter lifetime. I think the lifetime property is not as simple as stated. I do not have a better formulation right now.

danielskatz commented 8 years ago

Thanks - I'll add this under evidence

danielskatz commented 7 years ago

Does anyone want to further modify this? I don't want to completely remove it, but I'm also not fully satisfied with it.