WSSSPE / meetings

Products from Working towards Sustainable Software for Science: Practice and Experiences (WSSSPE) activities

Credit: Hacking the credit and citation ecosystem (making it work, or work better, for software) #51

Open danielskatz opened 9 years ago

kyleniemeyer commented 9 years ago

These are just some brainstorming thoughts—where, by definition, no ideas are bad...—but one way to hack the existing citation ecosystem would be to introduce a hierarchy (or >1 level, at least) of citations. Right now, all citations in a paper are treated equally, but clearly certain references (such as those for software) contribute more.

For example, if a study relied on a particular CFD code to obtain all of their results, arguably the study wouldn't exist without that software and thus its creators should get more credit than a standard citation. As it is, that sort of "vital" citation isn't recognized any differently than any other.

So, perhaps one way to make the current system work better for software would be to introduce a new "substantial" or "significant" citation category to indicate the greater importance/dependence on products such as software and data. Then, looking at the counts of these sorts of citations would help others see the associated contribution to the field/science better than existing citation metrics. Also, this could tie in to the proposed transitive credit schemes.

Of course, this may not make much of a difference for folks on the extremes of the spectrum... software or data that others aren't using won't see any changes, and highly cited software packages already get a good deal of credit with large numbers of citations.

danielskatz commented 9 years ago

There's some overlap between this idea and both my transitive credit idea (http://arxiv.org/abs/1407.5117) and the Project Credit work (http://projectcredit.net), though it's not the same as either. In transitive credit, contriponents (contributors and components) are each given a weight, and in project credit, there has been discussion of contributions being at one of two or three levels, though this applies only to contributors, not citations.

kyleniemeyer commented 9 years ago

I wasn't aware of Project Credit—that's interesting, thanks. Unless I'm mistaken, it looks like that is primarily (if not entirely) focused on assigning credit to people serving in various roles on a publication.

However, now that I take another look at your transitive credit ideas, I do see that both people and products (software, data, etc.) are assigned credit percentages. (This may be slightly off-topic for this particular discussion, but I do wonder about how someone writing a paper would divide the credit between their efforts and a software package they used.) Certainly, transitive credit would give a more quantifiable measure of credit compared to my idea.

danielskatz commented 9 years ago

none of these ideas are isolated, and none are completely satisfactory, at least to me. But I would like to push transitive credit further, and I think it could be merged with (or overlaid on) project credit. On the other hand, as you say, there are some questions about the details.

kyleniemeyer commented 9 years ago

Running with transitive credit for now, it seems like it should be possible to take existing citation relationships (directed graphs, really) and the associated ecosystem, and apply the credit map for each paper. However, I suppose that would require applying some credit percentage to each citation in a paper... unless the existing citation system remained complementary to the credit system.
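The "credit map over a citation graph" idea above can be pictured as a recursive walk: if each product carries a weight map over its contributors and the products it cites, a person's transitive credit from a downstream paper is their direct weight plus whatever flows through cited products. A minimal sketch, assuming a hypothetical dict-of-dicts representation (none of these names or structures come from an actual credit system):

```python
def transitive_credit(product, person, credit_maps):
    """Fraction of credit in `product` attributable to `person`,
    following the credit map of each cited product recursively.

    credit_maps: {product_id: {contributor_or_product_id: weight}}
    Assumes the citation graph is acyclic (as citation graphs
    essentially are); cycles would recurse forever.
    """
    total = 0.0
    for item, weight in credit_maps.get(product, {}).items():
        if item == person:
            total += weight  # direct contribution to this product
        elif item in credit_maps:
            # item is itself a product: credit flows transitively
            total += weight * transitive_credit(item, person, credit_maps)
    return total


# Example: paper_a splits credit between its author and a code it used;
# the code's credit belongs entirely to carol.
maps = {
    "paper_a": {"alice": 0.5, "code_x": 0.5},
    "code_x": {"carol": 1.0},
}
# carol's share of paper_a flows through code_x: 0.5 * 1.0 = 0.5
```

This keeps the existing citation graph intact; the credit map is just an annotation layered on each node.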

danielskatz commented 9 years ago

right, and weights could just be autocalculated as even to start (10 citations -> each gets a 0.1 weight)... Ok, authors also get even weights... Maybe authors get 0.5 divided evenly, and citations get 0.5 divided evenly. There probably would have to be defaults like this to start with in any case, even if the submitter was making changes later.
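The even-split default described above is simple enough to sketch directly (a hypothetical illustration; the function name and the 0.5/0.5 split are just the "default default" under discussion, not an established scheme):

```python
def default_credit_map(authors, citations):
    """Default transitive-credit weights: authors share 0.5 evenly,
    cited products share the remaining 0.5 evenly. A submitter could
    override these defaults later."""
    weights = {}
    if authors:
        author_share = 0.5 / len(authors)
        for a in authors:
            weights[a] = author_share
    if citations:
        citation_share = 0.5 / len(citations)
        for c in citations:
            weights[c] = citation_share
    return weights


# Example: 2 authors, 10 citations
m = default_credit_map(["alice", "bob"], [f"ref{i}" for i in range(10)])
# each author gets 0.25, each citation 0.05, summing to 1.0
```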

kyleniemeyer commented 9 years ago

Yeah, that essentially overlaps with my original suggestion, but in a more quantifiable way and allowing as many hierarchies in citations as you want. I agree that this could be overlaid on project credit, where the different role classifications could relate to default (different) credit weights.

Defaults are a good idea, although I imagine there would be some pushback on splitting the credit 50-50 between authors and citations—but of course an author could change as needed.

danielskatz commented 9 years ago

I don't know what the right default is - 50/50 is just the default default :)

knarrff commented 9 years ago

Credit for authors and credit for citations are really two different things. The first is credit for that particular paper, the latter for previous work, and I think everybody understands it that way. I would leave it at that. Comparing the two isn't really possible, and trying to is not a good idea, imho. Otherwise it would be too easy to give the authors 100%, just to boost their credit.

danielskatz commented 9 years ago

I don't really agree. If you think about what should be given credit for a new product (whether a paper, software, data, etc.), it's both the people who directly contributed to it as well as all the other products that were needed to make it work. In my version of what we should try, the person who registers the new product should weight the credit for both the people and products. Of course, all contributors should agree.

kyleniemeyer commented 9 years ago

I agree that it could be difficult to compare the contributions of, for example, the authors or the software they used to perform a study. I suppose that could lead to bigger philosophical questions about the capabilities of a tool vs. the novelty of what is done with it.

However, as it is, for papers the current citation system severely underrepresents the importance of software in particular by only giving them a citation at the same level/weighting as any other citation in the paper. I think many would agree that isn't fair for studies that relied heavily on software that someone else developed—going back to my CFD example, the study wouldn't be possible without the software developed by others, so they should get some credit for their contribution (which arguably could be on the same level as an individual author).

hlapp commented 9 years ago

I'm not seeing how software is different in this regard from prior science without which a study wouldn't have been possible either. Or experimental protocols. Are we suggesting to ask authors to draw some kind of line between the science and materials without which a study wouldn't have been possible, and those it could have done without? Isn't that pretending that scientific advance can be viewed as following a tree of derivation, rather than being the result of an interwoven network?

kyleniemeyer commented 9 years ago

Well, I think there is a difference between building on past work, and directly using someone else's software.

The analogy on the experimental side of things would be to use someone else's experimental equipment to do your study—not to build your own setup based on what they described. In such a case, at least in the papers I've seen, typically the owners of the equipment being used show up as authors of the paper!

So, they are clearly getting more credit than a simple citation, which is what creators of software being directly utilized are getting right now. I agree that if you develop your own software (or experiment) based on past work then a citation is appropriate—the issue is when you are using something directly created by someone else.

johnwcobb commented 9 years ago

Re: experimental equipment use implies co-authorship: I do not think this is universal but rather varies by field and often within a field.

In a wider discussion of credit (transitive or otherwise) one might hope that in addition to plowing new ground we could also help standardize existing practice. As an equipment provider should I expect acknowledgement, citation, or co-authorship? As an instrument user, I should clearly know what is expected of me. If we can communicate clearly then it is probably a win.

kyleniemeyer commented 9 years ago

Yeah, that is certainly based only on my experience (and might not even be consistent in my field).

I definitely agree that what we're discussing could equally apply to equipment use as software use, if it's some unique experimental setup... I imagine you'd have to draw the line somewhere, for example, at standard, easily available equipment or software tools (e.g., compilers).

dangunter commented 9 years ago

@kyleniemeyer, I agree it's unique capability that is important. To me, it can be refined by saying that unique capability needed to reproduce the results -- not just for convenience, or personal preference -- should get a gold star.

In the future when more publications are living recomputable records, perhaps the act of swapping certain components and not others in order to reproduce the results will make this less of a subjective call and more a directly measurable attribute.

sctchoi commented 9 years ago

The main thing from my point of view is that scientific software is not being sufficiently cited. Can we create a software citation website, similar to, say, ResearchGate but focused on scientific software, that automatically parses existing papers and software files for mentions of software and generates BibTeX and JSON-LD entries, plus corresponding citation/credit statistics (e.g., software-specific citation counts, h-index, i10-index)? In the process we could build an expanding archive of scientific software entries over time, and perhaps foster a culture of crediting important software in the scientific/engineering communities.
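The software-specific metrics mentioned are straightforward to compute once per-package citation counts exist; as a small illustration, here is the standard h-index definition (largest h such that h items each have at least h citations), applied to a hypothetical list of per-package counts:

```python
def h_index(citation_counts):
    """Largest h such that h items each have at least h citations."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# e.g. h_index([10, 8, 5, 4, 3]) == 4
```

The hard part, of course, is the parsing and disambiguation needed to produce those counts in the first place, not the metric itself.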

dangunter commented 9 years ago

@sctchoi I would start with quantifying that assertion based on some data. How much is software cited now? How does that vary by community? What is a working definition of "sufficient"? This kind of study could also look at the "dark matter" of publications that don't cite or mention software they use (which I hope would be corrected in the review process, but I would like to see proof).

hlapp commented 9 years ago

I agree this would need some supporting data. Those data may also show that citation behavior is highly uneven across fields. For example, in biology, the by far most cited papers in almost all of its fields are about scientific software. (This still doesn't have to mean that all scientific software in biology is sufficiently cited. But where software is under-cited is far from as obvious as your assertion makes it sound.)

jameshowison commented 9 years ago

Hi all,

Two things to add here. First, you'll find some empirical data on what software is cited in biology in this publication, see Appendix B (Table 9). Clearly, though, the sample size (90 articles) was not large enough to build any real league tables etc (notice that the most frequently cited software only showed up 4 times and most packages only showed up once in the sample).

Howison, J., & Bullard, J. (in press). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology (JASIST). http://doi.org/10.1002/asi.23538

Second, the data set is available from the JASIST paper (the first paper above) and I am starting to use it to do machine learning to recognize software mentions, but I'd be very interested in helping others use both the existing coded data and the content analysis scheme (to increase the size of the gold standard set and to extend to other fields). The original publications as pdf (and text) and the coded dataset are in the softcite repo and the dataset is here:

https://github.com/jameshowison/softcite/blob/master/data/SoftwareCitationDataset.ttl

--James

btw, I'd love to extend this and find out what software was used but not cited. The paper below has a start toward that (and a few other things) for 3 papers, using interviews with the authors/postdocs etc. In short, an awful lot went uncited, especially once one considers the dependencies of the initial packages. I never actually quantified the number of packages that showed up in the interviews but not in the paper, but I could, if people thought it would be worthwhile.

Howison, J., & Herbsleb, J. D. (2011). Scientific software production: incentives and collaboration. In Proceedings of the ACM Conference on Computer Supported Cooperative Work (pp. 513–522). Hangzhou, China. http://doi.org/10.1145/1958824.1958904


danielskatz commented 9 years ago

google doc for notes at WSSSPE3 - https://docs.google.com/document/d/1oN0ZYqIoWtOE1LBMIlWY9N8nn5LHTncj8GjUKPh62pA/edit?usp=sharing

kyleniemeyer commented 9 years ago

Note that the plan is to merge this group with the FORCE11 Software Citation Working Group.