ReScience / ten-years

Ten Years Reproducibility Challenge
BSD 2-Clause "Simplified" License

Questions #2

Closed by rougier 4 years ago

rougier commented 5 years ago

If you have any questions concerning the challenge, you can use this thread.

p16i commented 5 years ago

How can a new researcher contribute to this campaign? For example, someone just starting their research career will have no code older than 10 years.

khinsen commented 5 years ago

@heytitle Here is one idea: you could identify interesting papers in your field, and contact their authors to encourage them to participate in the challenge.

rougier commented 5 years ago

For new researchers, we also intend to have a repro hackathon in Bordeaux some time next year (with @annakrystalli) where you can try to reproduce papers from the literature. We'll have a special issue linked to this repro hackathon. And @annakrystalli is also organizing several other repro hackathons.

rougier commented 5 years ago

Would be nice to have an entry by Margaret Hamilton stating code is available but hardware is nowhere to be found :)

bpbond commented 4 years ago

Looking through #1, most people seem to be moderately to highly confident of success. It occurs to me there's probably a degree of self-selection at play, with people picking studies they're relatively confident of replicating. (Not sure, speculating.) Anyway, the degree to which this isn't a random sample should probably be addressed somewhere in the special issue.

rougier commented 4 years ago

You're right, and one of the problems may be the impossibility of finding sources. There is a proposal for a collective article giving an account of failed replications, so that people can quickly explain why they failed without having to write a full article just to explain that they could not find the sources (for example). But apart from that, I'm not sure how to address this bias. We'll underline it in the editorial for the special issue. Note that the bias also applies to regular replications: so far, we've only published successful replications.

khinsen commented 4 years ago

Self-selection is an eternal problem with publication. It starts even earlier: people who know about ReScience are already a self-selected minority of scientists interested in reproducibility questions. For doing statistics on reproducibility, we'd have to do something like a poll of a random selection of researchers, not a call for contributions. BTW, I know someone who has been wanting to do exactly that for a few years, but never got around to actually doing it (lack of funding, etc.).

kyleniemeyer commented 4 years ago

@rougier @khinsen just to clarify, what is the deadline for the article/reproducibility report associated with the challenge?

khinsen commented 4 years ago

@kyleniemeyer April 1st (see https://rescience.github.io/ten-years/)

ev-br commented 4 years ago
  1. Is there a deadline for the entry? (as in, is 8 Jan 2020 still OK for declaring the participation?)

  2. I am nearly certain I won't be able to run the whole set of simulations from the 2006 paper, because I won't be able to justify the use of that much computational resources (it was fairly substantial back in 2006, the machine was a vector Cray, and we specifically targeted vector machines). Is it still OK to target a representative subset? (If this subset runs, the rest also runs; it only requires more CPU time.)

khinsen commented 4 years ago

@ev-br The only deadline is April 1st for submissions. You can declare participation in the morning and submit in the afternoon if you like!

A representative subset looks reasonable, just be sure to state this in your submission. It will then be the reviewer's job to decide if it's representative enough.

brembs commented 4 years ago

Just noticed in the FAQ that the code has to come from "myself". What does that mean specifically? Here is my case: 1) The code I used to collect the data (Turbo Pascal) already existed before I started in the lab, and I modified it. 2) The code to analyze the data and export the derived results into a spreadsheet was written in C++ (MFC) by a grad student with whom I shared the room, and I also contributed a little to writing it. I still have the source code, and the executables still run (from around 2000). 3) As the software used to visualize the derived results was proprietary and I no longer have it (not sure Statistica still exists?), I have written a short R script that takes the spreadsheet and creates bar graphs.

So this means that the raw data (in our own local format written by Turbo Pascal) that I collected myself can still be read by the C++ analysis software (written largely by someone else) and with my own, current R code, I can show that the data produce the exact same graphs as in the original publication from 2000. Does that mean I qualify or is my contribution too little?

rougier commented 4 years ago

I think the FAQ might be a bit too restrictive. The idea was to test the (original) code you used in your article. I don't think we meant you have to have written everything yourself, so from my understanding of your explanation, I think you're good to go. You might need to add the above explanation in your article, especially for the current R code / Statistica (if you can trace the history of this software, that might be even better).

khinsen commented 4 years ago

I agree with @rougier's point of view. The point of the challenge is to let scientists evaluate the possibility of re-doing work they published in the past. It isn't so important who wrote exactly what code - what we are after is reports of how future-safe the methods of the past turned out to be. So if you consider your case unusual compared to the others, or to the description of the challenge, that means most of all that you should describe the particularities in your article.

brembs commented 4 years ago

I've organized the code and the data according to the descriptions. Now I'm preparing to write the paper, and I'm reading these two descriptions: https://github.com/ReScience/ReScience-submission and https://github.com/ReScience/submissions. Do I understand correctly that I have to install Python and figure out how to generate LaTeX documents using make and the templates provided? (I've used Overleaf before and find it tedious, cumbersome, and more time spent with the system than with the text.)

khinsen commented 4 years ago

@brembs You can prepare your article with whatever tools you like, provided that you can produce a PDF file for submission in the end. We also need the YAML file with the metadata. There is no requirement to use our template, which so far is LaTeX-only because that's what we know best.

BTW, the first repository you cite is now obsolete. Use only https://github.com/ReScience/submissions.

brembs commented 4 years ago

Excellent, this is not a problem!

weinman commented 4 years ago

(Apologies for such a simpleton question for a newcomer to this area.) I will likely have results that are statistically similar to what is published, but the exact results are not the same. I would naively call that a successful replication, though it's not necessarily a precisely repeated/reproduced experiment where the numerical outcomes are identical. Instead I would say the trends are the same and the conclusions hold up.

In my submission should I classify this as Rp or ¬Rp?

khinsen commented 4 years ago

Ultimately that's something you should discuss with the reviewer(s) who have read and commented on your paper. But as you describe the situation, it looks more like a success than a failure to me.

broukema commented 4 years ago

Is there any word limit on the abstract? Since the template doesn't generate any warning to the user, I suspect that the answer is "Formally, no, but be reasonable." In case it was missed - @khinsen, @rougier - over at https://github.com/ReScience/ten-years/issues/9, three others and I are hoping for a bit of an extension of the deadline.

In my case, you can see that my text is nearly ready for submission, along with the code at Codeberg - unfortunately with a negative result, and plenty of feedback on the difficulties.

khinsen commented 4 years ago

Deadline extension: see https://github.com/ReScience/ten-years/issues/1#issuecomment-622891648

khinsen commented 4 years ago

@weinman That's hard to say in the abstract, so it's something you should discuss with your reviewer. Pick any one for your submission and change it later if appropriate.

broukema commented 4 years ago

@rougier @oliviaguest @khinsen @pdebuyl or any of the contributors: Which computer science subject class on ArXiv would make most sense for the Ten Years Challenge papers? - https://arxiv.org/corr/subjectclasses

I'm not really convinced by any of these. There's also:

I think it would be good to encourage all of the Ten Years' Science Challenge authors to post our papers as (p)reprints (either before or after acceptance, including the final accepted version) on ArXiv or possibly BiorXiv. Acceptance on ArXiv is not guaranteed, but if the ReScience C Ten Years' Science Challenge papers are accepted by the ArXiv moderators, this will work in favour of "replicability/reproducibility science" and help the journal itself gain wider recognition from the scientific community.

I guess my tendency would be to choose DL - these articles are contributing to the concept of digital preservation of scientific research papers in a deeper sense than that of the human-readable PDF.

Any arguments for/against any of these (or other) options would be welcome! :)

oliviaguest commented 4 years ago

I am not bothered either way and have no opinion formed at the moment, but I'm curious: does posting on arXiv offer more archiving possibilities (because they are picked up by Google Scholar, for example)?

bpbond commented 4 years ago

As you note, there are links to many of those categories, but DL seems the best/most logical to me.

broukema commented 4 years ago

@oliviaguest ArXiv is an archive, as can be guessed from the name :). It's nearly 30 years old, so in terms of longevity, it's clearly stable. Moreover, (i) it provides a uniform, community-based way of collecting together papers by an author in the physics/astronomy/maths/statistics/computer-science area of scholarly studies; and (ii) the highly standardised and well-recognised use of ArXiv:yymm.nnnnn identifiers (with one change in 2007) shows a human or robot reader of a paper's bibliography that that particular reference is necessarily available under green open access.

It's much more motivating for a human to go immediately to an open access reference than to spend extra time finding the URL of a reference whose access type is unknown. Robots' decisions of where to explore are also made easier this way.

oliviaguest commented 4 years ago

@broukema OK, I think we're having a weird miscommunication. I have used arXiv, have preprints on there, etc. I just mean why specifically are we using it here? Maybe I've missed something above, but don't we usually use Zenodo for ReScience C?

broukema commented 4 years ago

I'm not proposing ArXiv as an alternative to Zenodo for archiving ReScience C PDFs; ArXiv is complementary to Zenodo, and stores the sources of papers, not the final PDFs (it caches the PDFs for some time).

I'm rather thinking of the bibliometrics of scientific articles, and the general efficiency and modularity of scientific communication. See (i) and (ii) above. There's also the fact, mentioned above, that ArXiv moderators provide a qualitative filter of minimal scientific quality for research articles, which, I presume, is not done on Zenodo (I haven't used Zenodo much, so I'm not sure).

There are quite a few other differences between ArXiv and Zenodo which I haven't mentioned above.

I certainly intend to post my article on ArXiv. Whether other authors wish to do so, or whether the editors as a whole choose to make this a recommendation, is up to the authors and editors to decide: my arguments are above.

My feeling is that if a large fraction of ReScience C articles are accepted by ArXiv moderators as valid scientific research papers, then that will help strengthen the reputation of ReScience C as a serious scientific journal.

khinsen commented 4 years ago

@broukema The question of how to manage / archive our articles comes up from time to time, we are definitely not in a stable state. Currently we only use Zenodo for archiving the PDFs, and increasingly Software Heritage for archiving code repositories.

One problem with adopting a more feature-rich but also more specialized platform such as arXiv is the heterogeneity of our submissions. ReScience C covers (in theory) all domains of computational science, which have widely differing habits. Most authors use LaTeX and our article template, but this is intentionally not obligatory. Likewise, many scientific disciplines are represented in ReScience C but not in arXiv. We could certainly agree among ourselves to have authors submit all ReScience C submissions under DL, but it's the arXiv curators who have the last word on the choice of category and I have no idea how they deal with articles they consider out of scope.

I'd rather start from the other end and ask: what do we want to improve compared to our current system? A frequent request is indexation by Google Scholar, which hasn't made much progress mainly because Google doesn't have clear rules for that. You suggest archiving the source code, which is interesting as well and can be realized in many ways. The most interesting aspect of arXiv that you point out is curation. This could indeed increase ReScience's reputation in the domains covered by arXiv (which is a small minority of our contributions so far), so I think it's definitely worth considering as a recommendation to our authors - but then they should ideally submit in their respective arXiv categories.

rougier commented 4 years ago

Also, compiling LaTeX on arXiv is kind of a nightmare if you don't use the exact same version as they do (which is not really up to date).

broukema commented 4 years ago

@rougier I've been posting to ArXiv for several decades - I vaguely recall minor LaTeXing problems only once or twice, and I assume that the version of LaTeX that I use - normally the Debian GNU/Linux stable or sometimes oldstable version - has almost never been identical to the one provided. So our experience differs here. I use ordinary source debugging (git history, binary search) to debug LaTeX errors, but more user-friendly and powerful LaTeX debugging tools exist.

@khinsen I'm not convinced that the astro-ph.CO moderators would see my article as a cosmology research article, since it's really at a meta-meta-level compared to cosmology - it's about methodology (a case study of a method) - with no cosmological result. But I think that astro-ph.CO (in my case) is worth trying as a secondary category in addition to cs.DL. The moderators in other specialties will each make their own judgments - it's certainly reasonable to try. The individual moderators' decisions are not public, but their names are public, so a systematic refusal could in principle be later raised for wider public discussion.

oliviaguest commented 4 years ago

Did we lose track of the original question?

Which computer science subject class on ArXiv would make most sense for the Ten Years Challenge papers?

@broukema go for DL, I think — why not? ☺️

broukema commented 4 years ago

@oliviaguest We did digress a bit :). But I think that Konrad's point about considering the scientific topic of the original paper is a fair one - ArXiv normally allows a secondary category, and that to me seems reasonable. I'll wait until either the 03 or 04 step of the editorial process before posting my paper on ArXiv - so there's still time for anyone else interested to provide other suggestions/arguments. I've got a 02 label, and 03 is probably not far away ;).

broukema commented 4 years ago

Just to clarify the crosslink: I proposed cs.DL (primary) and astro-ph.CO (secondary); ArXiv moderators took three weeks to decide and accepted my article in cs.CY (primary) and cs.SE (secondary): https://github.com/ReScience/submissions/issues/41

Others intending to submit their papers to ArXiv should probably consider choosing cs.CY and cs.SE immediately rather than waiting for reclassification.

khinsen commented 4 years ago

Thanks @broukema for reporting on your arXiv experience!