Document citations of "wrapped" software

There is a vague boundary between "things that we should ask users of a certain command to cite" and "things that we don't need to ask users of a certain command to cite".

For example, strainFlye align uses minimap2 to do alignment; should we ask users of this command to cite minimap2 in addition to strainFlye? To me, the answer seems like a pretty solid yes. It doesn't seem fair to me to say that strainFlye is "doing alignment." The next question: do we also ask people to cite the other dependencies of strainFlye align (samtools, pysam, scikit-bio, Click, ...)? I'm not sure that there is a clear agreed-upon answer to this question. (We already cite this stuff in the paper, but the question is whether users of strainFlye should also cite it.)

Neither "extreme" option (option 1: don't bother asking users to cite any of the dependencies, not even e.g. minimap2; option 2: ask users to cite literally everything, down to the Python modules) seems practical. So the best solution is probably somewhere in the middle, at least given how citing things works in academia right now. (Ideally we'd have some automated system set up that would "give credit" to the devs of every dependency and module we use, even incidentally.)

To me, one approach that seems reasonable is defining a difference between "wrapping" a tool and just "using" a tool to do something routine as a small part of something else. Using minimap2 to do alignment in strainFlye align is wrapping: the main output of this command is the alignment, albeit after some filtering. Using samtools or bcftools to index, sort, etc. a file we created through other means is routine use. If we focus on the "wrapped" software as that which should definitely be cited when its corresponding strainFlye command is used, then there are two obvious instances of this: minimap2 for strainFlye align (the main output of this command is the minimap2 alignments), and LJA for strainFlye smooth assemble (the main output of this command is the LJA assemblies).

The use of Prodigal in strainFlye fdr estimate (we use Prodigal to predict genes, then we use these genes to adjust the decoy context—e.g. focusing on CP2 positions) is a gray area, but I'd lean towards it being more on the "routine" side of things since the resulting gene predictions are only used for a small part of a larger task and aren't output on their own. There's not a big risk of someone coming away with the impression that "we used strainFlye to predict genes," for example.

Using this distinction seems reasonable to me, and I think it matches the general state of affairs when people cite pipelines nowadays.

Anyway, given these associations between commands and dependencies: the elegant QIIME 2-ish way to handle this is setting up something that associates each command with a set of citations (e.g. a --citations option that can be run for each command). But QIIME 2 is extra nice because these citations are bundled in with the QZA/QZV outputs of QIIME 2 commands, and we don't have that luxury here. The caveman grad student way is just documenting recommended citations in the README, so let's go with that.
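For reference, here's a rough sketch of what the per-command association could look like if we ever outgrow the README approach. Everything here is hypothetical (the `WRAPPED_TOOLS` name, the `recommended_citations` function, and the idea of keying on command names are just assumptions for illustration, not anything that exists in strainFlye), and it only lists tool names rather than full references:

```python
# Hypothetical mapping from strainFlye commands to the tools they "wrap".
# Only the two clear-cut cases discussed above are listed; tools used for
# routine steps (samtools, bcftools, Prodigal, ...) are deliberately left out.
WRAPPED_TOOLS = {
    "align": ["minimap2"],
    "smooth assemble": ["LJA"],
}


def recommended_citations(command):
    """Return what a user of the given command should cite.

    strainFlye itself always comes first, followed by any wrapped tools.
    Commands without wrapped tools just get strainFlye.
    """
    return ["strainFlye"] + WRAPPED_TOOLS.get(command, [])


if __name__ == "__main__":
    # Print the recommendation for a few commands, including one
    # (fdr estimate) that has no wrapped tool under this definition.
    for cmd in ("align", "smooth assemble", "fdr estimate"):
        tools = ", ".join(recommended_citations(cmd))
        print(f"strainFlye {cmd}: please cite {tools}")
```

A table like this could back a --citations option later, but the same information fits fine in a README section for now.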
