FAIRDataPipeline / rDataPipeline

R implementation of the FAIR Data Pipeline API
https://www.fairdatapipeline.org/rDataPipeline
GNU General Public License v3.0

Implement data products being added from external objects such as papers #94

Open soniamitchell opened 3 years ago

soniamitchell commented 3 years ago

Once adding papers to the registry with the CLI is implemented (and the config file format is agreed upon), we can implement this.

richardreeve commented 3 years ago

I have (what I consider to be!) very silly/exciting ideas for this one. Can we come up with a way of recursing through papers and their references using the citation information (if that information is available programmatically), so we can put the whole dependency tree of a paper into the system (up to a certain depth, perhaps)? Can we also add in the supplementary materials and so on, so you actually have the whole paper and all its extra bits referenced sensibly in one place? Would that be useful if we did?
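
To make the recursion concrete, here's a minimal sketch, assuming a hypothetical fetch_references() helper that returns the DOIs a paper cites (e.g. backed by Crossref); none of these names exist in the pipeline:

```r
# Hypothetical helper: would query a citation source (e.g. Crossref)
# and return the DOIs cited by the given paper.
fetch_references <- function(doi) {
  character(0)  # stub
}

# Walk the citation graph to a fixed depth, collecting each DOI once
citation_tree <- function(doi, depth = 2, seen = character(0)) {
  if (depth < 0 || doi %in% seen) return(seen)
  seen <- c(seen, doi)
  for (ref in fetch_references(doi)) {
    seen <- citation_tree(ref, depth - 1, seen)
  }
  seen
}

# Every DOI returned here could then be registered as an external object
citation_tree("10.1000/example-doi", depth = 2)
```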

soniamitchell commented 3 years ago

Lol~ that actually sounds fun! Of course I need fair pull to implement paper imports first…

richardreeve commented 3 years ago

Yes, I'm totally with you on the "on hold" designation for this one...

richardreeve commented 3 years ago

As a thought experiment though, how would we turn a code run into something that would make sense for paper citations too while still doing its main job?

soniamitchell commented 3 years ago

I find it much easier to work this kind of stuff out during implementation. That way I can see the context, use an example, and spot any problems that might arise. But I'll play along...

My first thought would be to register papers in the same way issues are added, that is, via a script with no additional fields in the config file. Of course, that means the DP API would need to register them. Why are external objects registered during pull again? Was it because the DP API should be able to run offline? If so, that could be a problem.
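
To make the registration idea concrete, here's a sketch in which register_paper() is entirely made up (name, signature, and DOI included):

```r
library(rDataPipeline)

# Hypothetical: register_paper() does not exist in rDataPipeline; this
# stub just marks where an external_object record would be created.
register_paper <- function(handle, doi, description) {
  message("would register ", doi, " in the registry")
  invisible(handle)
}

# Like an issue, the paper is attached via the handle from a script,
# with no additional fields required in the config file.
handle <- initialise("config.yaml", "script.sh")
handle <- register_paper(handle,
                         doi = "10.1000/example-doi",
                         description = "Paper the data were extracted from")
```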

I would then include an optional DOI argument in write_array().
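
For example (a sketch only: doi is the proposed argument and doesn't exist yet; the other arguments follow the existing write_array() interface as I remember it):

```r
# A small table transcribed from a paper
extracted_table <- matrix(1:6, nrow = 2,
                          dimnames = list(c("a", "b"), c("x", "y", "z")))

# handle as in the sketch above
write_array(array = extracted_table,
            handle = handle,
            data_product = "paper/example/extracted-table",
            component = "table1",
            description = "Values transcribed from Table 1 of the paper",
            dimension_names = list(rowvalue = rownames(extracted_table),
                                   colvalue = colnames(extracted_table)),
            doi = "10.1000/example-doi")  # proposed optional argument
```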

I’d also add this to milestone 1, since adding data from papers seems pretty basic.

richardreeve commented 3 years ago

I see what you mean, and maybe we could do things that way (though there's definitely no time to add it to the 1.0 milestone!), but actually that's not quite what I was thinking of...

What I meant was that connecting papers and their citations involves no GitHub repo, and potentially no config files or run scripts either. So if we want to use the code run registry table to make a paper an output and its references inputs, we have to think about how that would work in the context of that table, or whether we would want to add a new table that specifically describes references rather than inputs.
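
As a sketch of the thought experiment (an R list standing in for a registry record; the field names are illustrative, not the actual schema):

```r
# Thought experiment only, not a real registry call: a code-run-like
# record in which the paper is the output and its references the inputs.
paper_run <- list(
  description = "Citation record for an imported paper",
  inputs  = c("doi://10.1000/ref-one",      # papers it cites
              "doi://10.1000/ref-two"),
  outputs = c("doi://10.1000/example-doi")  # the paper itself
)
```

The fact that "inputs" here means "cited by" rather than "read by the code" is exactly the mismatch that might argue for a separate references table.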

The other use for this I've been thinking of for a while is being able to reference papers inside your code: if you're implementing an algorithm or using a package in a file in your repo, you could cite it (using some clever syntax) at the point where you write that piece of code, and then (somehow, magically!) the pipeline would automatically pick up the dependencies and add the citation to the inputs/references list.
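
For instance (completely made up, just to illustrate), the "clever syntax" could be a structured comment tag that a future scanner picks out of source files and adds to the run's references:

```r
# Made-up citation tag; no scanner for this exists yet.
# @cite doi:10.1000/example-doi  (source of the algorithm below)
shannon_entropy <- function(p) {
  p <- p[p > 0]          # drop zero probabilities
  -sum(p * log(p))
}
```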

I appreciate I'm getting completely off topic here now, but I do wish it were easier to make sure that everything was correctly credited when I'm writing code... anyway, this isn't remotely high priority; I just thought it might be interesting to contemplate. You're probably right that it's easier to wait until we're actually trying to implement it.

soniamitchell commented 3 years ago

Ahh... I assumed you meant it more generally.

I'm not sure how useful having references as inputs and papers as outputs would be. I'm also not sure about adding references whilst coding; you might need to take me through that. Either way, I'd be hesitant to add too many pieces of functionality that aren't common use cases.

Doing an analysis / making a cool visualisation from the references sounds fun though. I’d be interested in that.