datalad / datalad-paper-joss

Repository for JOSS paper on DataLad
MIT License

Select title #9

Closed mih closed 3 years ago

mih commented 3 years ago

Not mandated, but ": " is a common title pattern in the journal.

Candidates from various sources

Clarification: :-1: is the refinement of :tada:, and votes for :tada: will be added to :-1: (unless double-voted). I encourage those who voted for :tada: to revote for :-1: if they agree; if you don't, please comment to support your choice of :tada: over :-1:.

yarikoptic commented 3 years ago

Added a candidate we defended recently and icons for voting (could be multiple)

adswa commented 3 years ago

decentralized versus decentral versus distributed?

decentralized sounds off in my ears

mih commented 3 years ago

decentralized versus decentral versus distributed?

decentralized sounds off in my ears

Good point. Git also uses "distributed". From git(1):

Git is a fast, scalable, distributed revision control system

yarikoptic commented 3 years ago

I was following https://www.degruyter.com/document/doi/10.1515/nf-2020-0037/html where no author seemed to raise a red flag in choosing "decentralized". @mih - what was your guide for choosing "decentralized" in favor of "distributed" there? Staying consistent with the title/dRDM in that paper would IMHO be a bonus, although if it was severely flawed, I am OK to "generalize" into "distributed":

Looking at https://medium.com/distributed-economy/what-is-the-difference-between-decentralized-and-distributed-systems-f4190a5c6462 I think "decentralized" fits somewhat better than distributed as to reflect the most common use cases, although "distributed" reflects the technology underneath -- that git/git-annex/datalad indeed allow for a more distributed mode of operation.

mih commented 3 years ago

There was no particular rationale or drive behind "decentralized". Given the labeling used by Git, I would have preferred to have made a different choice. As usual, I am also no believer in sticking to mistakes of the past ;-)

Re the comparison of the terms in the linked article: I think "decentralized" better fits actual usage patterns, but distributed is more appropriate for describing the technological capabilities. I suspect that the decentralized usage is largely driven by a deeply embedded concept of mine vs theirs.... we shall overcome ;-)

yarikoptic commented 3 years ago

As usual, I am also no believer in sticking to mistakes of the past ;-)

as with any kind of a "release" it might later become considered as buggy as the prior one ;) and

“You become responsible, forever, for what you have tamed.”

― Antoine de Saint-Exupéry, The Little Prince

Overall -- I am fine with either, although leaning to "decentralized" for consistency and better reflection of the typical usage patterns. I guess the vote(s) hopefully would help us make the decision.

dorianps commented 3 years ago

You guys probably have thought long about the title, but in my view the options above stress too much the concept of "distributed" or "decentralized" management, while the main feature I think datalad provides for all users has to do with data "tracking" or "versioning". Something like "collaborative distributed tracking" might reflect my perception of Datalad. Whether data management is distributed or not depends on the use case, i.e., some users may appreciate the distributed nature of datasets, others may use a centralized repository or even not collaborate with anyone and still benefit from the core data versioning (and time travel flexibility) of Datalad. Just my honest, shameless thought :)

yarikoptic commented 3 years ago

Thank you @dorianps for the feedback. Indeed, I think we somewhat missed the "versioning" aspect entirely, as if it was given. "Tracking" is somewhat implied by "decentralized" or "distributed" but not obvious, and it is also unclear on its own, so I am not sure it is appropriate for a title.
Indeed it is hard to embed all possible features/use-cases into a single title. Makes me appreciate the official (in manpage) description of git ("the stupid content tracker") once more.

dorianps commented 3 years ago

Just throwing an idea (without contributing a single line on the code):

Datalad: collaborative data tracking, transferring, and management, across multiple platforms

Platforms = non-specific catch all (linux, windows, git, uk biobank)

May still work if collaborative is replaced with distributed.

bpoldrack commented 3 years ago

I think "decentralized" better fits actual usage patterns, but distributed is more appropriate for describing the technological capabilities.

This. Which is why, in a software journal, my vote is on "distributed" as far as it refers to Datalad.

Whether data management is distributed or not depends on the use case, i.e., some users may appreciate the distributed nature of datasets, others may use a centralized repository or even not collaborate with anyone and still benefit from the core data versioning

True, too.

Datalad: collaborative data tracking, transferring, and management, across multiple platforms

May still work if collaborative is replaced with distributed.

Guess the cross-platform aspect can be left out of the title. If no platform is mentioned in it, we don't need to fight a possible impression that it's platform specific. Moreover hardly any VCS is.

So: Datalad: distributed versioning and management for research data? Maybe even "large data" instead of "research data". It's agnostic after all, and while we might want to draw particular attention from the scientific community, JOSS may be more useful for us if we get developers (potential contributors) interested with completely different use cases.

dorianps commented 3 years ago

@bpoldrack Your version looks good to me, too. Two thoughts:

  1. Research data sounds like a complicated tool for researchers only. I once read a post at git-annex with someone keeping inventory of DVDs using annex. Datalad can be used for any data, research or not.
  2. I thought one of the greatest strengths of Datalad is seamless transfer between platforms, i.e., going from linux to windows, from hard drive to usb, from local to cloud, etc. Those multiple transfer options are what makes it a universal tool for collaborations, that's why I included in the title, but even without it, management can still cover that aspect in a less specific way. So your title is good.

yarikoptic commented 3 years ago

re "large" - not necessarily, since could be used for management of "sensitive" (licenses, personal data, etc) data

re "management for research data" - it is IMHO already captured better by Research Data Management (RDM), which is a known concept. So the discussion seems to be still circling back to which critical features to include in the title to characterize such RDM better. But it seems the currently leading choice of the title doesn't even mention the "research" aspect ;)

leej3 commented 3 years ago

I thought I'd echo @dorianps's opinion. I appreciate that history shows the power of distributed over centralized... no one wants to go back to SVN, but that's not what excited me most about Datalad. With the next big thing being things like MLops and other such buzzwords, I wonder whether emphasizing Datalad's ability to become a core tool beyond research science would be valuable. Something like:

Datalad: A foundation for managing code, data, and environments

Assuming another choice is the last thing everyone needs I've voted on the suggestions though : )

bpoldrack commented 3 years ago

Datalad: A foundation for managing code, data, and environments

I really like that take.

mih commented 3 years ago

Me too! Thx to @dorianps and @leej3 for your perspective. I think we should consider this aspect for title and manuscript focus.

yarikoptic commented 3 years ago

I think that the "foundation" aspect should indeed be verbalized in the paper. But

yarikoptic commented 3 years ago

I think that the "foundation" aspect should indeed be verbalized in the paper.

https://github.com/datalad/datalad-paper-joss/pull/34 is a possible "lean" injection of the foundation aspect. I guess there could be other places where it could be injected, but I do not think that the JOSS paper would be the best venue to center on "foundational aspect" of DataLad.

leej3 commented 3 years ago

@yarikoptic “Foundation” felt more dramatic and inspiring but I agree it falls a little short in that it hints Datalad is not your all encompassing solution to handling these problems. I feel platform has been over-used because of the stuff in the cloud. I can’t think of a better choice though. It fits well. I like your alternative title.

Throwing out some other ideas in increasing order of absurdity in case one sticks or triggers an alternative in someone else’s head: “Approach”, “system”, “Core-tool”, “comprehensive toolkit”, “ecosystem”, “vision”, “armamentarium”, “panacea”

mih commented 3 years ago

What about "bedrock"?

I also like "infrastructure (tool)", but it shifts the focus away from the individual user.

A little esoteric: "digital companion for joint management of code and data"

yarikoptic commented 3 years ago

A little esoteric: "digital companion for joint management of code and data"

;-) it would have been nice to finally bring DataLad from its soulless form to reflect on its name origin of a curious youth as an alternative to "a person who is deemed to be despicable or contemptible".

pvavra commented 3 years ago

Following up on previous points about what got us excited about datalad: For me it was the provenance tracking. That feature of datalad is, as far as I can tell, really unique - and missing from the title.

Specifically, what is missing is that we can track which code changed/created which data. Atm the "joint management" could be read as "managing in one place", but, in a sense, in parallel/independently. I think the provenance aspect could be emphasized a bit more explicitly.

Riffing on the last title, something like

DataLad: distributed system for joint management of code, data, and their relationship

(that sounds a bit clunky to my ears.. but just to give an example of the direction I mean).

yarikoptic commented 3 years ago

Thank you @pvavra - (actionable for humans and computers) provenance of data transformations is indeed one of the killer features. Your suggestion sounds not too clunky and to the point to me.

yarikoptic commented 3 years ago

I have added the :-1: choice for the :tada:'s refinement and added a clarification. Everyone who voted (especially for :tada:) please consider adjusting your vote or expressing explicit (comment) preference for :tada: over :-1:

yarikoptic commented 3 years ago

the choice was made and it is

title: 'DataLad: distributed system for joint management of code, data, and their relationship'

in the paper