ContentMine / meta

A repository in which to file and fix meta issues (issues affecting more than one ContentMine repo or project)

Choose code hosting service #2

Closed ghost closed 7 years ago

ghost commented 7 years ago

@petermr and I have discussed the possibility of migrating ContentMine's code hosting to somewhere more free (as in freedom) than GitHub, e.g. GitLab.

If ContentMine did migrate from GitHub to GitLab, then:

| Criterion | GitHub | GitLab | Criterion winner | GitHub score subtotal | GitLab score subtotal |
|---|---|---|---|---|---|
| GNU Ethical Repository Criteria | Grade: F | Grade: C | GitLab | 0 | 1 |
| Supports nested groups | No | Yes | GitLab | 0 | 2 |
| Non-Five Eyes jurisdiction | No (USA) | Yes (Netherlands) | GitLab | 0 | 3 |
| Ability to move issues between repos | No | Yes | GitLab | 0 | 4 |
| Ability to attach arbitrary files to comments | No | Yes | GitLab | 0 | 5 |
| ToS may conflict with free software licenses | Yes | No | GitLab | 0 | 6 |
| HTTPS support for hosted pages with custom URLs | No | Yes | GitLab | 0 | 7 |
| Unlimited private repositories gratis | No | Yes | GitLab | 0 | 8 |
| Unlimited Kanban-style issue boards gratis | No | Yes | GitLab | 0 | 9 |
| Unlimited public repositories gratis | Yes | Yes | Tie | 1 | 10 |
| Very fast interface | Yes | Varies | GitHub | 2 | 10 |
| Most prominent code hosting service | Yes | Varies | GitHub | 3 | 10 |
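For anyone who wants to re-check the subtotal columns, here is a minimal Python sketch that recomputes them from the "Criterion winner" column. The scoring rule (one point per criterion won, and one point to each side on a tie) is inferred from the table, not stated anywhere in the thread, and the names `ROWS` and `running_subtotals` are made up for this illustration.

```python
# Rows transcribed from the comparison table: (criterion, winner).
ROWS = [
    ("GNU Ethical Repository Criteria", "GitLab"),
    ("Supports nested groups", "GitLab"),
    ("Non-Five Eyes jurisdiction", "GitLab"),
    ("Ability to move issues between repos", "GitLab"),
    ("Ability to attach arbitrary files to comments", "GitLab"),
    ("ToS may conflict with free software licenses", "GitLab"),
    ("HTTPS support for hosted pages with custom URLs", "GitLab"),
    ("Unlimited private repositories gratis", "GitLab"),
    ("Unlimited Kanban-style issue boards gratis", "GitLab"),
    ("Unlimited public repositories gratis", "Tie"),
    ("Very fast interface", "GitHub"),
    ("Most prominent code hosting service", "GitHub"),
]


def running_subtotals(rows):
    """Return the (github, gitlab) subtotal after each row.

    Assumed rule: the winner of a row gets one point; a tie gives
    one point to both services.
    """
    github = gitlab = 0
    totals = []
    for _criterion, winner in rows:
        if winner in ("GitHub", "Tie"):
            github += 1
        if winner in ("GitLab", "Tie"):
            gitlab += 1
        totals.append((github, gitlab))
    return totals


if __name__ == "__main__":
    for (criterion, winner), (gh, gl) in zip(ROWS, running_subtotals(ROWS)):
        print(f"{criterion:<50} {winner:<7} {gh:>2} {gl:>3}")
```

Under that assumed rule the final row comes out at 3 for GitHub and 10 for GitLab, matching the table. Every criterion carries equal weight here, which is a simplification.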

If ContentMine does decide to move:

Comments welcome! In particular, it would be great to know if anyone thinks ContentMine should not migrate to GitLab (and why not). Thank you :)

ghost commented 7 years ago

@jcmolloy, I understand you looked into using ZenHub for @ContentMine. Would you be OK with ContentMine using GitLab Issue Boards instead (e.g. if ContentMine migrated to GitLab)?

tarrow commented 7 years ago

I like GitLab. I think they are a good team and I agree they are more ethical than github.

There are a few disadvantages to moving away from it though:

In my opinion right now I don't personally think the advantages of moving outweigh the cost. I don't believe that from the comparison table above we would derive much advantage (although clearly by these criteria GitLab outperforms GitHub). For each point in turn:

ghost commented 7 years ago

@tarrow wrote:

I like GitLab. I think they are a good team and I agree they are more ethical than github.

:-)

There are a few disadvantages to moving away from it though:

  • IMHO github is the de facto location for the FOSS community. It's sort of the sourceforge of our time. I think if we want to attract contributors this is the place to have a presence

Added a line for this in the table above.

I would note that contributors who deeply value free software might be more likely to be found on GitLab than GitHub. And I would argue that these are the new contributors that ContentMine needs most.

  • Because of this some external tools are very nicely integrated only with github (e.g. npm install contentmine/getpapers gets master from github)

"Since NPM v3 there is built-in support for GitLab and other sources (BitBucket, Gist)" (source). For more, see:

So, at least in the case of npm this objection appears to be outdated. In any case, it would be a hostage to fortune for ContentMine to depend upon tools that tie it to any particular proprietary platform.
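To make the npm point concrete, here is a small Python sketch that builds the hosted-Git package specifiers npm accepts. The `github:`, `gitlab:` and `bitbucket:` prefixes are npm's documented shorthand. The helper name `npm_git_spec` is invented for this example, and a `contentmine/getpapers` project on GitLab is purely hypothetical.

```python
# Prefixes npm documents for hosted Git repositories.
NPM_HOST_PREFIXES = {
    "github": "github:",
    "gitlab": "gitlab:",
    "bitbucket": "bitbucket:",
}


def npm_git_spec(host, owner, repo, ref=None):
    """Build an npm package specifier such as 'gitlab:owner/repo#ref'.

    `ref` is an optional branch, tag or commit; when omitted, npm
    falls back to the repository's default branch.
    """
    if host not in NPM_HOST_PREFIXES:
        raise ValueError(f"unsupported host: {host!r}")
    spec = f"{NPM_HOST_PREFIXES[host]}{owner}/{repo}"
    if ref:
        spec += f"#{ref}"
    return spec


if __name__ == "__main__":
    # The GitLab location below is hypothetical; it only shows the syntax.
    for host in ("github", "gitlab"):
        print("npm install", npm_git_spec(host, "contentmine", "getpapers"))
```

So the only change on the user's side would be the prefix: `npm install gitlab:contentmine/getpapers` instead of `npm install contentmine/getpapers`, assuming the project were published under that path.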

  • There is a cost to migrating repositories in team member time

Agreed, but as noted above, if ContentMine is to migrate, it "should start sooner rather than later, to minimise the amount of content needing to be migrated."

Also, GitLab has an import tool, which may speed things up considerably.

  • Github is bigger/perhaps more stable. I think the longevity of github is probably a little higher than GitLab

This is highly speculative :)

But if the demise of the company behind either service is a concern for you, I would note that as the GitLab software is FOSS and can be self-hosted:

OTOH, if GitHub, Inc., were to fail, then migrating to a functionally different piece of repository-hosting software would almost certainly be necessary.

In my opinion right now I don't personally think the advantages of moving outweigh the cost. I don't believe that from the comparison table above we would derive much advantage [...]

Understood. I respectfully disagree, but it's useful to have your viewpoint.

  • Supports nested groups: We can use submodules on github I believe; this isn't represented in the UI but this is part way there. I'm not sure we'd benefit from it in the UI

The UI is where much of the value would come.

ContentMine currently has 70+ repositories on GitHub. Speaking as a new contributor, this is a rather overwhelming number of repositories to wade through in order to figure out what each of them is for. Organising the repository list in some easily-comprehensible way (e.g. by grouping related repositories together) would go a long way to fixing that.

For example, these repositories relating to obsolete or soon-to-be-obsoleted websites could be grouped together:

In GitLab, this is easy. In GitHub, it seems to be impossible.

IMO, improvements like this would make onboarding new team members substantially easier, and perhaps also improve the rate of drive-by commits.

  • Non-five eyes based country: most of the CM team are based in 5 eyes countries already; I'm not sure this is currently a core issue for our users. Many very large, successful and privacy oriented FOSS projects have code hosting in a 5 eyes country.

This may be the status quo, but it is not ideal and cannot be fixed by inaction.

Digital rights and censorship are core concerns for ContentMine. Mass surveillance is one mechanism by which digital rights are eroded and censorship can be facilitated. Taking business to organisations in less hostile jurisdictions where possible is a reasonable reaction.

  • Move issues between repos: This functionality is provided by zenhub
  • kanban: free from zenhub

ZenHub is proprietary SaaSS, like GitHub, so I would strongly prefer not to use it. I appreciated the significance of the free/proprietary distinction too late to prevent me from signing up to use GitHub years ago, so all I can reasonably do with GitHub is gradually reduce my dependence on it (and the dependence upon it of groups in which I participate). With ZenHub, OTOH, I do not have an account and hope never to need one :)

Also, ZenHub lock-in might be worse than GitHub lock-in. Is there a mechanism that would enable easy migration from ZenHub in the future? As with GitHub, it seems unwise for ContentMine to depend upon tools that tie it to any particular proprietary platform.

Thanks for spotting this. I've updated the relevant row in the table to emphasise arbitrary file attachments.

Other text files can go in gists. I'm not sure what other files we particularly want to attach to comments

ContentMine staff or contractors might want to attach LibreOffice files. Clients might want to attach JARs if they think the JARs have bugs in them. You get the idea :)

I am not a fan of attachments, personally, but in order to maximise the number of ContentMine contributors who can engage with issues, it would be very useful if attachments can be arbitrary files, so that all participants including non-techie people (who might not know about file types) would be able to use them without friction.

  • Licensing: Most (but not all) of the code we have is already BSD-type/Apache licensed.

There might be good reasons to change the license in future.

It doesn't seem (but I might have missed it) that GPL'd projects are leaving github.

And some FOSS developers (e.g. Joey Hess) did indeed leave GitHub out of concern at its latest ToS.

  • HTTPS support for custom urls: I don't think we have any custom urls. github.io pages do have https

No custom URLs at the moment, I acknowledge, but it could be a useful feature to have.

  • Private repos: I'm not sure we currently need these (maybe we will in the future though)

I see two potential use cases:

  1. Internal tools not for public release. (The number of such tools should be minimised, but need for them may arise.)
  2. Discussing strategic issues in private repositories might be a good alternative to Slack (which is also proprietary, and which I do not use and would prefer not to have to use). Factor in GitLab's Kanban boards and IIUC this would be much more powerful than Slack. N.B. Shuttleworth is considering migrating away from Slack :)

aral commented 7 years ago

Since I was mentioned, a quick note: Been running GitLab on https://source.ind.ie for about four years now and it was one of the best decisions we ever made. Being able to group project-related repositories is awesome (e.g., https://source.ind.ie/better) and I even use GitLab as part of Better’s architecture (Better makes extensive use of Git as the data layer – all the iOS/Mac app instances have working copies of the block lists.)

Our instance currently has 542 accounts and 281 projects and runs reliably on an 8GB RAM/60GB HD droplet on Digital Ocean. (If I was setting it up today, I would do so on CloudScale.ch, where we host Better, so as not to have it in the US – when I find the time, I will move it over.) Also, in hindsight, I don’t think we should have opened the instance for signups. Ideally, everyone should run their own instance. Viva decentralisation ;)

If you do decide to go this route, make sure you install the Omnibus package and not from source if you want seamless upgrades. Converting from the latter (a decision I regretfully delegated) to the former was a major headache for us. Nothing beats a sudo apt-get update && sudo apt-get upgrade for regular maintenance – especially when you’re the sysadmin also. (And, needless to say, please run it with TLS. Let’s Encrypt is your friend and mine.)

Anyway, this is basically a big 👍 for GitLab from me. Best of luck with your project (it sounds pretty awesome) no matter which route you decide to go :)

tarrow commented 7 years ago

@blahah do you currently have any opinion/input on this?

What do you feel the situation is currently like with page loads on gitlab? Wasn't it painfully slow before? It still feels slow to me but it might be the internet here.

How do you feel about the big bungle early this year with deleted user data? It was good they were so open about it but still really sucky that they had such terrible quality control/infrastructure before. I doubt the same bug will happen again, but it is still a concern that the engineering culture is perhaps overstretched.

I agree that groups would be nice but I'm not sure they're that essential. We should tidy up old repos if we stay on github too.

I'm not convinced that Kanban boards are a replacement for slack (we also talked about replacing slack but I think we may want to consider mattermost or matrix riot or similar). We should use them though

I'm also not convinced that we are more likely to get either good quality drive-by commits or new "proper" contributors by using GitLab. For me this is the main reason not to change. For example the most popular project of all time on GitLab is gitlab's own code. It has around 1/3 the total number of stars of the most popular project this month on github (facebook/prepack). I really think this would reduce our user/dev exposure.

@aral Thanks! This is really great advice. Any reason you decided to host your own and not use the hosted gitlab.com service? It's really good of you to drop in here and comment :)

ghost commented 7 years ago

@tarrow wrote:

Wasn't [GitLab] painfully slow before? It still feels slow to me ...

There was already a row for this in the table above: "Very fast interface" ;)

How do you feel about the big bungle early this year with deleted user data?

Regrettable for sure, but GitHub is also not 100% reliable:

As for how to mitigate against a code hosting service losing data? Mirroring to other hosts (GitHub and/or BitBucket and/or NotABug.org and/or self-hosted, etc) seems reasonable. It is what the Software Freedom Conservancy does, for instance: https://github.com/conservancy/website
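As a rough illustration of the mirroring idea, here is a Python sketch that assembles the Git commands for copying every branch and tag from one host to one or more mirrors. It uses only standard Git options (`clone --mirror`, `push --mirror`, `-C`). The function name `mirror_commands` and the mirror URL are hypothetical, and the sketch only prints the commands; it does not run them.

```python
import shlex


def mirror_commands(source_url, mirror_urls, workdir="repo.git"):
    """Return the git commands that copy all refs from source_url to each mirror.

    The commands are returned as argument lists so a caller can inspect
    them or hand them to subprocess.run() once satisfied.
    """
    commands = [["git", "clone", "--mirror", source_url, workdir]]
    for url in mirror_urls:
        commands.append(["git", "-C", workdir, "push", "--mirror", url])
    return commands


if __name__ == "__main__":
    # Dry run only. The mirror URL below is a placeholder, not a real project.
    plan = mirror_commands(
        "https://github.com/ContentMine/getpapers.git",
        ["https://gitlab.com/example-group/getpapers.git"],
    )
    for cmd in plan:
        print(shlex.join(cmd))
```

This covers the Git history only. Issues, wiki pages and other metadata live outside the repository and would need whatever export each hosting service offers.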

We should tidy up old repos if we stay on github too.

:+1:

I agree that groups would be nice but I'm not sure they're that essential.

None of this stuff is essential ;-) Git is a DVCS designed to work over email. ContentMine development could happen over Git without a code hosting service at all. But if ContentMine wants the conveniences afforded by code hosting services (and rightly so), why not pick the one with the greater conveniences? :)

I'm not convinced that Kanban boards are a replacement for slack (we also talked about replacing slack but I think we may want to consider mattermost or matrix riot or similar). We should use them though

Cesar and @petermr and I would all like a reasonably centralised way to collaboratively and asynchronously track projects and issues, that does not rely on proprietary software. GitLab honestly seems like a reasonable fit for this: more so than GitHub or Slack or a combination of the two.

As for Mattermost: I see the advantages it has over Slack (free vs proprietary; more secure; etc), but I do not see that it has any obvious advantages over GitLab for this use-case. (If you think I have misunderstood, please say!)

I'm also not convinced [that the GitLab community is comparable to GitHub]. For me this is the main reason not to change.

I understand this misgiving. However, it seems to me unlikely to have any impact on ContentMine in the short term because the overwhelming majority of ContentMine commits are from people already involved, and issues like #1 and #4 make it hard for others to contribute. (Please correct me if I have misunderstood.)

In the medium to long term, I wager that ContentMine's ability to generate an active community of contributors will depend more upon how well ContentMine handles other aspects of what the Rust folk call "community automation" than it will upon which code hosting service ContentMine uses :)

tarrow commented 7 years ago

So, as I mentioned before I can clearly see there are advantages to GitLab. There are also disadvantages to moving and indeed some things that are worse about GitLab than GitHub.

To me the features just feel far too marginal right now to go through the upheaval of moving, losing our existing place in the social system and going somewhere far less popular.

I think we need to get the opinions of other stakeholders, e.g. @skasberger @chreman @chartgerink. It would also be good to hear from our fellows and people who currently use software that's hosted on our github (even if they don't contribute patches).

markmacgillivray commented 7 years ago

Decisions were already made at the start of the project on which services to use, and why, based on the requirements of the project. Why is this issue being raised now? Have project requirements changed, or has github service changed, in a way sufficient to justify the project making the effort to move?

Deciding what to do for the project should be based on project requirements, not ideology.

Unless project requirements have changed to include greater emphasis on these ideological points, this issue seems moot. If project ideology is changing in a way that affects requirements, that should probably be discussed as a team - it won't be possible to answer this question if people are coming at it with different understandings of what the project needs.

blahah commented 7 years ago

I use both Github and GitLab (a hosted instance of CE rather than gitlab.com) daily. Whether to move really depends on the goals of the organisation's presence on web social coding sites.

I think the comparison list above is somewhat arbitrary - these are just random features that don't map to the goals of the project. To address a few though:

If the idea is to raise awareness and build community around the tools and project, Github is clearly the winner. Most programmers in open source use Github. Most open source projects work on Github. I am involved in hundreds of projects on Github and use it many times a day because it's technically excellent and the community is here. I use one Gitlab instance because I have to, but if I wasn't being paid to do so, would never use any Gitlab instance, and don't use Gitlab.com. If ContentMine moves to Gitlab, I am confident having discussed it with many people and projects that the broader community will not follow any time soon. There will just be fewer contributors. Personally, unless it happens to be the same instance I already get paid to use, I won't add another digital workplace site to the already-too-large list of ones I have to use.

If the idea is to promote openness (as opposed to GNU's version of freedom, which again has never been a core part of the project's philosophy that I know of) in the entire ecosystem, even at the cost of success of this project, then Gitlab CE is the best option, hosted by the project or another aligned project (https://gitlab.coko.foundation might be a good choice - run by another Shuttleworth fellow-linked project). Using gitlab.com is ethically little different to using Github in my opinion - there's no way to tell that they are using the same source code on the server that they release, and the Enterprise Edition has many closed-source features.

If the idea is to provide a productive environment in which the members of the project can collaborate, it comes down to user preference, but personally I find the Gitlab user experience vastly inferior to GitHub. Basic daily tasks require many navigation clicks to reach (for example, viewing the output of a CI job). The interface itself was originally cloned from Github and every step it's taken away from that has got more convoluted and less productive. Integration with external services and tools is negligible by comparison, project and community discovery are bizarrely difficult, and it has many small and some large bugs that sometimes disrupt work.

Overall, I'm against ContentMine moving to Gitlab. Perhaps it will get to the stage where it's generally competitive with Github, but at the moment I don't think it is even close. I actually think it's more likely that Github will become open source than Gitlab will draw the community significantly away from Github. Most importantly, specifically for this project, I don't think it serves the needs or interests of the project to migrate.

chreman commented 7 years ago

I recognize that according to the initial table the choice of code hosting service may look decided, but the following discussion added nuances and context. Switching the service does not solve the problems the project needs to be solved, which are those of long term software strategy, getting contributors in and keeping them. If GitLab continues to grow/be stable, then a switch at a later time will not be a problem, given that convenient import functions already exist. Until then we should focus our resources on the critical issues, which are social and organizational, and only secondarily technological. Any additional consideration/side project we have to make during that time unnecessarily adds complexity and takes scarce brain/time resources away.

addshore commented 7 years ago

If ContentMine moved away from Github it would essentially fall out of my day to day workflow / visual space. Also, from experience, you're going to spend a lot of time talking about where to host your code, when your code is already hosted somewhere.

If ContentMine did migrate from GitHub to GitLab, then:

If ContentMine did NOT migrate from GitHub to GitLab, then:

skasberger commented 7 years ago

Totally agree with @blahah.

ghost commented 7 years ago

@markmacgillivray wrote:

Decisions were already made at the start of the project on which services to use, and why, based on the requirements of the project.

Thanks. Please can you point me to a record of the process by which that decision was made?

Deciding what to do for the project should be based on project requirements, not ideology.

To be clear: I am speaking for myself here and throughout, not for anybody else.

ContentMine is a highly ideological project.

A key problem that ContentMine is trying to solve is that access to knowledge is mediated by companies positioned as gatekeepers at bottlenecks in the information flow. This puts knowledge seekers at the mercy of those companies: the companies can block or otherwise hinder knowledge seekers.

Academic publishers sometimes do this, e.g. Elsevier massively impeded @chartgerink's academic research.

Code hosting services also sometimes do this, e.g. sourceforge.net bars users from Cuba, Iran, North Korea, Sudan and Syria; GitHub.com provides a degraded service to users who disable proprietary JavaScript, and provides only limited data portability.

Moreover, a company that behaves benignly in any given way today might not do so tomorrow. Not long ago, SourceForge, which was once reputable, suddenly hijacked several high-profile projects, locked out their developers, and trojan-ed their installers. Read these accounts from nmap, VLC, and GIMP developers.

ContentMine is a free software project, and free software projects need free tools.

As long as access to ContentMine is mediated by a proprietary gatekeeper - and make no mistake, GitHub is a gatekeeper - ContentMine is not, IMO, 100% succeeding in its aims. (GitLab would also be a gatekeeper, but one with much closer ideological alignment with ContentMine, much lower scope for lock-in or for government-mandated censorship, and a much richer toolset. So, as much a partner as a gatekeeper, unlike GitHub.)

Why is this issue being raised now?

Have project requirements changed, or has github service changed, in a way sufficient to justify the project making the effort to move?

Not as far as I am aware, but IMO there are other, more than sufficient justifications. See the discussion above.

If project ideology is changing in a way that affects requirements, that should probably be discussed as a team

I doubt it will be possible to get every ContentMine stakeholder into any single meeting or discussion thread, but the discussion is happening, as you can see here :)

markmacgillivray commented 7 years ago

As to a record of the discussion in which github was chosen, I do not know if anyone recorded the meeting, but here we are on github.

I am aware of the ideology of the project, and of that of the founder, as to my knowledge I was the first person the founder asked to join the project, and have previously worked with him on many others. I am also aware of the events and issues you have linked to throughout these discussions, and to the structure of the foundation, being one of the founding directors.

However, you are not getting my point. It does not matter how many arguments for or against free software, or the different meanings of the word free, there are, nor how many events or issues are referred to - it is an ideological issue, not a requirements issue. This issue may have become the discussion about the ideology of the project after I highlighted that point, but presenting this issue as "Choose code hosting service" if the actual meaning was "Discuss project ideology" is misleading and/or disingenuous, depending on initial intent. Knowing the ideology of the project at the outset, and understanding that there are different ways to realise an ideology, I am saying that it does not appear to be the case that the project team is aligned on the matter of ideology.

I, and many others, disagree on what it means for software (or anything) to be free, and on what is required to reliably maintain free software. I also disagree that companies can or cannot inherently be trusted to provide a reliable service, and certainly disagree that companies cannot be as efficient or reliable as any non-company group (some are, some are not). Just because you call a group a different thing, it does not stop that group being at risk of any of the same naughty behaviours you have listed about companies, and I am sure you are also aware of exemplar issues that have befallen many open source projects in the past, regarding their ideologies and their changing commercial or non-commercial (whether inadvertent or not) natures.

So the point still remains - at any time that the service we are currently using suddenly turns against us, there is nothing stopping us taking all the code and supporting info that is currently here and moving it to an alternate provider AT the time such an event occurs. To do so beforehand clearly requires considerable effort - if not technically, then socially, as this issue right now demonstrates.

None of the arguments so far re. hijacking, community, security, etc, convince me that github is any worse than gitlab or an alternative - any of them could be subject to problems down the line, and as long as the project can move at some time, then the issue should be dealt with when it arises. Also, it is still not clear why our users could be expected to have the concerns supposedly solved by a move away from github at this time.

In my understanding, the purpose of the contentmine project was to prove that the change in copyright law in the UK at the time the project commenced was of great value to researchers. Therefore, the point of the project was to show that doing things IN jurisdictions such as the UK, using commonly available tools, and within current law, is both possible and defensible. The fact that it may still not be safe to do so is exactly the sort of thing that this project should be highlighting, and the solution is not to remove ourselves to places where it IS safe. What this project does, by definition, should be legal and defensible for users to be seen doing. If it is not, and if this project has to rely on technical/jurisdictional avoidance strategies, then I would say this project is failing.

Believing one KIND of thing is inherently better than another kind is just another form of fundamentalism. The very hard task facing any project hoping to change the world towards some proclaimed ideal is that in doing so we cannot just remove ourselves to the pristine visions we seek to exalt. To change the world to your liking, you have to do things in the places where the unbelievers live. You have to do it in public. You even have to be willing to do it publicly, at the cost of anonymity. Nobody else will change if they can't see anyone who believes hard enough to be seen doing it. Anonymity and security at the cost of exposure (which is by definition not anonymous and less secure) will kill a project that seeks to achieve ideological goals, and the void will be filled by those that claim to provide all of the convenience of heaven without the inconvenience of self-sacrifice.

Having said all this, my reasons for being less involved in this project do relate to inconsistency in decisions about the project direction, and I have moved on to other work. I have provided some input here and to recent issues as it seemed that some experience from the before time may have been of value, but I should probably no longer have a say in such decisions, and do not intend to muddy the waters further.

ghost commented 7 years ago

@markmacgillivray wrote:

As to a record of the discussion in which github was chosen, I do not know if anyone recorded the meeting, but here we are on github.

Fair enough :)

If you (or anyone else reading this) do at some point find a record of how the decision to use GitHub was reached within ContentMine, I would be very grateful if you could post a comment linking to it. I realise that ContentMine is not Debian, but I have point 5 of Debian's Code of Conduct in mind as I write this:

Most ways of communication used within Debian allow for public and private communication. As per paragraph three of the social contract, you should preferably use public methods of communication for Debian-related messages, unless posting something sensitive.

This applies to messages for help or Debian-related support, too; not only is a public support request much more likely to result in an answer to your question, it also makes sure that any inadvertent mistakes made by people answering your question will be more easily detected and corrected.

I.e. I have long wondered why ContentMine was using proprietary code hosting services, and I doubt I am alone. To answer that question, it would be good to have a public record. If there isn't one already, then this thread will serve as one :)

I am aware of the ideology of the project, and of that of the founder, as to my knowledge I was the first person the founder asked to join the project, and have previously worked with him on many others. I am also aware of the events and issues you have linked to throughout these discussions, and to the structure of the foundation, being one of the founding directors.

That's awesome. Thanks for your big role in making ContentMine happen! :)

However, you are not getting my point [...] it is an ideological issue, not a requirements issue.

Can we at least agree that choosing a code hosting service (or any other piece of project infrastructure) should involve:

  1. figuring out and then stating the requirements (be they practical, ideological, or other),
  2. evaluating the available options against those requirements, and
  3. choosing the option that best meets those requirements?

As I could see no evidence that this had been done in relation to ContentMine's code hosting, and because the option currently being used seems to me to be less than optimal from practical and ideological standpoints, and because there were discussions happening offline about the topic, it seemed to me to be worth filing this bug, so that steps 1-3 could happen in the open and on record.

I really intended this to be a constructive step. I am sorry if you feel it has not been.

This issue may have become the discussion about the ideology of the project after I highlighted that point, but presenting this issue as "Choose code hosting service" if the actual meaning was "Discuss project ideology" is misleading and/or disingenuous, depending on initial intent.

:(

it does not appear to be the case that the project team is aligned on the matter of ideology.

As it would be with just about any group of people, I am sure that is indeed true of ContentMine :)

I, and many others, disagree on what it means for software (or anything) to be free, and on what is required to reliably maintain free software.

Absolutely :)

I also disagree that companies can or cannot inherently be trusted to provide a reliable service [...]

I am not sure who you are disagreeing with about this :) If you are referring to my having said, "A key problem that ContentMine is trying to solve is that access to knowledge is mediated by companies positioned as gatekeepers at bottlenecks in the information flow," the problem isn't necessarily that they are companies, but rather that they are positioned as gatekeepers at bottlenecks in the information flow.

Just because you call a group a different thing, it does not stop that group being at risk of any of the same naughty behaviours you have listed about companies

Totally agree :)

at any time that the service we are currently using suddenly turns against us, there is nothing stopping us taking all the code and supporting info that is currently here and moving it to an alternate provider AT the time such an event occurs. To do so beforehand clearly requires considerable effort - if not technically, then socially, as this issue right now demonstrates.

Migrating Git repositories out of GitHub is easy. Migrating issue discussions and other metadata (webhooks, milestones, etc) is not so easy, and GitHub has the power to make it extremely costly, if they want to.

At the moment, the number of repositories is high, the amount of other data is small, and the barriers to migration are still low. If ContentMine is to move away from GitHub with minimal effort, surely better to do it while those conditions obtain.

Those conditions will not obtain for long: there are new people likely to be working on the codebase, new issues likely to be filed, new projects to be managed, etc. As I put it above, "If ContentMine does decide to move [...] it should start sooner rather than later, to minimise the amount of content needing to be migrated."

I guess this extends my answer to your question above: "Why is this issue being raised now?" :)

None of the arguments so far re. hijacking, community, security, etc, convince me that github is any worse than gitlab or an alternative

A future migration path away from GitLab.com is much clearer than from GitHub:

  1. Regularly back up the ContentMine data hosted at GitLab.com.
  2. If GitLab.com fails, copy the data into a GitLab instance that is hosted elsewhere (e.g. https://gitlab.coko.foundation as suggested by @blahah) or that is self-hosted as advocated by @aral.

There would be no new UI for the team to learn; no incompatibility between the data structures; fewer broken links; minimal lost data; etc.
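As a rough illustration of the two steps above, the sketch below builds the requests for GitLab's project export and import API endpoints (`POST /projects/:id/export`, `GET /projects/:id/export/download`, `POST /projects/import`). The project path `contentmine/meta` and the helper names are assumptions made for illustration only, and nothing here is actually sent over the network.

```python
from urllib.parse import quote

API = "/api/v4"


def export_requests(host, project_path):
    """Return (method, url) pairs to schedule, poll and download a project export."""
    pid = quote(project_path, safe="")  # URL-encoded "namespace/project"
    base = f"https://{host}{API}/projects/{pid}/export"
    return [("POST", base), ("GET", base), ("GET", base + "/download")]


def import_request(host, namespace, name):
    """Return (method, url, form_fields) for importing an export archive elsewhere.

    The archive itself would be uploaded as the multipart field "file".
    """
    url = f"https://{host}{API}/projects/import"
    return ("POST", url, {"namespace": namespace, "path": name})


if __name__ == "__main__":
    # Hypothetical project path; the second host is the one suggested in this thread.
    for method, url in export_requests("gitlab.com", "contentmine/meta"):
        print(method, url)
    print(import_request("gitlab.coko.foundation", "contentmine", "meta"))
```

A real backup job would add a `PRIVATE-TOKEN` header, wait for the export to finish before downloading, and store the archive somewhere outside GitLab.com.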

In my understanding, the purpose of the contentmine project was to prove that the change in copyright law in the UK at the time the project commenced was of great value to researchers. Therefore, the point of the project was to show that doing things IN jurisdictions such as the UK, using commonly available tools, and within current law, is both possible and defensible. The fact that it may still not be safe to do so is exactly the sort of thing that this project should be highlighting, and the solution is not to remove ourselves to places where it IS safe. What this project does, by definition, should be legal and defensible for users to be seen doing. If it is not, and if this project has to rely on technical/jurisdictional avoidance strategies, then I would say this project is failing.

That is an interesting argument. I guess you mean this change in copyright law. I will sleep on this :)

Believing one KIND of thing is inherently better than another kind is just another form of fundamentalism.

I don't really see how that is relevant to the discussion. I can't see anyone here claiming that, for example, free software (or anything else) is inherently better than any other thing.

I note, and agree with, @makoshark's essay When Free Software Isn't Better.

The very hard task facing any project hoping to change the world towards some proclaimed ideal is that in doing so we cannot just remove ourselves to the pristine visions we seek to exalt. To change the world to your liking, you have to do things in the places where the unbelievers live.

Maybe, but that does not mean that:

You have to do it in public.

This part, we agree on. Cf. Debian :)

You even have to be willing to do it publicly, at the cost of anonymity.

Some participants might need to be onymous, but they certainly do not all need to be. The Tor project and other successful free software projects have anonymous contributors who are of great value.

The point is: your allies should not be forced to choose between (a) maintaining privacy or (b) contributing to the project. Especially if they live in a jurisdiction where contributing would put them at great risk, even if their actions are inadvertent.

Having said all this, my reasons for being less involved in this project do relate to inconsistency in decisions about the project direction, and I have moved on to other work. I have provided some input here and to recent issues as it seemed that some experience from the before time may have been of value, but I should probably no longer have a say in such decisions, and do not intend to muddy the waters further.

I am very grateful for the thoughtful feedback and helpful context you and others have given here and on other issues I have filed. Speaking for myself (as in this and all other threads), I am also very grateful for everything you have done that helped to get ContentMine off the ground and make it viable. I really hope that you will be able to contribute more in the future as, when and how you would like to, and that our paths will cross: it would be good to meet you (and everyone in this thread!). Thank you again :)

blahah commented 7 years ago

Just a couple of points:

petermr commented 7 years ago

Thanks to everyone for the discussion - please keep it going. I listen carefully and constructively to everything said. I use "I" but I am sure that Jenny's views are similar.

My immediate comments are:

Openness and Freedom have different interpretations. SF require code under a FOSS licence. I took the view early in the project that our code would serve its purpose best if it could be used by commercial organizations. I therefore chose Apache2, and although that was not the mainstream in SF I argued the case and Karien and others were happy. If there is the likelihood that our code might be used against us and the community (e.g. by being copyrighted or patented without our permission) we might have to change licences.

ContentMine is a social purpose / public interest company - our mission (which we are actively discussing for our marketing and publicity) is threefold:

These three pillars interact.

My own views are that within the area of scientific knowledge we remain true to a quasi-absolute vision. It maps onto RMS's 4 freedoms, the Open Knowledge's Open Definition, (which I have been actively engaged in), the Panton Principles of Open Scientific data, the Hague Declaration and others. The definition of Open is not really negotiable. I promise you that I will not sell out to publishers and the open lock is a public obstacle to ensuring this.

The company's main concern is practice within the scholarly publishing industry and academic research. There are a number of companies that CM will "never" do business with but also, hopefully, an increasing number that we will. CM supports the Open movement but treats other domains more pragmatically. I personally will not work with the armaments and tobacco industry but most others are potential customers or clients. The main criterion is whether the company actively promotes enclosure of commons, restriction of human and personal rights, deceitful practice, etc. And increasingly misuse of the Internet vision, such as monitoring, falsehoods, lobbying etc.

When a company has the trust of the community and then is taken over by a commercial organization with no public governance there is reason for concern. I warned the world about Mendeley, was taken by surprise by SSRN, have challenged the lack of governance in Figshare, etc. But Slack seems to me a company which has grown (perhaps unexpectedly) by natural growth and has a useful product. If there is evidence it is misusing its product (surveillance, selection, etc.) we need to know. If it was a monopoly there might be a problem.

Generally we have bigger concerns. Any monopoly is potentially a problem, any lobbyist, anyone misusing our community. For example I am more worried about Google, though I admit to using it.

Happy to take reaction. I am not a BDFL - I am more a facilitator of a community of the future, since I am many years physically older than you and the future is young people.

P.

Briefly commenting on Mark's observation that he and the project had drifted apart - this is not about principles but about different visions of which particular activities we should prioritise. In fact in 2016 the decision was, to a considerable extent, taken away from us because we cannot use the Hargreaves law unless Universities positively support it - and we have not found ones in the UK that do. That means that we cannot follow our original vision of mining facts from the whole of science - we have to find selected areas which are important and wanted - which is our mainstream. But that does not mean we are not continuing to advocate and challenge at every stage - just that tactics vary.

mjw99 commented 7 years ago

I do not participate in this code base anymore, but was pinged due to a cross reference mention in another ticket.

I disagree with the proposed migration from GitHub; I find all the arguments against migration, which have been presented in this thread, cogent and compelling.

In addition, you are proposing a migration of a code base to another location. Given that it has not been migrated over completely to its current location, this seems illogical and will generate more problems than it solves. Before even proposing and debating a second move, you should actually migrate to GitHub properly, canonicalise these repositories and then retire any old repositories.

You underestimate the disruption a migration causes to the developers in a project. Focus on countering code regression; nothing quite scares prospective developers away as code that used to compile.

[edit spelling]

petermr commented 7 years ago

I am closing this discussion. There have not been and are not plans to migrate any of the current repos either from Github or from Bitbucket. We will continue to refactor the ami stack and intend to preserve the APIs. We are currently planning a versioning strategy that separates prototyping, development and release.

Thank you all for very useful discussion

ghost commented 7 years ago

Edit: @petermr , your comment above was not showing when I posted this one, sorry.

Thanks, all, for your useful feedback.

I owe a mea culpa: I could have presented this issue better. I had not expected it to be so contentious.

Having slept on the matter and having discussed it offline with @petermr, @tarrow, and @jkbcm (thank you!), I better understand why the issue, and the way in which I raised it, resulted in so much opposition.

IMO, my missteps included (but were probably not limited to) the following:

I regret those aspects of the discussion, and for your patience with me, I thank you.

I hope you will agree with me that it was good to have a public, open discussion about this. When an ostensibly open project makes major infrastructure decisions offline or without public documentation, it fails in that aspect of its open mission. This discussion has remedied that, in this particular case.

I do still think that the project has practical needs (and indeed ideological goals) that are suboptimally met by the combination of tools it currently uses, and that using more appropriate tools would help the project better address:

Moving forward

Concluding this phase of discussion

The majority preference appears to be to use GitHub for ContentMine's code hosting, to use Slack or something like it for pseudo-private discussions and document hosting, and to use something else again for other document hosting. I will therefore close this issue. Please re-open it if you think that set-up is suboptimal.

IMO, this is a suboptimal outcome, both practically and ideologically. Hence, to a large extent, my regrets above.

Personally, I do not have a Slack account and would still strongly prefer not to have one. I hope that using it will be something I can avoid. Signing up to new, proprietary services is a red line for me: I will only cross it if there is no other way to keep a roof over my head.

Addressing unresolved concerns

So that the concerns in the bulleted list above can be better addressed, I hope to spend a small amount of time each week:

Think of these criteria as a sort of test suite for the tools. If any of the current tools comes to feel like it is failing an unacceptable number of criteria (i.e. failing to keep the project efficient, effective, and open), then ContentMine will have a clear sense of the alternatives available, and hopefully a clear migration path.

If I get to the point where I am ready to have feedback on that document, I will ask for it. I have no plans to implement any of its conclusions without seeking adequate feedback.

Loose ends

I mentioned above that there have been some misunderstandings. I attempt to remedy them below.

@petermr

Thanks to everyone for the discussion - please keep it going.

:)

Peter's remark notwithstanding, I know everyone in ContentMine is busy, and this issue is nominally resolved now, so please do not feel obliged to reply to me, at least :)

@markmacgillivray

In my understanding, the purpose of the contentmine project was to prove that the change in copyright law in the UK at the time the project commenced was of great value to researchers.

I appreciate that, having been deeply involved in ContentMine from an early stage, you have inside knowledge that I lack.

I have now cloned the contentmine.org repo, to be able to read (without having to execute JavaScript served over a plain HTTP connection) what ContentMine's website says about ContentMine's aims. It does indeed say that, "in 2014 the UK introduced the Hargreaves exception, stipulating that anyone in the UK with the right to read a document also has the right to mine it, overriding any copyright / jurisdictional restrictions. So Peter put together a team of people to take advantage of this, and to develop tools and software that could extract uncopyrightable factual data from research articles and share it without restriction to the world."

I note, though, Peter's caveat above about Hargreaves from 2016. I also note that little or nothing about Hargreaves is mentioned in much of ContentMine's publicity; likewise in Peter's typical descriptions of ContentMine's core goal ("to extract facts from the scientific literature" and thereby to "change the world"); and likewise in Peter's tripartite account of ContentMine's core goal:

  1. advocacy
  2. tools
  3. community

I hope this helps explain why I came to this discussion with a very different understanding of ContentMine's aims than yours.

Therefore, the point of the project was to show that doing things IN jurisdictions such as the UK, using commonly available tools, and within current law, is both possible and defensible.

I am not sure what you mean by "jurisdictions such as the UK". Commonwealth realms? I can see how basing ContentMine in the UK relates to Hargreaves, but I cannot see how this means GitHub (which is not in a Commonwealth realm) is an appropriate or optimal code hosting service for ContentMine.

The fact that it may still not be safe to do so is exactly the sort of thing that this project should be highlighting, and the solution is not to remove ourselves to places where it IS safe. What this project does, by definition, should be legal and defensible for users to be seen doing. If it is not, and if this project has to rely on technical/jurisdictional avoidance strategies, then I would say this project is failing.

I agree that the project's actions should be legal and defensible, but in the real world that is unlikely to be possible in all jurisdictions. Should ContentMine not prioritise its core goal of liberating facts from the scientific literature, even if that means utilising technical/jurisdictional strategies to make this lawful, rather than declaring failure just because of any particular jurisdiction's intransigence?

@blahah

I think the comparison list above is somewhat arbitrary - these are just random features that don't map to the goals of the project.

I picked factors where GitHub and GitLab are clearly different, in ways that have practical ramifications, but I realise that this caused confusion.

I apologise for not doing a better job of communicating why I picked those factors.

ContentMine has never as far as I know subscribed to the definition of freedom espoused by GNU (which is why we don't use the GPL in any of our projects AFAIK). [...] ContentMine has never been a 'Free Software' project, it has always been open source. 'Free Software' is an ideology, and again, many of us don't subscribe to it.

It sounds as though you are conflating two separate things:

  1. The Four Freedoms;
  2. Copyleft.

ContentMine cleaves to the former, but not to the latter. I say this based on pieces such as this, and based on the fact that ContentMine's code and content respectively are published under either free software licenses or free culture licences, but not under copyleft licenses.

I call a spade a spade. When I see a project that develops and releases free software as one of its core activities, I call that a free software project. So, I call ContentMine a free software project. I hope you can accept that this is reasonable of me, even if you prefer different terminology yourself.

As for the term "open source": in the absence of any further clarification, I understand that to mean only that Freedom 1 of the Four Freedoms obtains, saying nothing about the other three freedoms. So, I agree that ContentMine could also be described as an open source project, though this description sells ContentMine short ;)

(N.B. I have long known about, and had respect for, the OKFN and the Open Definition. I have also known about the OSI.)

[Arguments] based on a particular stance on e.g. freedom cannot hold weight unless they are agreed as core community values. [These values] cannot be assumed. [...]

Prior to this discussion, the impression I had gained was that, especially given @petermr's Shuttleworth Fellowship, such values could be assumed to be important, albeit not absolute. Following discussions with @petermr, and his post above, this is still my impression. He told me yesterday that he does take such matters seriously, weighing them against other concerns (practicality, financial cost, etc) on a case-by-case basis.

Any arguments based on the ideology don't hold weight in operational decisions.

This is not my understanding. See above.

I might not hold weight in operational decisions, though :)

I don't see at all how Gitlab replaces Slack - there are no realtime chat features, which is what Slack does (badly, imo).

I think you have answered your own point. The other people in ContentMine who I have observed using Slack are using it to host asynchronous, threaded discussions with occasional attachments. GitLab easily handles asynchronous, threaded discussions with occasional attachments.

We actually moved to Slack specifically because a lot of contributors (e.g. the board of advisors) were unhappy with issues, or any developer/task focused option, for discussion.

Clearly, they can't have evaluated all of the developer/task focused options available. It would be useful to know which one(s) they did evaluate, and what made them unhappy about it/them. Is this documented anywhere?

Mattermost ... Gitter

I don't see the utility in these. Better to keep things under one, flexible roof than to disperse code hosting, discussion and document hosting across multiple tools, making searching impractical and leading to duplication of discussion/effort/etc.

ContentMine doesn't use private repos, and apart from the idea that these could serve as a discussion platform, for which see the previous point, I don't see why it would.

Previous point addressed above; discussion would be a very valid use-case.

Another key reason would be for document review prior to publication. Suppose team member A wants to run an embargoed, plain-text document draft (e.g. a press release, or a blog post, or documentation for a client project that was completed ahead of schedule but not yet paid for, or whatever) past team members B and C in advance of publication, in a way that allows them all to make edits and comments, with a record of who did what. A private GitLab repo would be ideal for that, allowing for line-by-line comments, alternative wordings, diffs, etc, and the convenience of letting the team members work almost completely off-line.

GitLab renders Markdown, Org-mode, etc. just fine. For people who don't use text editors routinely, it allows text editing in the web UI. This eliminates the need for word-processing, keeping RSI low and consistency and productivity high.

You can attach arbitrary files to Github comments by zipping them first. This is such a rare use-case that zipping resolves the issue insofar as it affects the decision I think.

You can. I can. But not everyone can. For universal usability, it is important that arbitrary files should be able to be attached, as they can in Slack AFAIK.

I try the Gitlab kanban board every time there's a new release and it's just miles behind waffle.

Maybe, but it is miles ahead of not using one at all, which seems to be the current situation :)

It is also gratis, libre, and free of lock-in, unlike Waffle.io.

Five Eyes jurisdiction is not a distinguishing feature given that Netherlands is a Nine Eyes member

RELX NV (Elsevier's owner), by far the most likely organisation to attempt legal interference with the project, is headquartered in the Netherlands, so that country might actually be our worst option.

I'm not sure that reasoning holds. Takedown notices, etc, are IMO much more likely to succeed in the USA than in the Netherlands, especially under those countries' current governments and judiciaries, regardless of where the parties involved are based.

Personally, unless it happens to be the same instance I already get paid to use, I won't add another digital workplace site to the already-too-large list of ones I have to use.

You won't start using GitLab.com because it's one site too many for you, but you kind of imply I should start using Slack (or Mattermost, or Gitter) even if my list of digital workplaces is just as large as yours? That seems a bit ... inconsistent ;)

I find the Gitlab user experience vastly inferior to GitHub. Basic daily tasks require many navigation clicks to reach (for example, viewing the output of a CI job).

GitLab.com is equal to or better than GitHub.com in this respect. (I just checked the output of a CI job on GitLab.com.) Maybe you are running an old version of GitLab, or have it configured differently.

There is really no lock-in with GitHub.

That is not a universally-shared perspective :)

The API allows full access to repos, wikis, issues, comments, reviews, etc.

At the moment.

And having access is just the first piece of the puzzle. I don't have the time to write and maintain code for reliably transforming GitHub API output into data structures suitable for GitLab/Gogs/whatever. Nor does anyone else with ContentMine, AFAIK.
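To give a sense of what such transformation code involves, here is a minimal sketch that maps one GitHub issue (as returned by the GitHub REST API) onto the fields accepted by GitLab's issue endpoints. The function name and the sample issue are made up for illustration, and the sketch ignores comments, milestones, assignees and account mapping, which is where most of the real effort would go.

```python
def github_issue_to_gitlab(issue):
    """Map a GitHub issue dict to (create_payload, follow_up_payload) for GitLab.

    create_payload is meant for POST /projects/:id/issues; follow_up_payload,
    if non-empty, for a later PUT on the new issue (e.g. to close it).
    """
    author = issue["user"]["login"]
    body = issue.get("body") or ""  # GitHub returns null for an empty body
    create = {
        "title": issue["title"],
        # This sketch does not map accounts, so the author is noted in the text.
        "description": f"*Originally filed by @{author} on GitHub.*\n\n{body}",
        # GitHub labels are objects; GitLab takes a comma-separated string.
        "labels": ",".join(label["name"] for label in issue.get("labels", [])),
        # GitLab honours created_at only for administrators and project owners.
        "created_at": issue["created_at"],
    }
    follow_up = {"state_event": "close"} if issue["state"] == "closed" else {}
    return create, follow_up


if __name__ == "__main__":
    # Made-up sample data, shaped like a GitHub API issue object.
    sample = {
        "title": "Choose code hosting service",
        "body": None,
        "state": "closed",
        "user": {"login": "ghost"},
        "labels": [{"name": "meta"}, {"name": "infrastructure"}],
        "created_at": "2017-01-01T00:00:00Z",
    }
    print(github_issue_to_gitlab(sample))
```

Even this small mapping has to make lossy choices (authorship, timestamps, state), and it would need maintaining whenever either API changes.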

@joeyh wrote and maintained github-backup, but then he removed everything from GitHub, so I wouldn't count on it staying maintained much longer.

If there is concern that GitHub might disappear, the simplest thing is to setup a regular backup of the GitHub presence we already have (to Gitlab if you like)

The only way I can see to do this is to try to automatically trigger GitLab's import from GitHub feature, which AFAICT is intended only for human-supervised triggering on a one-off basis, not for unsupervised triggering on a repeated basis.

So, backing up from GitHub to GitLab is potentially quite a bit of work. Backing up GitLab is just a matter of running the export command: see next point.

rather than migrate to a new service that offers the exact same backup possibilities.

They are not the exact same. GitHub just has the API, Git, and web interfaces; GitLab has all of those, plus comprehensive import/export functionality.

However, I think that's a waste of time as there is no reason to believe such a disappearance is on the horizon.

Is that the approach to risk management and backups that ContentMine should be taking? ;)

The decision to move to Github from Bitbucket was taken in in-person meetings at the beginning of the project, on the basis that it was technically superior, had free public repos, and the open source community was taking off here in an unprecedented way.

Thanks, this is useful to know.

BTW, BitBucket has always offered free public repos, so that isn't likely to have been a reason to move. The other two features, I grant you :)

Back then there were just a few repos to migrate

@petermr told me yesterday that there are still ~50 repos on BitBucket, most of them of at least that vintage, many of them un-migrated or only partly migrated, and that he is not yet convinced they are worth migrating, even if it causes users confusion.

Was ContentMine continuing to add repositories and code to BitBucket even after "moving" to GitHub?

@addshore

If ContentMine did NOT migrate from GitHub to GitLab, then:

  • It would keep its current community

Nothing prevents existing voluntary contributors from leaving, even if ContentMine keeps GitHub.

  • It would stay visible

With activity on GitLab.com and a mirror on GitHub.com, it would be even more visible than it is now.

  • No time would have to be spent on it

A lot of time needs to be spent on it, either way, and GitHub.com will in the long run be more time-intensive due to having less powerful organisational features and therefore requiring more manual workarounds.

@mjw99

Given that it has not been migrated over completely to its current location, this seems illogical and will generate more problems than it solves. Before even proposing and debating a second move, you should actually migrate to GitHub properly, canonicalise these repositories and then retire any old repositories.

I am sympathetic to the approach you describe: I can see the appeal.

However:

  1. @petermr told me yesterday that he is not keen to move some of the content on BitBucket to GitHub (or indeed anywhere). :s This is evidently at odds with your wishes and @tarrow's. It is an issue that I will address at #5.
  2. the approach you describe is not the only valid, systematic approach, and it likely requires more work than migrating directly into GitLab.com from both GitHub and BitBucket. This is due to the power of Git to gracefully handle multiple remotes, plus GitLab's one-button import ability from both GitHub and BitBucket.
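For the Git side of point 2, a sketch of the multiple-remotes approach might look like the following. It only generates the `git` commands rather than running them; the target URL and the function name are hypothetical, and this route copies Git data only (branches, tags, history), not issues or wikis, which is what GitLab's importer is for.

```python
import shlex


def mirror_commands(source_url, target_url, workdir):
    """Return the git commands that copy every ref from source_url to target_url."""
    return [
        # Bare clone containing all branches, tags and other refs.
        ["git", "clone", "--mirror", source_url, workdir],
        # Register the new host as a second remote alongside "origin".
        ["git", "-C", workdir, "remote", "add", "target", target_url],
        # Push every ref, making the target an exact copy of the source.
        ["git", "-C", workdir, "push", "--mirror", "target"],
    ]


if __name__ == "__main__":
    # Illustrative URLs: the target project does not necessarily exist.
    commands = mirror_commands(
        "https://github.com/ContentMine/meta.git",
        "https://gitlab.com/contentmine/meta.git",
        "meta.git",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

The same function could be pointed at a BitBucket source URL, so both old locations could feed one new location without an intermediate move.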

What will become harder (explaining why I was so keen to reach reasonably swift resolution on this issue), is to apply a simple and systematic approach once the GitHub repositories start to have webhooks, issues, and wikis, etc, in use, as these will likely conflict with any already in place on BitBucket.

You underestimate the disruption a migration causes to the developers in a project.

I hope not. For success, the interfaces should be familiar, all the necessary resources (issues, history, wiki, etc) present in the new location as they were in the old and at least as functional (no broken links; CI working; ...), and the developers undogmatic.

Focus on countering code regression; nothing quite scares prospective developers away as code that used to compile.

@petermr and I agree the regressions are important and must be fixed ASAP.

Even so, two new developers have been added to the project very recently despite code that failed to compile. I think we might have been more hesitant had we known more about the information management issues ;)

@chreman

Switching the service does not solve the problems the project needs to be solved, which are those of long term software strategy,

Several GitLab features would help us handle this better than GitHub, IMO:

getting contributors in and keeping them.

Several GitLab features would help us handle this better than GitHub, IMO:

Also, GitLab.com is a really nice experience, IMO: better than GitHub.com. It is growing fast, and has a vibrant community that, compared to GitHub's community, is readier to report bugs and push patches. It appeals to such people, AFAICT, because it allows them to do that to the hosting service itself. It's basically much more hacker-friendly than GitHub. GitHub feels sterile and corporate by comparison.

If GitLab continues to grow/be stable, then a switch at a later time will not be a problem, given that convenient import functions already exist.

You have just affirmed the consequent ;)

I.e. switching at a later time might prove completely impractical, despite GitLab's best efforts, if GitHub increases its lock-in.

Until then we should focus our resources on the critical issues, which are social and organizational, and only secondary technological.

These are not cleanly separable issues. ContentMine has two new starters, eager to work, but baffled by some of the information management and tool choices ContentMine has made. E.g. to a new contributor, it is in many cases non-obvious where to:

Consolidating the content from Slack, Discourse, GitHub, and Google Drive into GitLab.com:

Any additional consideration/side project we have to make during that time unnecessarily adds complexity and takes scarce brain/time resources away.

ContentMine already exhibits a number of anti-patterns in its information management. Those drain time and energy, in the form of co-ordination overhead, decision fatigue, etc, and will do so increasingly unless addressed.

This is a real concern to at least some of us, and I filed this bug report primarily to address it.