bionode / bionode

Modular and universal bioinformatics

bionode.io

MIT License

313 stars 39 forks source link

Collaboration with BioJS (same module system and package manager) #9

Closed bmpvieira closed 9 years ago

bmpvieira commented 10 years ago

BioJS was initially a registry of browser components for biological visualization. Bionode is more oriented for data manipulation (finding, parsing, analyzing, etc) and is more similar to other Bio* libraries like BioPython, BioRuby, etc. When possible, Bionode modules work client and server side while BioJS worked only on the Browser. Consequently, there was no overlap between the two projects. Now, BioJS no longer wants to be just a repository, and also wants to work server side. This could be an opportunity for both projects to work together to avoid duplicated efforts.

However, there's one major point where we're not agreeing, Bionode uses Node.js CommonJS with Browserify and BioJS team wants to move their modules to AMD. BioJS argues that AMD is the only system that allows for live module loading while others require building. Bionode went for Browserify because it allows using Node.js core features (like Streams) on the browser. Browserify supports live reload with tools like watchify, gulp, beefy, etc.

The BioJS team suggests discussing the following possibilities for integration between both projects:

use AMD on the server e.g. use RequireJS as Node.js module
use a CommonJS bundler for the client load compiled modules in the browser (Browserify, RequireJS,...)
define modules two-way: UMD (universal module definiton) [4] specify them as AMD and CommonJS module (and global browser constants) in parallel e.g. commonjsStrictGlobal.js
Bionode ideas
Stop talking

We hope this issue gets some feedback from the bioinformaticians, Node.js and JavaScript communities.

max-mapper commented 10 years ago

There are lots of tools that do bundling for you, e.g. http://requirebin.com, http://wzrd.in/, http://jspm.io/, https://normalize.github.io/. I'd love to see the bio.js folks make critiques of these, as it sounds like they have strong opinions on what "building" means and I'm not sure what their actual requirements are.

It would be pretty cool if the bio.js modules were written as small modules rather than one monolithic repo and used the npm package format so components could be easily shared between node/browser.

wilzbach commented 10 years ago

Hi thanks for opening this issue :) I am Seb from the BioJS guys and very happy that we are having this discussion.

One of the best examples is that we have at least three different implementations of a FASTA parser - so I really want to modernize the BioJS codebase and introduce modules. My dream is that the Bionode project can also profit from our achievements.

We don't have a strong opinion on "building" - it's more that our current build process is totally annoying (it takes about one minute for bundling) and AMD seemed to be the easiest to use for client development as it requires no installation and therefore is (1) independent from the dev platform, (2) easy to use for starters.

During the last two months I made a lot of experience with CoffeeScript via an AMD loader and my overall summary is that it works most of the time fine, but has some pitfalls (shimming of third-party libs, debugging only works nicely in Chromium, relative module path don't work that smoothly for bundling with CS) - hence my opinion isn't fixed here and heading for an overall standard like the upcoming ES6 modules would be fantastic.

BTW there is another large JavaScript component with bioinformatics context (the JBrowse genome browser) which uses AMD.

I'd love to see the bio.js folks make critiques of these

They are all great projects - I tried to group your suggestions and added points that could be problematic.

IDE in the browser

http://requirebin.com:

what if you want to edit n > 3 files
does it work offline?
how ugly is the debugging here? like in JSFiddle?

CDN bundler

http://wzrd.in/

is it possible to use this as an development setup? (as they bundle existing requireJS modules on-the-fly)

below this line the real discussion (possible alternatives) begins

Reload with watchify, gulp, beefy

we would loose the aim of having a really simple setup

ES 6 module loaders

http://jspm.io/ (universal), https://normalize.github.io/ (minimal), https://github.com/ModuleLoader/es6-module-loader (just ES6 modules), https://github.com/systemjs/systemjs (universal)

I haven't yet found time to play around with the ES 6 modules
If this downcompiling to AMD works fine - that would be an awesome solution :)
Has anyone already experience with any of those loaders (or ES 6 modules in general)?

alanrice commented 10 years ago

Now, BioJS no longer wants to be just a repository, and also wants to work server side.

Is there any discussion on biojs's side about the desire to work server-side? I might have missed it but I didn't see it mentioned from a quick scan of the mailing list or issue tracker. Is this community driven or an organisation decision? Are there particular project goals outlined already?

max-mapper commented 10 years ago

@greenify hi, nice to meet you!

we would loose the aim of having a really simple setup

the steps for an app that uses beefy or watchify are:

clone repo
install node.js
run npm install
run npm start and/or npm test

e.g. here's an example of a start script using beefy: https://github.com/maxogden/dat-editor/blob/master/package.json#L7

note that beefy does not need to be installed globally, npm will use the node from the node_modules directory

to run a test suite you can use npm test, e.g. try the above steps on a module like https://github.com/maxogden/multiplex

browserify-cdn can be run locally, but beefy is a more sane setup when it comes to debugging modules

wilzbach commented 10 years ago

Is there any discussion on biojs's side about the desire to work server-side?

Our goal is to have reusable JavaScript modules (e.g. I/O, REST APIs, algorithms, ..) which would be a waste of effort if one can't use them for server-side JavaScript. However the main priority of BioJS "being a framework for visualizing biological data" did not change, it's more that a few other devs and I want to focus more on the framework part and want to make "BioJS fun to use".

We want to achieve the first big step towards a modular BioJS in the first digital "BioJS hackathon" in the begin of August and (hopefully migrate a lot of the existing components).

for more info: https://biojs.github.io/code/2014/07/04/announcing-hackathon.html https://groups.google.com/forum/#!topic/biojs/h19BgjDGdBo

ariutta commented 10 years ago

Hello,

I am with the wikipathways.org team and have spent much of the last two years immersed in applying JS to the bio domain. My experience leads me to support making libraries dual-use for browser and server whenever practical.

We have been using CommonJS and Browserify, but I've been hearing good things here in the Bay Area about Webpack: http://webpack.github.io/. It supports both CommonJS and AMD. On Jul 19, 2014 8:31 AM, "greenify" notifications@github.com wrote:

Is there any discussion on biojs's side about the desire to work server-side?

Our goal is to have reusable JavaScript modules (e.g. I/O, REST APIs, algorithms, ..) which would be a waste of effort if one can't use them for server-side JavaScript. However the main priority of BioJS "being a framework for visualizing biological data" did not change, it's more that a few other devs and I want to focus more on the framework part and want to make "BioJS fun to use".

We want to achieve the first big step towards a modular BioJS in the first digital "BioJS hackathon" https://biojs.github.io/code/2014/07/04/announcing-hackathon.html in the begin of August and (hopefully migrate a lot of the existing components).

for more info: https://biojs.github.io/code/2014/07/04/announcing-hackathon.html https://groups.google.com/forum/#!topic/biojs/h19BgjDGdBo

— Reply to this email directly or view it on GitHub https://github.com/bionode/bionode/issues/9#issuecomment-49512456.

bmpvieira commented 10 years ago

Hi @ariutta, I actually went for Webpack first since it looked like to most flexible option, but then started switching to Browserify because of the Node.js core support.

ariutta commented 10 years ago

That's our current strategy as well. Having streams both browser- and server-side opens up some exciting possibilities, as used in the Highland library.

For the AMD vs. CommonJS issue, the lodash library handles it like this, and underscore does this. Would following one of these patterns solve this issue?

wilzbach commented 10 years ago

Greetings back to everyone,

@maxogden: You are right - my argument about having an uncomplicated setup is very weak - it takes me about 60 sec to clone and run. Probably I am just having too much contact with "Windows"-persons. From all what I read so far I understand why you picked CommonJS ;-)
AMD: The idea is that everyone who wants to use the BioWeb modules should be able to do so. So yeah (@ariutta) sth. like UMD fragments could be used to wrap the lib for AMD loaders (one can also use UMD to support CJS in AMD modules). However using two different technologies like AMD and CJS in parallel cries for troubles ...
Assume BioJS would also use CommonJS (for the shared modules)

If I read Bionode.js correctly Bionode consists just of a sequence class. I saw that there is a separate repo for parsing FASTA files from @alanrice. So what is the advantage of splitting the code in tiny plugins? At least I am not sure whether it really makes fun to maintain n > 10 node modules.
I didn't know that the draft for the ECMAScript 6 was so far - what is your opinion about the upcoming standard (section 15.1)? Loaders like SystemJS seem to be very promising ... Even tough the ES 6 module spec is very young and it is unknown whether it is final nor how it's future will look like, they call it a "standard" for a reason..

mikolalysenko commented 10 years ago

I'm a little bit late to the game here, but I would like to chime in with my own thoughts on the CommonJS vs AMD debate, and how I see it within the broader context of evolving software ecosystems:

First of all, there just isn't much difference between AMD and CommonJS. AMD has a slightly clunkier syntax and is more cumbersome when you are building for production deployments (since you still need to bundle everything anyway), while CommonJS requires you to use some tooling for live reloads during development. On the whole, these variations amount to bike-shed levels of significance with neither side having enough differences to give it any real leverage.

However, CommonJS has one towering advantage over AMD that so far has not been directly addressed, which is that it is default package format for node.js and npm. npm is the fastest growing software ecosystem in any language and for good reason - it makes it possible to safely use dependencies. Specifically, there are 3 core features in npm which solve this problem:

npm provides a system (semantic versioning) to specify version names and compatibility between different modules, and an interface/module system (CommonJS w/ node_module look up algorithm) which they can use to talk to each other.
Version conflicts in dependencies are handled by recursively installing multiple copies of modules.
The npm registry provides persistent and immutable naming service for modules at specific versions. eg. once you publish mymodule@1.0.0, it is impossible to ever change or modify the contents of that package, so that any users depending on that version can be guaranteed it will always work as in their test cases.

Any package manger which fails to address all 3 of these points fails to ensure the correctness of modules with dependencies:

The first point is a bit trivial, but I bring it up since without some agreeing to some conventions then there is no way to proceed. In a sense, these choices are the most arbitrary ones in npm, but they are a logical necessity for what follows. If you don't take a stand, you will end up with something like bower, which is a "package manager" in the same sense that wget is (ie it downloads stuff for you, and that's about it).
"Recursive dependency copying" is the most widely recognized innovation in npm. If a package manager installs all dependencies within a single flat namespace, then you can't install multiple versions of the same package (the "diamond" dependency problem). For example, if you use modules X and Y in your project, and X depends on Z@1.0.0 and Y depends on Z@2.0.0, where the versions of Z are incompatible, then it is impossible to use both X and Y in your project at the same time. This sort of issue shows up all over the place taking Z=jQuery/Angular/express/whatever framework you like, and X/Y being various flavors of plugins. While this might seem obvious, it is surprising that many "mainstream" package managers like go-get still fail to solve or even address this issue.
Persistent naming is a more subtle issue, but it is necessary to ensure that projects do not fail after remote repositories or services go down. In general, any module system needs to provide some naming service to ensure that previously published modules do not silently change. Even otherwise decent package managers, like component or Rust's cargo, fail to address this problem. As an illustrative example of how inconsistent naming breaks software, consider the ill-fated Kickstarter for Haunts:
- Go screwed over the inde game haunts

In that situation, "not even the original programmer could compile the original code on his original machine" due to dependencies silently and remotely changing. Needless to say the project was not able to deliver on their promises to their backers.

But more broadly, failure to take dependencies and interoperability seriously contributes to a hostile environment that discourages code reuse. Fear of dependencies pushes developers to build closed fiefdoms with layers of redundant and excruciatingly boring functionality. This is the sort of "framework hell" that many in the node community complain about. Working with a system like npm makes it not only possible, but even preferable to split code into as many small dependencies as possible. In my opinion, this makes it more fun to write code using CommonJS and npm, since you can skip all the boring details with simple canned modules and cut right to the most interesting parts of whatever you are trying to do.

wilzbach commented 10 years ago

Wow great final speech :)

even preferable to split code into as many small dependencies as possible.

I was questioning whether it really makes sense to ship 50 lines as a separate package. At least for the BioJS core our idea was to have it as a single package (and maybe if it grows extensively split it into I/O, Algorithms etc.)

Yeah, I do grasp the advantage of tiny packages for better modularity, but

what happens if you want to change your build chain (e.g. different testing, code analysis)?
you loose control over your own core (because it is split into 50 packages, maintained by 20 "dead" people)
how do you generate a global documentation or achieve bundling (if they are in separate repos - well there are git submodules, but then having one repo would be a lot simpler)

dasmoth commented 10 years ago

For what it's worth, Biodalliance agonised over this for a while, and eventually settled on CommonJS, largely because, like @mikolalysenko, I thought it was the technology with most momentum. Also, at the time Browserify seemed like the best bundling option (although I realize that's an area where things are moving quite rapidly).

bmpvieira commented 10 years ago

I prefer having lots of small modules (even if only 50 lines) instead of a bigger core that no one can read. Each module should do one thing right, have their one tests and issues tracker. Looking at the tests should make obvious how the module is used and works. In addition, docs and examples should be provided in the README.

Having many independent modules would allow any other project (even outside biojs/bionode) to pick the modules they need for their specific goal. Some interesting integrations with other JS projects could come from this.

Small modules are easier to maintain, understand and fork if needed. They facilitate more pull requests. We should try to agree on testing tools, but each module maintainer could use whatever they want as long as doing "npm test" just works and the README has badges.

I don't think you loose control. If no one wants to maintain a small module that has issues, then that functionality wasn't so needed after all. I'm more afraid of big modules that really only have one dev that knows what's going on overall and then that dev abandons the project.

You can later on have more higher level modules that require a group of small modules. For example, after we have stable parsers (bionode-fasta, bionode-sam, etc.) we can have a bionode-io that requires all those parsers and does some fancy things like auto detecting file formats. Those parsers could also be reused by a project like transformer We could also have meta modules that simply require groups of modules, for example bionode-phylogenetics would require modules popular for phylogenetics, or bionode-all would require all bionode-* modules. So you could just do "npm install bionode-all" on a university cluster.

I haven't found the right approach for documentation, but if we agree on some common structure it shouldn't be too difficult to concatenate all *js comments or READMEs and build a fancy html with global docs. I would like to have something similar to underscorejs.org, but in that case that's a hand curated html.

bmpvieira commented 10 years ago

If you haven't yet, I think everybody here should see Max's talks about his views of open source, modularity and the NPM ecosystem (among other things).

https://vimeo.com/77376239 https://webrebels.23video.com/the-lebron-stack-its-a-slam-dunk-by-max https://www.youtube.com/watch?v=8gM3xMObEz4

https://twitter.com/maxogden/status/494231617590673408

wilzbach commented 10 years ago

Thanks for sharing Max's talks with us :) I am really happy about this active discussion and I hope you don't mind my critical words from an outside bionode perspective.

I prefer having lots of small modules (even if only 50 lines) instead of a bigger core that no one can read

I love files that have less than 100 lines of code. So I think we have a total consensus on how we should define modules. It was more that I am still voting for packaging multiple modules into one package (and repo).

Consider this very raw structure:

bio.core

global interfaces (for components)
helper functions

bio.io

Fasta, SAM, BAM , Clustal, Newick

bio.models

Sequence, Tree, Feature, ...

bio.algorithms

Tree algorithms

bio.rest

Uniprot
EBI services ...

(this is like any other Bio-X project might look like)

If I get the moduletopia bill of rights (do one thing and do it well) correct, then Max suggests to write a module in "anything that compiles to JS" because that is "easier to maintain and test". However I would be very keen to know how you plan to deal with these problems (if they are separate packages/repos):

no coding standards / styles
various coding languages
various build chains, testing frameworks
none or different documentations
verified / reviewed code, continuous integration
one-time-submitters (this is totally fine because that is how science works)

What would be so problematic about having a package for every of those different categories? People can still edit the 50-line FASTA parser module and send a pull request (or get write-access to the io-repo). However we can guarantee that our framework is reviewed and stable (and fix all the other disadvantages mentioned above). Would that be a comprise that suits all of us?

BTW I watched all the talks from Max and I would like to add some quotes of him: 1) "open source is writing a library that people can send pull requests to" 2) "it is important that at the core module there is bunch of people who have consensus - [the] dissent happens on top of the core module"

Future: interesting points for a discussion

1. Core

I know that you currently have the Sequence class there, but what is really essential to all projects that work with biological data in JavaScript?

2. Package / module architecture

So for example all parsers should behave the same so it definitely makes sense to define an "interface" for all parsers. Grouping modules into categories seems natural to me and helps people to quickly find what they want (or whether it is not there) Do you have any plan about this other than Bionode's short/midterm goals?

3. Common rules

I hold the opinion that a library should force its devs to obey to a minimal set of guidelines (documentation, testing, ...) and if I understand your point of view correctly you share this basic opinion with me.

So there are two points to discuss here a) to which degree there will be rules (and how are we going to check them) b) how those rules/templates should look like

BTW I do like your bionode-template as it gives users a quick start, but I am bit afraid of creating a lot of redundant data (so upgrading/changing stuff could be horrible)

Personal comments

"Looking at the tests should make obvious how the module is used and works"

How are you going to achieve that people will follow the "rules" (Read-Me, tests, build process) we defined?

Small modules are easier to maintain, understand and fork if needed. They facilitate more pull requests.

Maybe I might just missed the npm revolution, but do you know any of the other Bio-X frameworks that uses more than one package?

off-npm dicussion (testing, documentation)

We should try to agree on testing tools, but each module maintainer could use whatever they want as long as doing "npm test" just works and the README has badges.

Interestingly some people in the BioJS community said you do it exactly in this way (promote a default testing framework, but still leave the developer the option to choose his own framework of choice). I hold the opinion that testing frameworks are really similar and it makes maintainability a lot easier (at least for core components).

I haven't found the right approach for documentation

I really fancy the way AngularJS does its documentation. This "edit-in-plunkr" button for the example snippet is awesome!

max-mapper commented 10 years ago

Some feedback based on my own personal experience:

no coding standards / styles
various coding languages
various build chains, testing frameworks`

Those should be up to the author to choose, to encourage a healthy ecosystem and not raise the barrier to entry. Top-down decision making in these areas will only stagnate the community. What matters more is that people in the community emphasize good, clean and simple APIs that make composability easy.

none or different documentations
verified / reviewed code, continuous integration

If a module has no documentation it won't get used. Good module authors use cloud CI tools like travis, appveyor, testling CI and saucelabs, the more you add as a maintainer the more success your module will have. as for verified/reviewed code, I think the entire "github flow" as they call it (collaboration, issues, pull requests) addresses that

one-time-submitters (this is totally fine because that is how science works)

I'm not sure what this means exactly

On multiple modules per repo

Also, it works best when every module is it's own repository. it gets really annoying when you have multiple repositories with NPM because sometimes you need to use a git url to get a specific version of a module, e.g if I wanted to use a personal fork of a module instead of the version on NPM.

"dependencies": {
  "bionode-ncbi": "maxogden/bionode-ncbi#perf-improvement"
}

But you can only have 1 entry point in a repository, which is the top level package.json file.

On rules

Rather than specific rules I'd rather see shared values and shining examples that represent those shared values, e.g. reference modules. There are lots of modules on NPM that help make authoring NPM modules easier, e.g. https://www.npmjs.org/package/testlingify

On scope

From the 'raw structure' above, it sounds like biojs has a lot of surface area. For many of the components, e.g. Fasta, SAM, BAM , Clustal, Newick, those should just be standalone modules that have no dependencies on any other part of bio.js. Then there could be a framework/convenience module that wraps all of the components into a sort of grab-bag.

That way you have the ability to swap out components in the future with better implementations that still conform to the same API. You also give people the option to build their own framework by choosing the individual components they need, or they can simply use the big grab-bag module if they don't want to think about it.

This is how https://github.com/Raynos/mercury, https://github.com/raynos/http-framework and https://github.com/npm-dom/domquery are set up, and I think it works very well.

wilzbach commented 10 years ago

one-time-submitters = people who only contribute once One example is someone who needs to have a parser/visualization for his PhD (or paper) project and after he has achieved his goal you will never hear from him. So basically he just sends you his source code (which again is great). With a shared repo you wouldn't loose his contribution.

If a module has no documentation, it won't get used. Good module authors use cloud CI tools like travis, appveyor, testling CI and saucelabs,

Here is one example why I am so worried: In this fast changing JS-world it could be totally possible that on the next day there is no travis. Who is going to change >200 repos with Duplicated code without ownership? Fun fact: If you would have a package.json etc. for every C++ class in the chromium project, this would sum up to be more than 30k modules for one project (without external dependencies).

Furthermore you say "good module authors". Do you really expect scientist who are inexperienced in JavaScript to know all the tools and tricks? That's why I am voting to make their integration and learn-process as easy as possible and avoid giving them the responsibility to set up CI testing etc. (they won't do it - and still their algorithm might be great).

On multiple modules per repo

The way AMD handles this is that one defines a index.js / main.js file (a simple dict) which wraps all the modules together.

fasta: fasta,
sam: sam,
newick: newick,

I know that this could lead to some unneeded code (you will receive the 50 lines of an FASTA parser even if you are only interested in the Newick parser), but isn't optimization the root of all evil?

I see the advantages of an distributed, uncoordinated project, but at BioJS there will soon be people working full time for a long-term period. So I hold the opinion that it shouldn't be that difficult to review contributions, "emphasize good, clean and simple APIs" and coordinate the development for the essential packages.

rules

not raise the barrier to entry

Sorry to interrupt you there. Isn't the barrier three times higher if every component uses a different language, style or build tools. At least for BioJS we had this motto "If you know how one component works, then you know how all components work". To be honest, I would love to code in CoffeeScript, but I intentionally planned to avoid it for a core lib to lower the entry barrier for others.

max-mapper commented 10 years ago

In this fast changing JS-world it could be totally possible that on the next day there is no travis. Who is going to change >200 repos with Duplicated code without ownership?

Travis came out of the Ruby community :) There are various ownership models that end up being used anyway e.g. github organizations or https://github.com/rvagg/node-levelup#contributing, and doing things like updating 200 repos doesn't take that much time anyway since it's a well defined task that is scriptable. Plus with NPM you can just keep using the old version that works.

Isn't the barrier three times higher if every component uses a different language, style or build tools

I think you are conflating using + authoring. Everything I've been talking about is on the topic of authors. From a users perspective the way a module is tested, written, etc is opaque. They just require it and use it. From an authors perspective they just have to be aware of community conventions and expectations around testing, APIs etc. A great way to ensure these are met is by making test suites into modules e.g. https://github.com/rvagg/abstract-leveldown && https://github.com/maxogden/csv-spectrum.

Another way to ensure interop is to do what ndarray did and have a base module that implements a data structure that gets passed into various higher level modules, but in a way where the higher level modules don't need do directly depend on ndarray. https://github.com/mikolalysenko/ndarray/wiki/ndarray-module-list#core-module

max-mapper commented 10 years ago

To summarize my thoughts, I really think it would be shame for bio.js to establish it's own "island" of culture that isn't interoperable with the thousands of modules on npm today. A lot of what it looks like bio.js wants to work on overlaps with other existing modules anyway. The nice part of the npm/browserify workflow is that things work in both node and web browsers out of the box, and there are browserify modules to support tons of different transforms and other plugin use cases. The main benefit in my eyes would be that you can invest in the existing npm ecosystem without creating a new isolated one.

KyleAMathews commented 10 years ago

I'll offer another opinion (from a random guy on the internet).

First I agree 100% with everything @maxogden said. NPM/CommonJS/Browserify etc. are all really powerful and growing incredibly quickly and it'd be both a loss to the rest of us who might want to use your modules or would benefit from improvements you make to modules you'd start to pull into your work and an even bigger loss if you can't pull in code you'd need.

Second, you already have "guidelines and tutorials to develop new components" on your website. CommonJS modules are really really easy to write. You assign a function to module.exports and... that's it. Even the most inexperienced scientist JS developer can figure out that.

Then you'd want some way of aggregating all the biojs modules which get published. One easy way is to have a standard tag for all these modules. You could then create a website which queries npmjs.org for the tag and presents them all perhaps with additional curation. Or, if the numbers are small enough, manual curation on a wiki is easy enough to keep up with. http://vimawesome.com/ is a cool example of this. And on tests, having tests are a bonus not the minimum bar. It doesn't matter if random grad student includes tests with their code or not. The really popular modules will get tests and become really solid over time and the one-offs will stay one-offs but still be around to serve as occasional inspiration. But the main thing is to just get the code out there in a discoverable fashion. A chaotic sea of modules with high variability of quality is a much more productive ecosystem than a tightly controlled one.

wilzbach commented 10 years ago

I really think it would be shame for bio.js to establish it's own "island" of culture that isn't interoperable

I really DO want to prevent redundant efforts. Science can't afford the silliness of wasting time and resources. After all that's why I started this discussion with the Bionode people. It is more that I have to convince my community to to totally change their way things currently work - so that is why I keep being sceptical. I hope you don't see my points as a critic on your experience or expertise.

See you later

(nice to hear from you random guy on the internet)

yannickwurm commented 10 years ago

Max wrote:

Top-down decision making in these areas will only stagnate the community.

Greenify wrote:

Maybe I might just missed the npm revolution, but do you know any of the other Bio-X frameworks that uses more than one package?

So bioruby is an excellent example of this: there are/were strict & very defendable top-down rules for contributing to the one and only main bioruby package. While this ensured that only high-quality stuff was accepted, it hurt the community because many enthusiastic potential contributors became discouraged (e.g. doing lots of work and then not getting that accepted into the core repository). So the community and the core functionality lost momentum - today the main package even includes methods and modules that are obsolete (e.g. to connect to dead webservices).

About 2 years ago something changed: Pjotr Prins led the creation of a highly modular "do whatever you want" collection of packages - and and easy way of accessing an overview - called biogems. This has been extremely successful: anyone can create and contribute and get visibility and users for whatever the heck they want. Most of the new work (e.g. parsers for new datatypes, specific apps & analysis tools) happens here:

http://biogems.info

This npm-like package fragmentation has dramatically improved the bioruby community - for the better.

Greenify wrote:

Consider this very raw structure: bio.core [...]

To reuse Bruno's words these could be "meta modules that simply require groups of modules, for example bionode-phylogenetics would require modules popular for phylogenetics, or bionode-all would require all bionode-* modules."

dasmoth commented 10 years ago

On Fri, Aug 1, 2014 at 9:00 AM, Yannick Wurm notifications@github.com wrote:

Max wrote: Greenify wrote:

Consider this very raw structure: bio.core [...]

To reuse Bruno's words these could be "meta modules that simply require groups of modules, for example bionode-phylogenetics would require modules popular for phylogenetics, or bionode-all would require all bionode-* modules."

I think there's an interesting point here. While many of us work in departments or organisations with "bio" somewhere in the title, I think it's a kind-of tricky scope for software. Someone who imports "bionode-phylogenomics" is probably far more to import a general graphics library or something "non-bio" than, say, bionode-96wellPlateManagement (I'm sure there's a better name for that...)

So while having an umbrella community for bio-stuff makes a fair bit of sense, loose coupling seems important.

        Thomas.

wilzbach commented 10 years ago

Hi guys,

just to give you an quick overview on our discussion (and our first call):

During the next days (our BioJS core hackathon) we will port our existing parsers etc. to a modular specification
We agreed to build sth. similiar to Biogems

a) search (hierarchy, naming, tags, ..) b) review (verified, test status, code coverage, github stars, downloads, dependent modules)
We consent to this tiny modularity approach (at least testing it out)
We also agree that there is plenty of scope for mutual collaboration and that there will be periodic discussions in terms of trying to integrate BioNode bits in the BioJS code. You are all invited to join our next BioJS community discussion on August 5th (4.00pm BST, 5.00pm CEST, 11.00am EST, 8.00am PST).

Open discussion points:

1) How do you define a bio-core? For me code that should go inside is something like a "GenericParser" which all components can inherit. Why do you have the sequence class as core?

2) How do we manage to bundle the documentation of all plugins into a Biogems website?

3) What are we going to do if someone writes the first FASTA parser and pushes it to npm - so reserves the name and then disappears. How can we avoid having bionode-fasta, bionode-fasta2 and bionode-fasta3?

bmpvieira commented 10 years ago

Thanks for the call and for inviting Bionode to be an active part in BioJS development. I'm looking forward to what we can do together. I think a great outcome from this call was that we agreed on CommonJS/NPM for modularity. This will make everything easier in terms of collaboration and remixing of components.

Open discussion points:

1) How do you define a bio-core? For me code that should go inside is something like a "GenericParser" which all components can inherit. Why do you have the sequence class as core?

I think bio core should have helpers or util functions that are reused by most of the other bio modules. The Sequence methods fit that description, although the reason why I have them in core is more historical than anything else. The first bionode module started as a way to provide those functions client side to the Afra project, while also being available for server side usage. They should probably be moved to a specific bionode-sequence module, but then the bionode module would be empty since I don't have helper functions for now. In that case, the bionode module could become instead of the "core" module the "meta" module that links the other modules together in some kind of framework.

2) How do we manage to bundle the documentation of all plugins into a Biogems website?

I'm a fan of having literate code and using docco. I think it's a good practice to write the comments as docs when you're writing the code, and that way you don't have to figure out where the documentation is. However, this isn't of course an alternative to a global API doc. As I mentioned, some projects do global APIs manually, like underscore, express, nodejs and socket.io. However, something automatic would be better. I'm currently exploring doxx as it seems to be able to generate a single doc for multiple modules. If doxx can solve most of the problem, we can then tweak it to generate an output with some of Biogems features.

3) What are we going to do if someone writes the first FASTA parser and pushes it to npm - so reserves the name and then disappears. How can we avoid having bionode-fasta, bionode-fasta2 and bionode-fasta3?

I think we must accept that to some extent we will always have minor issues with names and people doing whatever they want. Nonetheless, if someone submits bionode-fasta and then doesn't maintain it, I think that in most cases that person will agree to transfer ownership to the bionode organisation. If we can't reach an agreement with the author, or he is just trolling us/reserving the name, then maybe @isaacs will intervene in those rare cases.

max-mapper commented 10 years ago

Sorry I couldn't join the call, I am in a timezone that was inconvenient.

I joined the biojs channel on freenode but nobody else is in there. Some active channels are #dat, #browserify and #stackgl (stackgl).

It might also be nice to have a 'discussions' repo, like what we do with nodeschool: https://github.com/nodeschool/discussions

I have found IRCCloud to be really nice for making IRC accessible, and they have a great mobile app (push notifications for IRC is really nice).

mikolalysenko commented 10 years ago

One suggestion about how to structure a big project like bionode is to maybe consider working backwards from specific problems. Instead of going whole hog and building up a gigantic utility belt module, it might be more effective to start from some common tasks and work from the simplest abstractions you would like to use. Then you can build small separate "utility modules" instead of one gigantic super module pack that tries to speculatively solve a bunch of problems (that have not even been demonstrated to exist yet).

The other main suggestion I would also have is to avoid putting too much logic into the methods of your core data types, since this tends to create tighter coupling between modules and generally causes problems. Ideally it is better to make lower level modules take only simple datatypes or "destructured objects" and not rely on specific functionality embedded in their methods. This is somewhat contrary to the way people often think about doing things in languages like Java, where usually you get large mega utility classes or data structures with hundreds of methods. If you add functionality by creating functions which accept objects, rather than objects which contain functions, it is much easier for you to change something without breaking all the code that depends on your core class (ie it avoids coupling/peer dependencies).

wilzbach commented 10 years ago

I followed Max's advice and opened the open discussion points as separate issues, so that everything get a lot clearer :) Normally you should be able to find me on #biojs - you had bad luck.

bmpvieira commented 10 years ago

Thanks @greenify. :)

I agree with @mikolalysenko's last comment. I'm worried that with too much pre-planning we might start over-engineering and creating objects, data models, schemas, etc., for things that don't need them.

One principle to follow that I have in the bionode-template module is:

KISS and don't abuse objects;

Linking to @timruffles's excellent talk "You probably don't want an object".

I prefer bionode.reverseComplement('AGTC') than new Sequence('AGTC').reverseComplement().

Bionode modules should be independent but able to work together by providing Callbacks, Streams and CLI interfaces.

Callback because that's what most Node.js devs are used to and sometimes you just want to do:

ncbi.search('pubmed', '21282665', function(data){
  // do something with data
})

However, Streams are awesome for large data and to build pipelines:

// Using bionode-ncbi, tool-stream and dat
ncbi.search('genome', 'human')
.pipe(filter)
.pipe(fork)
.pipe(dat.genomes)

fork
.pipe(tool.extractProperty('uid'))
.pipe(ncbi.link('genome', 'pubmed'))
.pipe(tool.extractProperty('destUID'))
.pipe(ncbi.search('pubmed'))
.pipe(dat.papers)

Finally, with CLI and the UNIX philosophy (one tool doing just one thing), we remove the need for users to learn JavaScript and allow other communities like R, Python, Ruby, etc., to reuse our tools. In addition, CLI is great for quickly discovering/processing data interactively in the shell.

bionode-ncbi search gds solenopsis | dat import --json
bionode-ncbi search pubmed mouse | tool-stream extractProperty uid > pmid-list.txt

bmpvieira commented 10 years ago

This is also being discussed on BioJS Technical Google Groups.

wilzbach commented 10 years ago

If you are interested what is happening in Munich, I just published our bits from the BioJS core hackathon (Munich): Day 1. Please note that it is just Day 0 ;-)

bmpvieira commented 10 years ago

Today I participated on BioJS hackathon/project call and this is what a wrote on their Google Docs (@manuelcorpas can I also post the link for the whole thing here?):

Bionode views about BioJS

Bionode purpose is to build components that can be combined into pipelines/workflows for doing bioinformatics. We try to work on the browser but are more focused on the server. BioJS could be great to visualize data generated by Bionode.

We hope that BioJS modules will be on NPM and can be easily integrated in other projects without pulling the whole “framework”/project.
Even if BioJS doesn’t share the same interfaces/purposes as Bionode, that’s not a problem. We can just wrap for example a BioJS parser module with Bionode interface and vice-versa.
The BioJS registry is a good idea for any module related to bio*, from BioJS, Bionode or any other project/author. For example, there’s already JavaScript parsers for bio on NPM outside of both projects. What those modules need if better visibility.
BioJS and Bionode should provide guidelines for contributions that help the vision of each project, but not enforce strict rules that scare contributors. Having modules on NPM should be the only things we really need to encourage.
More discussion on GitHub, less on mailing list. Make the discussions more visible, some people can’t join calls or find the gdocs.
Example of how to recognize contributions: https://github.com/bionode/bionode/blob/master/contributors.md http://biogems.info (by field) Contributors.md can also have people that didn’t contribute code. In the same way that the biojs registry will label modules with testing/docs, it could fetch contributors from this file and display it. That info can then be indexed in a page that lists all contributors and each contributor can have a page with a list of modules contributed. Something like: https://www.npmjs.org/~bmpvieira (npm only shows the person that published, BioJS registry should show all contributors).

wilzbach commented 10 years ago

can I also post the link for the whole thing here?

Sure the notes for this call are open on the web. They might are unstructured, so the bits I published yesterday could serve better for a summary of the current status.

Even if BioJS doesn’t share the same interfaces/purposes as Bionode

Bionode: Bionode purpose is to build components that can be combined into pipelines/workflows for doing bioinformatics BioJs: reusuable JS components to represent biological data

I think we share the same higher-level goal: "create JS lib(s) for biological data".

Contributors.md can also have people that didn’t contribute code. In the same way that the biojs registry will label modules with testing/docs, it could fetch contributors from this file and display it.

Cool idea. I am bit worried that our toml starts to get messy. In my humbled opinion a attribution section in your ReadMe file totally fits this purpose and is way more flexible.

manuelcorpas commented 10 years ago

As Seb said you are welcome to share this doc

It was great to have you in our call Bruno. Thanks a lot for your contributions.

Manny

Sent from my iPhone

On Aug 6, 2014, at 12:06 AM, Seb notifications@github.com wrote:

can I also post the link for the whole thing here?

Sure the notes for this call are open on the web. They might are unstructured, so the bits I published yesterday could serve better for a summary of the current status.

Even if BioJS doesn’t share the same interfaces/purposes as Bionode

Bionode: Bionode purpose is to build components that can be combined into pipelines/workflows for doing bioinformatics BioJs: reusuable JS components to represent biological data

I think we share the same higher-level goal: "create JS lib(s) for biological data".

Contributors.md can also have people that didn’t contribute code. In the same way that the biojs registry will label modules with testing/docs, it could fetch contributors from this file and display it.

Cool idea. I am bit worried that our toml starts to get messy. In my humbled opinion a attribution section in your ReadMe file totally fits this purpose and is way more flexible.

— Reply to this email directly or view it on GitHub.

bmpvieira commented 9 years ago

Since:

BioJS is using NPM;
We have been collaborating;
There hasn't been a recent comment here.

I'm closing this thread.

General discussion can continue at http://gitter.im/bionode/bionode and http://gitter.im/biojs/biojs.