heuermh closed this 6 years ago
-0 to dist-bio, only because it is similar to my own project dsh-bio
Here is a rabbit hole I started going down looking for inspiration for names:
https://en.wikipedia.org/wiki/Distributed_computing
https://en.wikipedia.org/wiki/Massively_parallel
https://en.wikipedia.org/wiki/Embarrassingly_parallel
https://en.wikipedia.org/wiki/Amdahl%27s_law#Parallelization
https://en.wikipedia.org/wiki/Parallel_(geometry)
https://en.wikipedia.org/wiki/Posidonius
https://en.wikipedia.org/wiki/Parallel_postulate
http://sites.math.rutgers.edu/~cherlin/History/Papers2000/eder.html
The word cloud from the ADAM docs is rather boring
Nice word cloud!
I'm biased, but I like "squark" most at the moment.
I agree with @tomwhite - the Spark variants (squark/speeq) sound good.
Squark has grown on me. 👍 to it.
I think Squark is too close to Sqoop, which is a trademarked Apache project already in the Hadoop/Spark ecosystem.
I also think it fails the "Names derived from “Spark”, such as “sparkly”, are also not allowed" guideline.
And nothing about it says biology or medicine or genomics to me. Anyone have a favorite biologist? Perhaps this list may inspire.
I personally like the "Consider using functional names" guideline. What is the one-line description for this project? Genomics at scale. Parallel genomics. Distributed genomics.
A library for manipulating bioinformatics sequencing formats in Apache Spark.
As a general rule, any code that does not have a Spark or Hadoop dependency, or does not have a "distributed" flavor belongs in htsjdk.
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark and Parquet.
GATK4 aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark.
Hmm. It is very close to Spark, so you're probably right that it's a violation.
Some bad / terrible alternatives:
frankenstein: named after a famous biologist who also used spark
rddr: RDDer, but dropping the e is all the rage these days
setter: for when we eventually change to using datasets, has a good dog as a logo option
panspermia: what biology is more distributed than outer space spores?
franklin: another scientist who worked with spark
clusterbam: sharded bams... maybe not good to name ourselves after a banned weapon system though?
borg: highly parallel distributed software, different copyright issues
I'm drawing a blank on anything good. The functional names are fine, but they're very clunky.
Some weird suggestions playing with @heuermh's short/functional descriptions:
And even more weird, based on scattering letters on the words:
I haven't done an extensive search, so it might clash with other products in the wild.
P.S.: just for fun - I realized that my full-name initials fit for a project name - DGS (Distributed Genomics at Scale).
Yeah, a lot of good words in https://en.wikipedia.org/wiki/Panspermia
If only Anaxagoras or Wickramasinghe were easier to spell. :)
A couple of new ones (playing on parallel, distributed, and sequencing):
disq is too similar to Disqus, and there is already a Java project for a queue/task executor with that name (https://github.com/intelie/disq)
Names don't have to be unique, they just have to not risk confusion. (Search for "confusing similarity" on the Apache Trademarks page https://www.apache.org/foundation/marks/#principles.) Neither of the examples you cite is in the bio or genomics space, so there is little chance of confusion in a user's mind IMO.
Some more:
bamblaster, bamifold, parnomics, seqstorm, splitomics
So far, I think squark is my favorite.
I'd like to compile a shortlist to vote on. Please nominate up to two names to add to the list. Here are mine:
Hmm. I have some new suggestions but I'm not sure they make the shortlist.
Sorry, still in brainstorm mode
For the short list:
Great - if everyone who wants to add something to the shortlist can comment here in the next couple of days I'll put together a vote.
If we go with zapbam we could try to claim this punching lightning bolt logo for some added pow!
I think we're bikeshedding this - let's call it disq and move on. I haven't heard any real objection to disq; it's simple and short - and neutral.
+1 for disq.
We could get a dot bio domain and include the domain as part of the name to help distinguish from disq.us and related. For a logo, I can find someone to do up something like this:
Distributed disq throwing! Maybe with a double helix pattern on the discs.
I'm still a fan of zapbam. I'm happy to move forward though with either name.
👍 for disq in the interest of moving on.
Let's not block this: 👍 to disq
Thanks for all the input! I've carried out the rename of the code here: https://github.com/tomwhite/disq.
I think we can rename this github org and repo, import the code, and complete the governance issues.
Thanks! Closing this as resolved, let's try to resolve the organization #2 and namespace #7 issues next.
Regarding naming, in the meeting a couple of names were suggested:
@tomwhite would also like to put forward the following (in the Spark sequencing vein):
Re: Apache Spark Trademark Guidelines
Software products, whether commercial or open source, are not allowed to use “Spark” in their name, except in the form “powered by Apache Spark” or “for Apache Spark” when following these specific guidelines.
Re: Basic Name Search Considerations