Open petermr opened 8 years ago
I definitely think that sorting out the possible command line options is a great idea!
I have a few thoughts of my own to add to this: I'm concerned about focussing too much on unification of norma and ami into a single tool. This might sound totally anti to previous comments I have made about simplicity and ease of running but I don't think it is entirely the case. I fear this may be a step backwards (basically making a utility just like ami was before norma was factored out of it). We want to keep code common to norma and ami together (e.g. handling CTrees) but we don't want ami to depend on norma (this code should be migrated to cmine) and we certainly don't want to make cmine (this repo) depend on either of the two and start some circular dependency situation.
However I do think a simple 'frontend tool' that calls both ami and norma could be great! I think we should still provide norma and ami as standalone command line utilities that do their own thing but this doesn't stop us having a third tool (name TBD) that calls both as and when needed. This may be precisely what you were suggesting but I wasn't clear. Perhaps it should even be called cmine (hence why the issue is in this repo?).
This new tool should depend on both ami and norma (obviously) but could also even depend on quickscrape and getpapers etc... A sort of one-stop shop for textmining. This should also be the place in which we do post-mining analysis IMHO (making the html tables and so on).
In terms of the structure for what the command line arguments should look like I haven't yet got a solid opinion. The current situation with (e.g.) `ami2-gene --g.gene --g.type..` could also be improved. Currently, for example, `--g.gene` seems superfluous because it is already specified by the name of the command line script. `--g.type` seems less than intuitive to me because of the `g.` prefix.
`ami2-gene --type typename --project projectname` should be all that is needed. The fact that the typename given is a type of gene should be inferred because we are calling `ami2-gene`. However this doesn't fit well with an extensible plugin architecture because we would need to make a new ami2-plugin for every plugin written.
This obviously changes if we want to run multiple plugins at a single go because we need to know which option goes with which plugin. I'm not sure how best to approach this; perhaps `commandname --plugin foo bar baz --baz-thingtovary 2 --foo-differentthing 0 --bar-dump /var/log/bazdump.log`
I think this is more intuitive for a command line user than using a '_' as a special character. We also don't need to do any special reservation of commands then. We just look through all plugins in the class path (or perhaps even an external resource (e.g. to keep chemtagger in a separate jar)) and see if any match those given. If so we load them and pass them all `--pluginname-x y`. They can then decide what to do with the options.
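A rough Python sketch of that routing idea, just to show it is mechanical. This is not existing ami/norma code: the function name `route_plugin_options` is made up, and it assumes plugin names contain no `-` and that every prefixed option takes at most one value.

```python
def route_plugin_options(argv):
    """Split argv into a plugin list and one option dict per plugin.

    Assumed convention (hypothetical, taken from the example above):
    `--plugin a b c` names the plugins and `--<plugin>-<option> value`
    belongs to <plugin>.
    """
    plugins = []
    i = 0
    # First pass: collect the names that follow each --plugin flag.
    while i < len(argv):
        if argv[i] == "--plugin":
            i += 1
            while i < len(argv) and not argv[i].startswith("--"):
                plugins.append(argv[i])
                i += 1
        else:
            i += 1

    options = {name: {} for name in plugins}
    # Second pass: hand each --<plugin>-<option> to the matching plugin.
    i = 0
    while i < len(argv):
        token = argv[i]
        if token.startswith("--") and token != "--plugin":
            name, sep, option = token[2:].partition("-")
            if sep and name in options:
                has_value = i + 1 < len(argv) and not argv[i + 1].startswith("--")
                if has_value:
                    options[name][option] = argv[i + 1]
                    i += 1
                else:
                    # A bare flag with no value is recorded as True.
                    options[name][option] = True
        i += 1
    return plugins, options
```

Each plugin would then receive only its own dict and decide what to do with it; options whose prefix matches no loaded plugin are simply ignored in this sketch.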
If desired we could have a `--norma-transform transformname`, `--html-table` or even a `--getpapers-query zika` in this 'super tool'.
In any case I think in addition to having a supertool we should look at the state of the command line options in the standard norma and ami binaries for people who want to use them in a standalone fashion.
Immediate comments:
`norma` retains its independence, as does each `ami` plugin. At present we have a list of formally independent commands:
norma xsl -x myfile.xsl
word frequencies --w.stopwords pmcstop.txt stopwords.txt --minfreq 20
sequence dnaprimer
search tropicalVirus --dictionary org/xmlcml/ami2/plugins/dictionary/tropicalVirus.xml
species binomial
gene human
summary datatables
summary frequencies
The general form is therefore
command option [argument, argument...]
The separation must be syntactic. Else how do we tell that `species` is not another argument of `--dictionary`?
There are the following options: `,.-_` (checked with SO).
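To make the ambiguity concrete, here is a minimal Python sketch. It is not ContentMine code: the parser is a stand-in for the `search` command, the dictionary filename is shortened, and Python's standard `argparse` is used purely for illustration. It assumes `--dictionary` may take several values, as `--w.stopwords` does in the list above.

```python
import argparse

# Hypothetical stand-in for the `search` command; --dictionary accepts
# one or more values.
parser = argparse.ArgumentParser(prog="search")
parser.add_argument("--dictionary", nargs="+")


def parse(line):
    """Parse a whitespace-separated command line string."""
    return parser.parse_args(line.split())


# With no separator, the following command is swallowed as extra values.
args = parse("--dictionary tropicalVirus.xml species binomial")
print(args.dictionary)  # ['tropicalVirus.xml', 'species', 'binomial']
```

Nothing in the token stream tells the parser that `species` starts a new command, which is why some syntactic separator (or a ban on multi-valued options) is needed.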
NORMA nlm2html SPECIES binomial
This is also fragile.
I am aware that this will develop into a full language, but at present we can contain it.
My current preference is for one/some/all of `NL`, `\n`, `_` as isolated character strings.
I agree we drop the `_species` and replace by `_ species`. This is trivial at present.
Have changed petermr version to use `_` as command separator (not prefix). Example:
cmine _ norma xsl -x myfile.xsl _ word frequencies --w.stopwords pmcstop.txt stopwords.txt --minfreq 20 _ sequence dnaprimer _ summary datatables
We can easily add another separator if required (e.g. `NL`).
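For illustration only, splitting on an isolated separator token can be sketched in a few lines of Python. This is not the actual cmine implementation; `split_commands` is a hypothetical name, and anything before the first separator is assumed to be a general argument.

```python
def split_commands(argv, separator="_"):
    """Split argv into general arguments and a list of commands.

    Every isolated `separator` token starts a new command; tokens seen
    before the first separator are treated as general arguments
    (an assumption of this sketch).
    """
    general = []
    commands = []
    current = None
    for token in argv:
        if token == separator:
            current = []
            commands.append(current)
        elif current is None:
            general.append(token)
        else:
            current.append(token)
    return general, commands
```

Adding another separator such as `NL` would only mean testing the token against a set of separators instead of a single string. The cost is that a genuine argument consisting solely of `_` can no longer be passed.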
I'm already wrapping all this in an electron app which will be cross platform - and my deadline for having it all working is CSVconf in less than a month. I suggest waiting to see if what I produce meets the need. There's no real reason to make a wrapper for the command-line when almost everyone will want to use a GUI.
I also think having a new (obscure) way of writing command-line instructions will decrease, rather than increase, usability.
How will the cross platform app work? Will it just call the local binaries? Or are you somehow going to bundle them all together?
In terms of needing a command separator I think we can avoid it. All we need to do (probably when we look through how ami processes commands as well) is ensure that the plugins don't take 'free form' text separated by spaces.
To alter your command line as an example:
cmine --norma-transform xsl --norma-stylesheet myfile.xsl --plugin word --word-type frequencies --word-stopwords pmcstop.txt,stopwords.txt --word-minfreq 20 --plugin sequence --sequence-type dnaprimer --summary datatables
This makes the pairs of objects `--option-name optionvalue` free to change order. I think this form of `--option [value if required]` is much more standard. It also means we could use a standard argument processor rather than having to implement our own.
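As a sketch of what "a standard argument processor" could look like for the line above, here it is with Python's `argparse`. Python is used only for illustration (ami and norma are Java); the option names are copied from the example, while the types and the comma-splitting of stopwords are assumptions.

```python
import argparse


def build_parser():
    """Declare the options from the example line; nothing is hand-parsed."""
    parser = argparse.ArgumentParser(prog="cmine")
    parser.add_argument("--norma-transform")
    parser.add_argument("--norma-stylesheet")
    # --plugin may be repeated; each occurrence is appended to a list.
    parser.add_argument("--plugin", action="append", default=[])
    parser.add_argument("--word-type")
    # Assumption: several stopword files are given as one comma-separated value.
    parser.add_argument("--word-stopwords", type=lambda s: s.split(","))
    parser.add_argument("--word-minfreq", type=int)
    parser.add_argument("--sequence-type")
    parser.add_argument("--summary")
    return parser
```

Because every value is attached to a named option, the pairs can be shuffled freely and still parse to the same result; only the relative order of repeated `--plugin` flags is kept.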
@tarrow it will be an electron app (which is already cross-platform) with a plugin system. Each plugin will be a node package that wraps some tool in a consistent API, and automatically gets the right precompiled binaries.
Thanks, I am all for using standard tools where possible.
Some immediate points.
... have to go now...
On Wed, Apr 6, 2016 at 10:08 AM, Richard Smith-Unna < notifications@github.com> wrote:
@tarrow https://github.com/tarrow it will be an electron app http://electron.atom.io (which is already cross-platform) with a plugin system. Each plugin will be a node package that wraps some tool in a consistent API, and automatically gets the right precompiled binaries.
— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/ContentMine/cmine/issues/3#issuecomment-206241816
Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069
The existing tools can be run on a server in Cambridge, no?
Sure, but the commandline interface was poor which is why I have been hacking it. It's certainly possible to string everything together without separators. It's a question of what is simplest and most robust. If all the options are used it can be a string of 50 arguments.
well I agree that is not ideal - how about only accepting one plugin via commandline arguments, and if the user wants to run multiple plugins they provide a (YAML?) file containing the plugins and their parameters
I think it's possible to do it by concatenated arguments.
My current architecture is:
cmine [general arguments]
command option [command arguments]
command option [command arguments]
command option [command arguments]
The option used to be called `--type` and we can revert to that. It's not essential for all commands. However it organizes the output very nicely.
More later...
I have continued to refactor `cmine`, `norma` and `ami`. It should now be possible to develop a commandline syntax we are all happy with.
I have developed `CMineParser` and `CMineCommand` as a separation of the functions of `ArgProcessor`. The parser builds a model of the commands in `CMineCommand` which can then be edited.
I can create a simple-to-parse syntax, with some agreed conventions.
cmine [args] -c command1 [args1] -c command2 [args2] ...
this is easy to parse and reasonably easy to implement as
cmine [general arguments]
command1 [option1Arg] [command1arguments]
command2 [option2Arg] [command2arguments]
Many arguments are inherited from `cmine` while others are command-specific.
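A hedged sketch of how such a `-c` chain might be turned into a small command model, including the inheritance of general arguments. This is Python for illustration only and is not `CMineParser`; the names `parse_chain` and `effective_args` are hypothetical, and the rule "general arguments come first, command arguments are appended" is an assumption.

```python
def parse_chain(argv, marker="-c"):
    """Parse `[general args] -c command1 [args1] -c command2 [args2] ...`."""
    general = []
    commands = []
    expect_name = False
    for token in argv:
        if token == marker:
            if expect_name:
                raise ValueError("'-c' must be followed by a command name")
            expect_name = True
        elif expect_name:
            # The token right after the marker names the command.
            commands.append({"name": token, "args": []})
            expect_name = False
        elif commands:
            commands[-1]["args"].append(token)
        else:
            general.append(token)
    if expect_name:
        raise ValueError("'-c' must be followed by a command name")
    return general, commands


def effective_args(general, command):
    """Assumed inheritance rule: general arguments, then the command's own."""
    return general + command["args"]
```

The model is a plain list of dicts, so it can be edited before execution. One limitation of this sketch: an option value that is literally `-c` would be misread as the start of a new command.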
The command line has been developed for individual components (`cmine`, `norma`, `ami`) rather than a complete commandline for chaining these together. We have shown that they can be chained in `org.xmlcml.ami2.plugins.CommandProcessor`. However the syntax was ad hoc, used horrible punctuation, interacted with the shell, was different from the normal style, etc. I put my hand up and ask for forgiveness. (Even I can't remember the syntax).
This is a proposal to deprecate it (it's only 1 month old) and replace by something less complex and more consistent.
PROPOSAL
Each module (including `ami` submodules) has a reserved command starting with `_`, i.e.:
These can be chained (the newlines are just for prettiness) as:
The structure here is
This will give rise to a single `results.xml` as
Note that the duplicated operations (here `_search` and `_species`) are independent and in parallel. The normal arguments in `args.xml` can then be appended:
Note that the option is optional for some modules.
Generally this will make commandlines simpler.