ContentMine / cproject

ArgProcessor and files for basic CMDirectories. Often subclassed. Needs to be separate from euclid and norma
Apache License 2.0

Revision of CommandLine #3

Open petermr opened 8 years ago

petermr commented 8 years ago

The command line has been developed for individual components (cmine, norma, ami) rather than a complete commandline for chaining these together. We have shown that they can be chained in org.xmlcml.ami2.plugins.CommandProcessor. However the syntax was ad hoc, used horrible punctuation, interacted with the shell, was different from the normal style, etc.

word(frequencies)xpath:@count>20~w.stopwords:pmcstop.txt_stopwords.txt
sequence(dnaprimer)
...

I put my hand up and ask for forgiveness. (Even I can't remember the syntax).

This is a proposal to deprecate it (it's only 1 month old) and replace by something less complex and more consistent.

PROPOSAL

Each module (including ami submodules) has a reserved command starting with _, i.e.:

_cmine
_norma
_gene
_identifier
_phylo
_regex
_search
_sequence
_species
_word

These can be chained (the newlines are just for prettiness) as:

_norma --xsl nlm2html
_search disease
_search inn
_sequence dna
_word frequencies
_species binomial
_species genus

The structure here is

_plugin option

This will give rise to a single results.xml as

/cproject/ctree/plugin/option/results.xml

Note that the duplicated operations (here _search and _species) are independent and run in parallel. The normal arguments in args.xml can then be appended:

_identifier rrid --id.regex myregex.xml
_phylo --ph.newick myfile.nwk --ph.nexml myfile2.nexml

Note that the option is optional for some modules.

Generally this will make commandlines simpler.
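
As a rough illustration of the proposal above, here is a minimal Python sketch (not the Java implementation in cmine/ami) of how such a chain might be parsed. It assumes that any token starting with `_` opens a new plugin command, that the first following token is the option unless it starts with `-`, and that results land under `plugin/option/results.xml` inside the CTree. The function names `parse_chain` and `results_path` are hypothetical, and the path used when the option is omitted is a guess.

```python
import shlex


def parse_chain(command_line):
    """Split a chained command line into plugin commands.

    Assumption: a token starting with '_' opens a new plugin command; the
    next token is its option unless it starts with '-'; everything else up
    to the next '_' token is an argument of that plugin.
    """
    commands = []
    for token in shlex.split(command_line):
        if token.startswith("_"):
            if len(token) == 1:
                raise ValueError("'_' must be followed by a plugin name")
            commands.append({"plugin": token[1:], "option": None, "args": []})
        elif not commands:
            raise ValueError(f"argument {token!r} appears before any _plugin")
        elif (commands[-1]["option"] is None
              and not commands[-1]["args"]
              and not token.startswith("-")):
            commands[-1]["option"] = token
        else:
            commands[-1]["args"].append(token)
    return commands


def results_path(cproject, ctree, command):
    """Build cproject/ctree/plugin/option/results.xml for one command.

    Assumption: when the option is omitted, that path segment is skipped.
    """
    parts = [cproject, ctree, command["plugin"]]
    if command["option"] is not None:
        parts.append(command["option"])
    parts.append("results.xml")
    return "/".join(parts)
```

One weakness of the prefix convention shows up immediately: an argument value that itself starts with `_` (for example a file called `_stopwords.txt`) would be mistaken for a new plugin.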

tarrow commented 8 years ago

I definitely think that sorting out the possible command line options is a great idea!

I have a few thoughts of my own to add to this: I'm concerned about focussing too much on unification of norma and ami into a single tool. This might sound totally anti to previous comments I have made about simplicity and ease of running but I don't think it is entirely the case. I fear this may be a step backwards (basically making a utility just like ami was before norma was factored out of it). We want to keep code common to norma and ami together (e.g. handling CTrees) but we don't want ami to depend on norma (this code should be migrated to cmine) and we certainly don't want to make cmine (this repo) depend on either of the two and start some circular dependency situation.

However I do think a simple 'frontend tool' that calls both ami and norma could be great! I think we should still provide norma and ami as standalone command line utilities that do their own thing but this doesn't stop us having a third tool (name TBD) that calls both as and when needed. This may be precisely what you were suggesting but I wasn't clear. Perhaps it should even be called cmine (hence why the issue is in this repo?).

This new tool should depend on both ami and norma (obviously) but could also even depend on quickscrape and getpapers etc... A sort of one-stop shop for textmining. This should also be the place in which we do post-mining analysis IMHO (making the html tables and so on).

In terms of the structure for what the command line arguments should look like I haven't yet got a solid opinion. The current situation with (e.g.) ami2-gene --g.gene --g.type.. could also be improved. Currently, for example, --g.gene seems superfluous because it is already specified by the name of the command line script. --g.type seems less than intuitive to me because of the g. prefix.

ami2-gene --type typename --project projectname should be all that is needed. The fact that the typename given is a type of gene should be inferred because we are calling ami2-gene. However this doesn't fit well with an extensible plugin architecture because we would need to make a new ami2-plugin for every plugin written.

This obviously changes if we want to run multiple plugins at a single go because we need to know which option goes with which plugin. I'm not sure how best to approach this; perhaps commandname --plugin foo bar baz --baz-thingtovary 2 --foo-differentthing 0 --bar-dump /var/log/bazdump.log

I think this is more intuitive for a command line user than using a '_' as a special character. We also don't need to do any special reservation of commands then. We just look through all plugins in the class path (or perhaps even an external resource (e.g. to keep chemtagger in a separate jar)) and see if any match those given. If so we load them and pass them all --pluginname-x y. They can then decide what to do with the options.
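
A minimal Python sketch of this routing idea follows. It is only an illustration: `route_options` and `available_plugins` are hypothetical names (the latter stands in for whatever a class path scan would find), and it assumes `--plugin` comes first and that every `--pluginname-x` option takes exactly one value.

```python
def route_options(argv, available_plugins):
    """Group '--<plugin>-<name> value' pairs by the plugin they belong to.

    Assumptions: '--plugin' comes first and is followed by one or more
    plugin names; every other option takes exactly one value.
    """
    requested = []
    i = 0
    if argv[:1] == ["--plugin"]:
        i = 1
        while i < len(argv) and not argv[i].startswith("--"):
            requested.append(argv[i])
            i += 1

    unknown = [name for name in requested if name not in available_plugins]
    if unknown:
        raise ValueError(f"unknown plugins: {unknown}")

    routed = {name: {} for name in requested}
    while i < len(argv):
        flag = argv[i]
        if not flag.startswith("--"):
            raise ValueError(f"expected an option, got {flag!r}")
        # '--baz-thingtovary' -> plugin 'baz', option 'thingtovary'
        plugin, sep, option = flag[2:].partition("-")
        if not sep or plugin not in routed:
            raise ValueError(f"cannot route {flag!r} to a requested plugin")
        if i + 1 >= len(argv):
            raise ValueError(f"{flag!r} is missing its value")
        routed[plugin][option] = argv[i + 1]
        i += 2
    return routed
```

Splitting on the first `-` means plugin names could not themselves contain a hyphen under this scheme, which would need to be an agreed convention.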

If desired we could have a --norma-transform transformname, --html-table or even a --getpapers-query zika in this 'super tool'

In any case I think in addition to having a supertool we should look at the state of the command line options in the standard norma and ami binaries for people who want to use them in a standalone fashion.

petermr commented 8 years ago

Immediate comments:

At present we have a list of formally independent commands:

norma xsl -x myfile.xsl
word frequencies --w.stopwords pmcstop.txt stopwords.txt --minfreq 20
sequence dnaprimer
search tropicalVirus --dictionary org/xmlcml/ami2/plugins/dictionary/tropicalVirus.xml
species binomial
gene human
summary datatables
summary frequencies

The general form is therefore

command option [argument, argument...]

The separation must be syntactic. Otherwise, how do we tell that species is not another argument of --dictionary?

There are the following options:

NORMA nlm2html SPECIES binomial

This is also fragile.

I am aware that this will develop into a full language, but at present we can contain it.

My current preference is for one/some/all of

NL
\n
_

as isolated character strings.

I agree we drop the _species and replace by _ species. This is trivial at present.

petermr commented 8 years ago

Have changed petermr version to use _ as command separator (not prefix). Example:

cmine _ norma xsl -x myfile.xsl _ word frequencies --w.stopwords pmcstop.txt stopwords.txt --minfreq 20 _ sequence dnaprimer _ summary datatables

We can easily add another separator if required (e.g. NL).
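
For illustration, splitting on an isolated separator token can be sketched in a few lines of Python. This is not the actual cmine code; `split_commands` is a hypothetical name, and the sketch assumes the separator never occurs as a genuine argument value.

```python
def split_commands(argv, separator="_"):
    """Split argv into general arguments and a list of per-command argv lists.

    Tokens before the first separator are treated as general arguments;
    each isolated separator token starts a new command.
    """
    general = []
    commands = []
    current = general
    for token in argv:
        if token == separator:
            current = []
            commands.append(current)
        else:
            current.append(token)
    return general, commands
```

With this approach the separator is what stops a following command name (such as `species`) from being read as one more value of a multi-valued option like `--w.stopwords` or `--dictionary`. Adding another separator such as NL would only change the `separator` comparison.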

blahah commented 8 years ago

I'm already wrapping all this in an electron app which will be cross platform - and my deadline for having it all working is CSVconf in less than a month. I suggest waiting to see if what I produce meets the need. There's no real reason to make a wrapper for the command-line when almost everyone will want to use a GUI.

I also think having a new (obscure) way of writing command-line instructions will decrease, rather than increase, usability.

tarrow commented 8 years ago

How will the cross platform app work? Will it just call the local binaries? Or are you somehow going to bundle them all together?

tarrow commented 8 years ago

In terms of needing a command separator I think we can avoid it. All we need to do (probably when we look through how ami processes commands as well) is ensure that the plugins don't take 'free form' text separated by spaces.

To alter your command line as an example:

cmine --norma-transform xsl --norma-stylesheet myfile.xsl --plugin word --word-type frequencies --word-stopwords pmcstop.txt,stopwords.txt --word-minfreq 20 --plugin sequence --sequence-type dnaprimer --summary datatables

This makes the pairs of objects --option-name optionvalue free to change order. I think this form of --option [value if required] is much more standard. It also means we could use a standard argument processor rather than having to implement our own.
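
To show what "a standard argument processor" could look like for this style, here is a small sketch using Python's standard `argparse` module (the real tools are Java, so this is only an analogy). The flag names are taken from the example command line above; the types, the defaults, and the choice to make `--plugin` repeatable are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the '--option value' style of command line."""
    parser = argparse.ArgumentParser(prog="cmine")
    parser.add_argument("--norma-transform")
    parser.add_argument("--norma-stylesheet")
    # '--plugin' may be repeated; each occurrence is appended to a list.
    parser.add_argument("--plugin", action="append", default=[])
    parser.add_argument("--word-type")
    # Comma-separated values avoid 'free form' text separated by spaces.
    parser.add_argument("--word-stopwords",
                        type=lambda s: s.split(","), default=[])
    parser.add_argument("--word-minfreq", type=int)
    parser.add_argument("--sequence-type")
    parser.add_argument("--summary")
    return parser
```

Because each value is bound to its own named option, the pairs can be given in any order and the parsed result is the same. The cost is that every plugin option has to be declared up front, or discovered from the plugins before parsing.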

blahah commented 8 years ago

@tarrow it will be an electron app (which is already cross-platform) with a plugin system. Each plugin will be a node package that wraps some tool in a consistent API, and automatically gets the right precompiled binaries.

petermr commented 8 years ago

Thanks, I am all for using standard tools where possible.

Some immediate points.

... have to go now...

On Wed, Apr 6, 2016 at 10:08 AM, Richard Smith-Unna < notifications@github.com> wrote:

@tarrow https://github.com/tarrow it will be an electron app http://electron.atom.io (which is already cross-platform) with a plugin system. Each plugin will be a node package that wraps some tool in a consistent API, and automatically gets the right precompiled binaries.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/ContentMine/cmine/issues/3#issuecomment-206241816

Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069

blahah commented 8 years ago

The existing tools can be run on a server in Cambridge, no?

petermr commented 8 years ago

Sure, but the commandline interface was poor which is why I have been hacking it. It's certainly possible to string everything together without separators. It's a question of what is simplest and most robust. If all the options are used it can be a string of 50 arguments.


blahah commented 8 years ago

well I agree that is not ideal - how about only accepting one plugin via commandline arguments, and if the user wants to run multiple plugins they provide a (YAML?) file containing the plugins and their parameters

petermr commented 8 years ago

I think it's possible to do it by concatenated arguments.

My current architecture is:

cmine [general arguments]
  command option [command arguments]
  command option [command arguments]
  command option [command arguments]

The option used to be called `--type` and we can revert to that. It's not
essential for all commands. However it organizes the output very nicely.

More later...

petermr commented 8 years ago

I have continued to refactor cmine, norma and ami. It should now be possible to develop a commandline syntax we are all happy with.

I have developed CMineParser and CMineCommand as a separation of the functions of ArgProcessor. The parser builds a model of the commands in CMineCommand which can then be edited.

I can create a simple-to-parse syntax, with some agreed conventions.

cmine [args] -c command1 [args1] -c command2 [args2] ...

This is easy to parse and reasonably easy to implement as

cmine [general arguments] 
  command1 [option1Arg] [command1arguments]
  command2 [option2Arg] [command2arguments]

Many arguments are inherited from cmine, while others are command-specific.
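
The `-c` syntax can be sketched as a parser that builds an editable model of the commands. The following Python is only an analogy to the CMineParser/CMineCommand split described above, not a port of it: the `Command` class and `parse_cmine` function are hypothetical, and the sketch assumes that every general argument is inherited by every command.

```python
from dataclasses import dataclass, field


@dataclass
class Command:
    """One '-c' command: its name, its own args, and inherited general args."""
    name: str
    args: list = field(default_factory=list)
    inherited: list = field(default_factory=list)


def parse_cmine(argv):
    """Parse '[args] -c command1 [args1] -c command2 [args2] ...'.

    Tokens before the first '-c' are general arguments; each '-c' must be
    followed by a command name, and later tokens belong to that command.
    """
    general = []
    commands = []
    tokens = iter(argv)
    for token in tokens:
        if token == "-c":
            name = next(tokens, None)
            if name is None:
                raise ValueError("-c must be followed by a command name")
            # Copy so that editing one command does not affect the others.
            commands.append(Command(name=name, inherited=list(general)))
        elif commands:
            commands[-1].args.append(token)
        else:
            general.append(token)
    return general, commands
```

Since the result is a plain list of `Command` objects, it can be inspected or edited before anything is run. The sketch does not handle a command argument that is literally `-c`.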