cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

RFC: Making configurations less verbose #22830

Open Dr15Jones opened 6 years ago

Dr15Jones commented 6 years ago

Recently I found myself having to hand-write a number of configuration files for testing. The process was very tedious. Even a simple example requires lots of typing:

import FWCore.ParameterSet.Config as cms

process = cms.Process("Test")
process.source = cms.Source("PoolSource", fileNames = cms.untracked.vstring("foo.root","bar.root") )

process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(10) )

process.options = cms.untracked.PSet(numberOfThreads = cms.untracked.uint32(4),
                                     numberOfStreams = cms.untracked.uint32(0) )

process.work = cms.EDProducer("IntProducer", value = cms.int32(1) )

process.out = cms.OutputModule("PoolOutputModule", fileName = cms.untracked.string("test.root") )

process.o = cms.EndPath(process.out, cms.Task(process.work) )

One issue is having to type cms.untracked in many different places. cms.untracked acts like a unary operator. Python allows unary operators to be overloaded, and the best candidate appears to be the bitwise not operator ~. With a trivial modification to _ParameterTypeBase to define an __invert__ method, one could use ~ in addition to cms.untracked. This would make the example become

import FWCore.ParameterSet.Config as cms

process = cms.Process("Test")
process.source = cms.Source("PoolSource", fileNames = ~cms.vstring("foo.root","bar.root") )

process.maxEvents = ~cms.PSet(input = ~cms.int32(10) )

process.options = ~cms.PSet(numberOfThreads = ~cms.uint32(4),
                            numberOfStreams = ~cms.uint32(0) )

process.work = cms.EDProducer("IntProducer", value = cms.int32(1) )

process.out = cms.OutputModule("PoolOutputModule", fileName = ~cms.string("test.root") )

process.o = cms.EndPath(process.out, cms.Task(process.work) )

This is less to write, but reading it would take some getting used to.
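
A minimal sketch of the __invert__ idea, for illustration only. It uses a stand-in class rather than the real _ParameterTypeBase; the names ParamBase and tracked are assumptions made here, not the CMSSW implementation.

```python
# Stand-in for a parameter base class; NOT the real CMSSW implementation.
class ParamBase(object):
    def __init__(self, value):
        self.value = value
        self.tracked = True  # parameters are tracked unless marked otherwise

    def __invert__(self):
        # `~param` marks the parameter as untracked and returns it,
        # so it can be used inline in an expression.
        self.tracked = False
        return self


class int32(ParamBase):
    pass


tracked = int32(1)
untracked = ~int32(10)
print(tracked.tracked, untracked.tracked)
```

Python calls __invert__ on the operand of ~, so the operator reads as a prefix marker on the value it modifies.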

Further changes could be done by allowing shorthands for the standard cms. types. I could see two different ways of handling such shorthands (which would not replace the full names, but could be used as alternatives).

Simple variable substitution

One could create the following file in FWCore/ParameterSet/python/Shorthand.py

import FWCore.ParameterSet.Config as cms
i_ = cms.int32
u_ = cms.uint32
i64_ = cms.int64
u64_ = cms.uint64
s_ = cms.string
d_ = cms.double
P_ = cms.PSet
t_ = cms.InputTag

vi_ = cms.vint32
vu_ = cms.vuint32
vi64_ = cms.vint64
vu64_ = cms.vuint64
vs_ = cms.vstring
vd_ = cms.vdouble
VP_ = cms.VPSet
vt_ = cms.VInputTag

Using these simple assignments would allow the following version of a configuration

from FWCore.ParameterSet.Shorthand import *

process = cms.Process("Test")
process.source = cms.Source("PoolSource", fileNames = ~vs_("foo.root","bar.root") )

process.maxEvents = ~P_(input = ~i_(10) )

process.options = ~P_(numberOfThreads = ~u_(4),
                      numberOfStreams = ~u_(0) )

process.work = cms.EDProducer("IntProducer", value = i_(1) )

process.out = cms.OutputModule("PoolOutputModule", fileName = ~s_("test.root") )

process.o = cms.EndPath(process.out, cms.Task(process.work) )

Type as Units

Another way would be to use a helper class which works similarly to assigning a unit (e.g. cm) to a value to create the appropriate type. This could be done using the multiplication operator. Python already has a history of using the multiplication operator in novel ways, such as l = [0]*3, which is equivalent to l=[0,0,0]. For this case, one could write v = i_*1, which is equivalent to v=cms.int32(1), or v=i_*[1,2], which is equivalent to v=cms.vint32(1,2). Note that in this version we do not need a new variable for each container type; instead we can reuse the same symbol and have it map to all items in a standard Python container. Using such a helper class the example could become

from FWCore.ParameterSet.Shorthand import *

process = cms.Process("Test")
process.source = cms.Source("PoolSource", fileNames = ~s_*["foo.root","bar.root"] )

process.maxEvents = ~P_(input = ~i_*10 )

process.options = ~P_(numberOfThreads = ~u_*4,
                      numberOfStreams = ~u_*0 )

process.work = cms.EDProducer("IntProducer", value = i_*1 )

process.out = cms.OutputModule("PoolOutputModule", fileName = ~s_*"test.root" )

process.o = cms.EndPath(process.out, cms.Task(process.work) )
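
One way such a unit-like helper could be written is sketched below. Param and Unit are hypothetical stand-ins invented for this sketch, not CMSSW classes, and the type names are plain strings rather than real cms types.

```python
# Hypothetical stand-ins for the cms parameter types; not the CMSSW classes.
class Param(object):
    def __init__(self, type_name, value, tracked=True):
        self.type_name = type_name
        self.value = value
        self.tracked = tracked


class Unit(object):
    """Turns `unit * value` into a typed parameter."""

    def __init__(self, scalar_name, vector_name, tracked=True):
        self.scalar_name = scalar_name
        self.vector_name = vector_name
        self.tracked = tracked

    def __invert__(self):
        # `~unit` yields an untracked variant of the same unit.
        return Unit(self.scalar_name, self.vector_name, tracked=False)

    def __mul__(self, value):
        # A list or tuple selects the vector type, anything else the scalar.
        if isinstance(value, (list, tuple)):
            return Param(self.vector_name, list(value), self.tracked)
        return Param(self.scalar_name, value, self.tracked)


i_ = Unit("int32", "vint32")
s_ = Unit("string", "vstring")

v = i_ * 1                               # scalar, tracked
files = ~s_ * ["foo.root", "bar.root"]   # vector, untracked
```

Because unary ~ binds more tightly than *, an expression like ~s_*[...] applies ~ to the unit first and then multiplies, which is why __invert__ is defined on the helper in this sketch.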

Any of these changes is trivial to make. The question is whether the reduction in verbosity makes up for code that may be more difficult (at least initially) to read.

cmsbuild commented 6 years ago

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 6 years ago

assign core

cmsbuild commented 6 years ago

New categories assigned: core

@Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

bbockelm commented 6 years ago

Doesn't python have some newfangled type hinting in the 3 series? Is this something we could utilize?

Dr15Jones commented 6 years ago

@bbockelm We are using python 2.7 in CMSSW. We will (probably) attempt a move to python 3 during the long shutdown.

dmitrijus commented 6 years ago

@Dr15Jones

Perhaps it is not a very objective reason, but these underscore shorthands look incredibly ugly. I strongly prefer the first code block to the 'rewritten' one.

from FWCore.ParameterSet.Shorthand import * is no prettier than from FWCore.ParameterSet.Config import *

Except with the latter, at least the types are still readable... int32(something), untracked.double.

If anything, the second line of import this tells us that Explicit is better than implicit. Operator overloading is in no way 'explicit'. It was done before this proposal (*, +) and I have found it somewhat annoying (and non-trivial).

For the last part, I don't really see how d_*[1,2] is better than vdouble(1,2).

Overall, this proposal feels like you are reimplementing perl in python. "Explicitness" is what makes python great, I suggest we keep it that way.

belforte commented 6 years ago

I am wondering who the target is here. Experience from CRAB land suggests that many users would be fully lost. OTOH there's ample evidence that most users work by using precooked configs and at most changing one or two values "under dictation", making no attempt to understand what they are doing. So if this is expert stuff, anything will do. But if general usage is desired, more care for the end user is needed. Looking from afar as someone who had never used this, there's something disturbing in the fact that types need to be specified. Python should work without it, and the internals could cast to proper C types if/when needed in order to interface to EDM, and throw if the input does not look sensible. No?

Still from afar, it is good that code can be read, and all the suggested shorthands look set to make that harder. If anything, process.o is oddly cryptic already! But indeed there is excessive verbosity :-( Ideally I'd like to type something like this:

process = 'Test'
PoolSource (fileNames = ["foo","bar"])
maxEvents = 10
numThreads = 4
work = "IntProducer"
PoolOutput (fileName="foobar")

pmaksim1 commented 6 years ago

I second Stefano's suggestion. I ended up checking this out since I agree that the config files are too verbose and hoped to see an improvement. But ~s_ seems incredibly dangerous since it's easy to drop either line, and it's basically yet another config language the poor user has to learn.

I think I can authoritatively speak to the lack of user expertise in this area, since I barely know enough Python to make changes in config files, and I guess correctly only 50% of the time. So for a newbie like myself, verbosity is actually good, since it helps me understand what is what.

I get that this is tedious to experts... but how often do the experts need to write lots of config files? If that ever happens, can't we have scripts or GUI writing config files? (Didn't we have ConfigEditor at some point?) If we are spending time improving the interface, in this era of AI can't we go to natural language processing directly? Or at least have something like Stefano's example above -- not much typing, and perfectly clear to the user...

VinInn commented 6 years ago

I agree that the only sensible change would be to obsolete cms specific types and use plain python literals

makortel commented 6 years ago

Looking from afar as someone who had never used this, there's something disturbing in the fact that types need to be specified. Python should work without it, and the internals could cast to proper C types if/when needed in order to interface to EDM, and throw if the input does not look sensible.

While I do agree on the verbosity of our configuration language, and that in principle the explicit typing is unnecessary (boilerplate), I want to point out that the current system allows python-level typo checking, i.e.

prod = cms.EDProducer("FooProducer", foo = cms.int32(3))
prod.foo = 4 # works
prod.fop = 5 # throws exception when running through python
prod.fop = cms.int32(5) # works

I find this feature very handy e.g. when migrating/customizing configuration files for some code change. Of course there are other ways of achieving the same checks e.g. in case we drop the explicit types by default.
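
The behaviour in the snippet above can be imitated with a small __setattr__ hook. This is only a sketch of the idea; Module and int32 here are hypothetical stand-ins, and the real cms.EDProducer is more involved.

```python
# Hypothetical stand-ins; not the real cms.EDProducer or cms.int32.
class int32(object):
    def __init__(self, value):
        self.value = value


class Module(object):
    def __init__(self, type_name, **params):
        # Bypass our own __setattr__ while setting up internal state.
        object.__setattr__(self, "type_name", type_name)
        object.__setattr__(self, "params", dict(params))

    def __setattr__(self, name, value):
        if name in self.params:
            # Known parameter: a bare value replaces the stored one.
            if isinstance(value, int32):
                self.params[name] = value
            else:
                self.params[name].value = value
        elif isinstance(value, int32):
            # An explicitly typed value declares a new parameter.
            self.params[name] = value
        else:
            raise AttributeError("unknown parameter '%s'" % name)


prod = Module("FooProducer", foo=int32(3))
prod.foo = 4              # accepted: foo already exists
try:
    prod.fop = 5          # rejected: probably a typo for foo
    typo_caught = False
except AttributeError:
    typo_caught = True
prod.fop = int32(5)       # accepted: explicit type declares a new parameter
```

The explicit type is what distinguishes "declare a new parameter" from "modify an existing one", so dropping types by default would need some other signal for that distinction.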

Dr15Jones commented 6 years ago

I wanted to thank everyone for their comments. I'm afraid fully redesigning the configuration system (and the underlying provenance capture implementation that drives much of the design) is beyond the scope of this particular RFC.

bbockelm commented 6 years ago

@Dr15Jones - dumb question, but is there an urgency here?

That is, if the move to Python3 is delayed until the long shutdown -- and Python 3 offers some native language-level syntax help for type inference -- why not simply wait to do this for the long shutdown?

I mean, I understand the urge to do something here: but it seems at this point we're really only waiting 12 months or so. No?

Dr15Jones commented 6 years ago

@bbockelm in my limited search, the only type control I could find which was added to Python 3 is https://docs.python.org/3/library/typing.html#classes-functions-and-decorators

That appears to only restrict which types can be used with functions or member data, which is not the same problem as this RFC was trying to address (which was shortening the declaration of a type).
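
As a small Python 3 illustration of that difference: annotations attach a type to a name but do not construct a typed value, and the interpreter itself does not enforce them (external static checkers do). The function name scale is made up for this example.

```python
def scale(value: int, factor: float = 2.0) -> float:
    # Annotations are stored as metadata; the interpreter does not enforce them.
    return value * factor


hints = scale.__annotations__
result = scale("ab", 2)   # a str slips through at run time; str * int repeats it
```

So a hint like value: int documents intent, but it would not by itself replace something like cms.int32(10), which creates a typed parameter object.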

Regardless, given the negative feedback on this RFC no changes will be implemented.

fwyzard commented 6 years ago

Hi @Dr15Jones, sorry for joining the party late - I think I agree with many of the comments here.

One good alternative I see would be

import FWCore.ParameterSet.Config as cms

process = cms.Process("Test")

process.source = cms.Source("PoolSource",
    fileNames = [ "foo.root", "bar.root" ]
)

process.maxEvents = dict(
    input = 10
)

process.options = dict(
    numberOfThreads = 4,
    numberOfStreams = 0
)

process.work = cms.EDProducer("IntProducer",
    value = 1
)

process.out = cms.OutputModule("PoolOutputModule",
    fileName = "test.root"
)

process.o = cms.EndPath( process.out, cms.Task(process.work))

This would delay all type and "trackiness" (tracked-ness?) checking to the C++ part, for example to the ParameterSet validation implemented via fillDescriptions().

It should also be possible to support the old syntax by making the cms.type types aliases for the underlying python types.

(I do not know how the python-to-C++ translation works in CMSSW, whether everything is a string before getParameter(), or if it needs to assign the values to C++ variables; if the latter, we could use some variant-like type)

Here's an even more ambitious syntax (which I don't know if we could actually implement):

process = cms.Process("Test")

process.source = PoolSource(
    fileNames = [ "foo.root", "bar.root" ]
)

process.maxEvents = dict(
    input = 10
)

process.options = dict(
    numberOfThreads = 4,
    numberOfStreams = 0
)

process.work = IntProducer(
    value = 1
)

process.out = PoolOutputModule(
    fileName = "test.root"
)

process.o = cms.EndPath( process.out, cms.Task(process.work))

This would deduce the python type from the C++ object.

The "known" PSets like process.options and process.maxEvents could also be simplified to

process.maxEvents(
    input = 10
)

process.options(
    numberOfThreads = 4,
    numberOfStreams = 0
)

but I'm not sure we could do it for arbitrary PSets.

Anyway, just some ideas, probably for LS2...

fwyzard commented 6 years ago

OK, it's actually doable also for arbitrary PSets, with something like:

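class PSet(object):
  # Calling a PSet sets its parameters from the keyword arguments.
  def __call__(self, **kwargs):
    self.__dict__.update(kwargs)

class Process(object):
  def __init__(self, name):
    # '@name' is not a valid identifier, so it cannot clash with a PSet name.
    setattr(self, '@name', name)

  # Only called when normal lookup fails: create the PSet on first access.
  def __getattr__(self, key):
    if key.startswith('__'):
      raise AttributeError(key)  # do not create PSets for special-method lookups
    setattr(self, key, PSet())
    return getattr(self, key)

Dr15Jones commented 5 years ago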

#27191 incorporates parts of this discussion in order to allow users to not always have to specify the types.