Open prckent opened 6 years ago
I've had this on my radar for a long time. For instance, I went back and forth with Jeongnim and Ken about having validation for the input file in 2008. That said, every time I've looked into it, I've been put off by the potential cost versus the benefits.
The distributed nature of the parsing and construction makes it quite difficult to even collect all of the current valid keywords. This also makes it quite difficult to deal with the "silently ignoring typos" issue because it can be problematic to see exactly which class has the final responsibility for parsing a given section.
That said, I feel from the code side of things that it is rather elegant to structure things this way. This was one of the main reasons that I started writing setup-qmc to handle writing input files as it sidesteps many of these issues.
The best path forward that I've come up with is to define a strict xml schema and to have the xml parser validate against this. Then every time new code was written, it would be required to update the schema accordingly. This could preserve both the current input structure and code architecture.
What I'm quite afraid of is that this could be a real rabbit hole that in addition to sucking up a lot of valuable programmer time could also break a lot of legacy input files and become a real headache for scientific application of qmcpack.
My suggestion would be that before anything is done, a detailed plan of how this could be tackled both in the code and its implications both for users and legacy input should be presented. Then it would be important to get agreement among at least the core developers before putting any plan into action that could affect the interface.
Handling legacy inputs doesn't necessarily need to constrain future changes; e.g. maybe a converter could be provided.
Splitting the XML parsing from the object construction might offer a path to validation as well. If the code is run with just parsing, but no object construction, that would validate the input. (I don't know if there are any dependencies that would cause a problem - some object needs to be constructed before another section could be parsed?)
The current system allows for input to cascade from one object to its subobjects or objects that inherit from it. You certainly could mirror this with a separate parsing hierarchy, but I think it requires duplicating code.
I do certainly appreciate the desirability of a dry run like capability. This could make the code more usable. It would also be a good place to put things about memory requirements etc. I think this could be implemented separately though.
XML parsing is not parallelized, i.e. is replicated.
You mean each rank reads the XML input file? Not good.
The right way to do this is read on master and then broadcast to rest of the ranks.
Two additional pieces of information that may be of interest
Some of our libraries append lead to a DAG in Spack that is fragile. libXML is one of those libraries (see below). pkconfig and xz are pain to compile, mostly due to autoconf failures.
MADNESS uses TinyXML, not sure if it useful -- but really small and self-contained if we wish to remove libXML from the equation. https://github.com/m-a-d-n-e-s-s/madness/tree/master/src/madness/external/tinyxml
[alcfwl120:~] naromero% spack spec qmcpack~mpi
Input spec
--------------------------------
qmcpack~mpi
Normalized
--------------------------------
qmcpack~mpi
^blas
^boost
^cmake@3.4.3:
^espresso@5.3.0~elpa~mpi~scalapack
^fftw~mpi
^lapack
^hdf5~mpi
^zlib@1.1.2:
^libxml2
^pkg-config@0.9.0:
^xz
Common guys. What is the point of parallelizing XML? It was not meant to be big and you are moving to h5 for them anyway.
It was chosen because it can explain the content and neutral so that other codes can communicate with qmcpack. None of that happened so far. Let's see how Molssi people can do better.
On Wed, Oct 11, 2017 at 2:39 PM, Nichols A. Romero <notifications@github.com
wrote:
You mean each rank reads the XML input file? Not good.
The right way to do this is read on master and then broadcast to rest of the ranks.
Two additional pieces of information that may be of interest
1.
Some of our libraries append lead to a DAG in Spack that is fragile. libXML is one of those libraries (see below). pkconfig and xz are pain to compile, mostly due to autoconf failures. 2.
MADNESS uses TinyXML, not sure if it useful -- but really small and self-contained if we wish to remove libXML from the equation. https://github.com/m-a-d-n-e-s-s/madness/tree/master/src/ madness/external/tinyxml
[alcfwl120:] naromero% spack spec qmcpackmpi Input spec
qmcpack~mpi Normalized
qmcpackmpi ^blas ^boost ^cmake@3.4.3: ^espresso@5.3.0elpampiscalapack ^fftwmpi ^lapack ^hdf5mpi ^zlib@1.1.2: ^libxml2 ^pkg-config@0.9.0: ^xz
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/QMCPACK/qmcpack/issues/407#issuecomment-335956824, or mute the thread https://github.com/notifications/unsubscribe-auth/AJCzBgNIfe7nhK2V3KG0KCL9MpURj9Coks5srTWSgaJpZM4PwkA- .
As long as the heavy stuff are in h5 and loaded in proper ways, XML should be small and can be handled anyway we like. For convenience, replicated parsing on every node is good. Parsing by rank 0 and bcasting every control flag are slow and need to code many lines. If reading from disk is a problem, we can just load the whole small XML file in memory and bcast it.
@naromero77 Can we just put a stopper saying that don't build the libxml2 and take the one provided by the system?
Handling of case sensitivity is inconsistent.
Sometimes but not always do we check different CapItAliZations. Var vs var etc.
@ye-luo yes, you can point Spack to particular standard library installations. Yes, replicated parsing is convenient until it causes a problem -- Low priority compared to other issues.
I suggest making everything case insensitive for better usability. Parameters with different meanings that differ only by case is confusing.
Revamp of the input system should include at least three improvements, IMO, and anticipate the later addition of a fourth:
1) Enforce case insensitivity of xml element/attribute/parameter names. 2) Exit with an error if unknown elements/attributes/parameters are encountered. 3) Exit with an error if there are any elements/attributes/parameters which are unread. [4) Exit with an error if inputted values for element body text, attributes, or parameters are incorrect.]
Goal should be to do this will as minimal modification (and thus effort) as possible. Current obstacles:
To (1): XML element names are matched only in nested if statements, so case insensitivity cannot be enforced by a small/modular change of a single class/function as it can be for attributes (OhmmsAttributeSet) and parameters (ParameterSet).
To (2): Knowledge of all element/attribute/parameter names must stored and checked somewhere. (2) could possibly be handled by a single (centralized) XML schema + check. Alternatively, the QMCPACK class that begins the read of an XML element could be responsible for knowing just the names of the allowed XML elements/attributes/parameters and check them (decentralized, possible for attributes/parameters through modifications to OhmmsAttributeSet/ParameterSet as in (1)).
To (3): Cannot be achieved via schema, as what is read or not is local to each class, and this changes with time. Possible minimal modification (no change to class level code already present) would be to rewrap a libxml2 xmlNode in a (derived?) class, allowing for metadata (map(s) tracking whether elements/attributes/parameters have been read) along w/ modifications to OhmmAttributeSet/ParameterSet to mark in the metadata which reads have occurred.
All three in any event require more extensive modifications in the case of XML elements themselves (add ElementSet?) and I'm unsure if a minimally invasive route exists for these.
One quick point,
I do not think aborting on an unused tag should be default. This could break numerous legacy input files. Rather, I think this should be something that can be turned on via a flag or attribute or command line argument.
Luke
Sent from my iPhone
On Oct 23, 2017, at 8:16 AM, jtkrogel notifications@github.com<mailto:notifications@github.com> wrote:
Revamp of the input system should include at least three improvements, IMO, and anticipate the later addition of a fourth:
Goal should be to do this will as minimal modification (and thus effort) as possible. Current obstacles:
To (1): xml element names are matched only in nested if statements, so case insensitivity cannot be enforced by a small/modular change of a single class/function as it can be for attributes (OhmmsAttributeSet) and parameters (ParameterSet).
To (2): Knowledge of all element/attribute/parameter names must stored and checked somewhere. (2) could possibly be handled by a single (centralized) XML schema + check. Alternatively, the QMCPACK class that begins the read of an xml element could be responsible for knowing just the names of the allowed xml elements/attributes/parameters and check them (decentralized, possible for attributes/parameters through modifications to OhmmsAttributeSet/ParameterSet as in (1)).
To (3): Cannot be achieved via schema, as what is read or not is local to each class, and this changes with time. Possible minimal modification (no change to class level code already present) would be to rewrap a libxml2 xmlNode in a (derived?) class, allowing for metadata (map(s) tracking whether elements/attributes/parameters have been read) along w/ modifications to OhmmAttributeSet/ParameterSet to mark in the metadata which reads have occurred.
All three in any event require more extensive modifications in the case of XML elements themselves (add ElementSet?) and I'm unsure if a minimally invasive route exists for these.
- You are receiving this because you were assigned. Reply to this email directly, view it on GitHubhttps://github.com/QMCPACK/qmcpack/issues/407#issuecomment-338673062, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AXMsFe4fpITGuliQkFY0ZBNlPtTQkX4Kks5svJ-BgaJpZM4PwkA-.
General comment: Our initial goal is to work out what is most desirable and optimal, totally independent of the amount of effort required to get there. When we have established both the requirements and "nice to have" features, we will work out how close we can get. We definitely don't want to have to revisit this topic.
Wishlist: Depending on how invasive the XML overhaul is going to be, I would like to create an abstraction layer between the initialization of code objects and the file I/O. Something like (<-- denoting derived from) QMCIOStream<--XMLParser. Makes it significantly easier to initialize QMCPACK from things other than the XML file... from another code for instance.
On ideals. Any part of the below is an improvement (IMHO):
1) User facing input file is no longer XML. XML is confusing to read/write and has too many extraneous delimiting characters for simple, error free composability (e.g. via text editor). XML additionally invites putting simple name/value pairs as attributes, with increasing pressure to move them to the body (e.g. as parameters) when the count gets too high. Parameters and attributes are essentially the same, and therefore should be represented as such in non-XML (e.g. timestep = 0.01
, not <parameter name="timestep"> 0.01 </parameter>
). XML was used initially to allow structured input w/ easy initial parse. Still easy to represent structured input in non-XML and easy to write parsers for such input. Even if we keep XML internally, the user does not need to be exposed to it.
2) Flatten structured input to the extent possible. Widely used codes have flatter input profiles because they are easier to use IMHO. This also forces more effort into providing only the minimum number of user-facing parameter/structures with the underlying code being a little smarter (e.g. how much info/text do we really need to provide to request a reasonable initial Jastrow w/ 1,2,3 body terms?).
3) Initial document read handled centrally, with full input file spec. provided (as in Nexus). This makes the full input file transparent, and additions easy to make and very visible to all other developers. Unrecognized/unread names and invalid parameter values caught at this point. Can also easily enforce required vs. optional parameters and default values in this central space. Centralized input spec. also makes it easier to verify that the manual and in-code input spec. are currently synced (if someone tries to answer this question a year from now with the current "in place" XML parsing, they will end up diving through lots of files and code). Support for older XML input provided via a mapping function onto the new input.
4) Initial read parses document into nested set of objects (map-like). Builder classes use these instead of XML to initialize objects. Cleans up any initialization outside of the "read the input file" path such as tests and the code integration that Ray mentions above.
5) Local objects also catch errors in inputted parameters as input can come from the input file and elsewhere. The specification classes used in the general input check can simply be reused to avoid duplication of checking code (i.e. class receiving/being built makes a function call, but otherwise takes the input as is; no duplication).
Thought we complain about the XML, we take the benefit of it as granted. When we consider another input system, all the benefit of XML may be gone and we have to implement these features by ourselves. As step zero, we can augment OhmmsAttributeSet, ParameterSet classes to study and survey the validation scheme.
I respectfully disagree with a good deal of this.
I personally think XML is a nice heirarchical way to define input. It is clear, easy for both humans and machines to parse and has the benefit of a huge amount of infrastructure. It simply enables things like translation between different codes through a common standardization language. It also allows referencing of other documents for free (for example the way that pw2qmcpack's input files look). I agree that the way we write xml is a bit clunky, but simple fixes that make the file much less verbose are easy. For example changing from: ``
I also disagree here. In my opinion the input file should only be flattened when parameters and values are not logically nested. For an example of where I think input files are too nested, I look to quantum espresso. They lack a clear hierarchy and because of this it is difficult to figure out where to put things like nosym=.true.
While I appreciate there are benefits to this and I would like to enable them as possible, I believe that there is a lot of beauty to the current architecture of the code where objects themselves know how to handle their attributes and parameters. I believe it is not in our best interest to separate the specification and implementation of the objects.
This is think is a good idea. I don't think the XML DOM is the most natural way to pass all of this infrastructure around.
Not sure I follow this one, I would be interested in hearing more what you mean.
No problem. My second post is just putting forward what I think the ideal looks like; it would take quite a bit of effort to do what I listed. My first post is more like what I think we may end up doing.
I don't know what kind of process we want to use to come to actionable conclusions about the various ideas/ideals we all express here.
I think there are a number of things that we all already agree on. As far as the next iteration of qmcpack, I suggest we target the following:
I think this would put us on a firmer footing to discuss larger changes to the input format such as what Jaron is suggesting in the future and it would protect the current code architecture.
What does everyone else think? Too conservative of an approach?
For the example given by @lshulen
<parameter name="timestep">0.01</parameter>
to: <timestep>0.01</timestep>
I don't like the mixing:
<estimator name="LocalEnergy" hdf5="no"/>
<timestep>0.01</timestep>
but prefer the old way
<estimator name="LocalEnergy" hdf5="no"/>
<parameter name="timestep">0.01</parameter>
Keep the possibility of adding estimators in later VMC/DMC blocks is still preferred.
@ye-luo I see your point. Another possibility to keep that separation working would be to allow multiple parameters that only take a single argument to be handled together:
``
Could become:
That suggestion aside, I think adding a layer of indirection through a new data structure could ease such transitions as we discuss further at the level of individual tags as necessary.
Creating an intermediate data structure still needs a significant amount of effort. In our OhmmsAttributeSet, ParameterSet classes designed for picking up attributes and parameters. We can already do the lowercase conversion and some level of validation.
Again, General comment: Our initial goal is to work out what is most desirable and optimal, totally independent of the amount of effort required to get there. When we have established both the requirements and "nice to have" features, we will work out how close we can get. We definitely don't want to have to revisit this topic.
Not to ignore Paul's remark, but I have a quick question for @ye-luo
To me it sees that you can think of using OhmmsAttributeSet and ParameterSet as the beginnings of what I was meaning when I suggested making an intermediate data structure. Does that make sense to you?
Has anyone asked any QMCPACK users (especially newer ones) what they think could be better about the interface?
@lshulen I think OhmmsAttributeSet and ParameterSet are the starting places to enforce cases and validation. As @jtkrogel mentioned, this doesn't cover all the places but largely.
Overall, I just feel the problem is actually not XML but how we handle the input information (how we design the input). I think the physics tells us how to speak the XML language. It is clearly in hierarchy not flat.
For a flatten structure or intermediate data structure. Basically we change from 1) A(XML), B(something other input source) -> D (builder) -> E (classes) to 2) A(XML), B(something other input source) -> C (intermediate representation) -> D (builder) -> E (classes) We have extra maintenance pain (C(IR)) and reduce flexibility (A,B not directly reaching D). C needs to cover all the needs of A and B.
For example, Ray's needs of transferring QE orbitals directly via memory to QMCPACK, having C is more restrictive than directly have a B->D builder routine.
I'll list a few generalities that I don't think are specific to QMCPACK here (all IMHO):
Input files should:
Code reading an input file should:
enthusiastic "+1 " to Jaron's latest comment above this one
Looking through the code, some comments:
The driver sections (with
Case handling. There are two functions to convert to lower case - tolower (in OhmmsElementBase.h) that converts a string to lower case in place, and getNodeName (in libxmldefs.h) which reads the name member from an xmlNodePtr and converts it to lower case. The tag handling loops use a few idioms . Mostly the name element is converted to a string and no case conversion is done. In a few places getNodeName or tolower is called, making those tag names case-insensitive. For parameter handling, Parameter set calls tolower so parameter names are case insensitive. OhmmsAttributeSet does not, so attribute names are case sensitive.
For tags, there should probably be another class to handle these and make it uniform. It could handle the case-insensitivity, and checking for unused or extra tags. I'm not sure how the interface would look.
My two cents: I've used QMCPack, PIMC++, and other input styles and after various iterations (including abandoning my own PIMC++ input style), here is what I've converged to as the best approach for my most recent codes including my lattice VMC/DMC code (your mileage may vary)
Input have sections and are in the form of keyword=value. An example input for my code looks something like this:
RunType=OPT
Optimization:
OptType=GRADIENT
StepSize=0.01
NumSteps=1000
EquilSweeps=100
SampleSweeps=1000
END
NumWalkers=1
doping=0.875
MoveType=HOP
WaveFunction:
name=RVB
PairingFunction=PF.dat
END
Hamiltonian:
name=hopping
bondFile=bond.dat
t=1.0
END
Hamiltonian:
name=U
U=4.0
END
ReadParams=False
Then inside the code, the entire input gets read into a InputClass:
InputClass myInput;
myInput.Read(infile);
infile.close();
and the InputClass gets pushed around to objects which can read them as
while (myInput.OpenSection("Hamiltonian",check)){
string hamiltonian_Name=myInput.GetVariable("name");
cout<<"Found Hamiltonian number "<<check<<" called "<<hamiltonian_Name<<endl;
assert(myInput.IsVariable("U"));
double U = myInput.toDouble(myInput.GetVariable("U"))
myInput.CloseSection()
My experience is that this format has been very easy for myself and students to read, make inputs for, and use within codes. In fact, I require my undergraduate students to use it as part of the programs they develop for homework and they have found it straightforward to use.
Within my codes, my general approach has been (roughly) that individual objects are often given an InputClass with a section already open so they can read the variables from that section. These objects can then check that required variables are there, etc.
The major downside of this approach I've found is that I've mainly kept long arrays in separate files (which has some upsides but also some downsides). I've considered adding an "array" type but haven't done it yet.
I'm not sure if this is the right route to go for QMCPack but it's the approach I like most so far.
Thanks Bryan Clark
Bryan, thanks for your comments. This is pretty similar to what I would do.
Thanks Bryan. I think a key trick for QMCPACK will be to separate the input reading/parsing from the use of the data via an intermediate set of classes. This would actually allow us to change from XML inputs in future (if we wanted/dared/needed to) without changing the internals of QMCPACK.
Yes. I agree that's an important first step. Even if you were going to leave the input alone, I think it's still worth consider approaching that goal of separating input from use of data by defining an InputClass with an interface something like:
OpenSection("mySectionName") //(returns true or false if it's there)
OpenSection("mySectionName",i) // (returns true or false if the i'th "mySectionName" is there)
CloseSection()
GetVariable("myVariableName")
IsVariable("myVariableName") //(returns true or false if it's there)
Thanks Bryan Clark
Another motivation for improved input validation #619 . When the input processing is reworked we should be alert for error traps / consistency checks to add.
On the topic of wrapper classes to enable validation, I propose we (eventually) do something similar to the following, in which I introduce a single class for an XML element that will actually allow for validation later (i.e. knowing at least what was read and what was not).
Our current XML reading looks like this (and note that even with our current wrapper classes, we still interact directly with libxml2 in what is below):
xmlNodePtr cur(curXml);
std::string type;
std::string source;
int param1;
RealType param2;
// read attributes w/ dedicated class, no record of what has been read
OhmmsAttributeSet attrib;
attrib.add(type ,"type" );
attrib.add(source,"source");
attrib.put(cur);
// read parameters w/ dedicated class, no record of what has been read
ParameterSet params;
params.add(param1,"param1","int");
params.add(param2,"param2","real");
params.put(cur);
// read child elements
// interact directly w/ libxml2, no wrapper
cur = cur->children;
while (cur != NULL)
{
// interact directly w/ libxml2, no wrapper
std::string cname((const char*)cur->name);
// read attributes w/ dedicated class, no record of what has been read
std::string pot_type = "none";
OhmmsAttributeSet attrib;
attrib.add(pot_type, "type");
if (cname == "pairpot")
{
if (pot_type == "coulomb")
{
// process coulomb pair potential
// no record of read is made
}
else if(pot_type == "mpc")
{
// process mpc pair potential
// no record of read is made
}
else
{
// elements not read are silently ignored
}
else if(cname == "estimator")
{
// process estimators similarly
}
}
I propose moving to a single class that allows validation, but preserves the current read structure (small change to existing code):
InputElement elem(curXml);
std::string type;
std::string source;
int param1;
RealType param2;
// read attributes and document internally what has been read
elem.readAttribute(type ,"type");
elem.readAttribute(source,"source");
// read parameters and document internally what has been read
elem.readParameter(param1,"param1");
elem.readParameter(param2,"param2");
// read child elements
for(auto pair : elem.elements)
{
std::string& name = pair.first;
auto& viewer = pair.second; // view of the element
if(name=="pairpot")
{
// return attribute, mark as viewed
std::string pot_type = "none";
viewer.viewAttribute(pot_type,"type");
if(pot_type=="coulomb")
{
// return child element, mark as read
// also mark viewed attributes/parameters as having been read
InputElement& elem = viewer.read();
// process coulomb pair potential
}
else if(pot_type=="mpc")
{
// make record of element read
InputElement& elem = viewer.read();
// process mpc pair potential
}
else
{
// in this case nothing has been read or marked as having been read
// can check later if there are unparsed elements
}
}
elif(name=="estimator")
{
// process estimators similarly
}
}
An change in usage patterns to consider: historically all aspects of the wavefunction could be specified in the "input file" XML. However we have been trending towards HDF5 for just about everything since the number of basis functions, determinants etc. is often large even for calculations with low numbers of electrons.
I think the XML will be "easy" to change to yaml or key value with blocks once the main design issue with the input is dealt with. So specifics of the XML could be improved but from a application architecture standpoint are a separate task that should consider usability within the constraints that a sane application architecture require.
So what are those constraints. Here another stab at that.
Lets call this set of datastructures the QMCPACK input structures (QIS). Going forward new objects will never take xmlnode pointers as constructor or put arguments, only QIS or other native types.
How does this effect the input level?
In light of #2902 I'd like to restate that one goal I have for the input classes is that it be possible to contstruct more QMC input's in C++ without parsing xml.
The initialization of its variables to default or "zero" values should take place in one spot and not require parsing to occur. For use of static analysis tools and even for humans to read the code its better that this occurs using c++ language features and not requiring actual runtime execution. I.e. ParameterSet doing validation (fine) ParameterSet being the method of default intitialization (sketchy).
see QMCDriverInput.h
@PDoakORNL agreed.
For the input parsing/storing classes, I think code similar to the following should be used in the base class (map/hash table of arbitrary type) in order to minimize boiler plate code (C++17):
std::map<std::string, std::any> notebook;
std::string name{ "Pluto" };
int year = 2015;
notebook["PetName"] = name;
notebook["Born"] = year;
std::string name2 = std::any_cast<std::string>(notebook["PetName"]); // = "Pluto"
int year2 = std::any_cast<int>(notebook["Born"]); // = 2015
A good time to revisit this will be following the eventual deletion of legacy drivers.
This issue serves to gather requirements and suggestions for revisions to the input system, and also to gather known problems with the existing system. I envisage revising the situation around v4.0.0, i.e. not immediately, but within a year.
Please add comments.
The current system of input settings and overall XML handling is showing strains. From the user perspective, we lack validation, silently ignore typos/errors, lack sufficient defaults, and don't have a way to show all the settings that are set or used by a given input. Many input files specify properties that should be defaults or that we even don't read, e.g. "version" information, although on occasion the version settings are actually important. The inputs are therefore much more cluttered than necessary. As new functionality is added or older/less used functionality is reactivated, this is an increasing problem.
From the computational side, the current implementation lacks separation of responsibilities between object instantiation and input handling making construction of unit tests difficult. There are also no clear docs explaining this aspect of the code or the expected standards, which anyone seeking to add an observable would need to know. We also know the current implementation of XML parsing leaks memory.
Related issues, most of which need to be triaged ahead of a bigger overhaul include:
Relocate command line flags: #146 Input setting validation: #228 Input robustness in Jastrow specification: #282 New capabilities (twist averaging in one input): #147 Defaults: #19 (and really defaults for nearly all settings in the simulation) Verbosity setting: #404