QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org
Other
285 stars 135 forks source link

Goals for overhaul of all aspects of XML input system #407

Open prckent opened 6 years ago

prckent commented 6 years ago

This issue serves to gather requirements and suggestions for revisions to the input system, and also to gather known problems with the existing system. I envisage revising the situation around v4.0.0, i.e. not immediately, but within a year.

Please add comments.

The current system of input settings and overall XML handling is showing strains. From the user perspective, we lack validation, silently ignore typos/errors, lack sufficient defaults, and don't have a way to show all the settings that are set or used by a given input. Many input files specify properties that should be defaults or that we even don't read, e.g. "version" information, although on occasion the version settings are actually important. The inputs are therefore much more cluttered than necessary. As new functionality is added or older/less used functionality is reactivated, this is an increasing problem.

From the computational side, the current implementation lacks separation of responsibilities between object instantiation and input handling making construction of unit tests difficult. There are also no clear docs explaining this aspect of the code or the expected standards, which anyone seeking to add an observable would need to know. We also know the current implementation of XML parsing leaks memory.

Related issues, most of which need to be triaged ahead of a bigger overhaul include:

Relocate command line flags: #146 Input setting validation: #228 Input robustness in Jastrow specification: #282 New capabilities (twist averaging in one input): #147 Defaults: #19 (and really defaults for nearly all settings in the simulation) Verbosity setting: #404

lshulen commented 6 years ago

I've had this on my radar for a long time. For instance, I went back and forth with Jeongnim and Ken about having validation for the input file in 2008. That said, every time I've looked into it, I've been put off by the potential cost versus the benefits.

The distributed nature of the parsing and construction makes it quite difficult to even collect all of the current valid keywords. This also makes it quite difficult to deal with the "silently ignoring typos" issue because it can be problematic to see exactly which class has the final responsibility for parsing a given section.

That said, I feel from the code side of things that it is rather elegant to structure things this way. This was one of the main reasons that I started writing setup-qmc to handle writing input files as it sidesteps many of these issues.

The best path forward that I've come up with is to define a strict xml schema and to have the xml parser validate against this. Then every time new code was written, it would be required to update the schema accordingly. This could preserve both the current input structure and code architecture.

What I'm quite afraid of is that this could be a real rabbit hole that in addition to sucking up a lot of valuable programmer time could also break a lot of legacy input files and become a real headache for scientific application of qmcpack.

My suggestion would be that before anything is done, a detailed plan of how this could be tackled both in the code and its implications both for users and legacy input should be presented. Then it would be important to get agreement among at least the core developers before putting any plan into action that could affect the interface.

grahamlopez commented 6 years ago

Handling legacy inputs doesn't necessarily need to constrain future changes; e.g. maybe a converter could be provided.

markdewing commented 6 years ago

Splitting the XML parsing from the object construction might offer a path to validation as well. If the code is run with just parsing, but no object construction, that would validate the input. (I don't know if there are any dependencies that would cause a problem - some object needs to be constructed before another section could be parsed?)

lshulen commented 6 years ago

The current system allows for input to cascade from one object to its subobjects or objects that inherit from it. You certainly could mirror this with a separate parsing hierarchy, but I think it requires duplicating code.

I do certainly appreciate the desirability of a dry run like capability. This could make the code more usable. It would also be a good place to put things about memory requirements etc. I think this could be implemented separately though.

prckent commented 6 years ago

XML parsing is not parallelized, i.e. is replicated.

naromero77 commented 6 years ago

You mean each rank reads the XML input file? Not good.

The right way to do this is read on master and then broadcast to rest of the ranks.

Two additional pieces of information that may be of interest

  1. Some of our libraries append lead to a DAG in Spack that is fragile. libXML is one of those libraries (see below). pkconfig and xz are pain to compile, mostly due to autoconf failures.

  2. MADNESS uses TinyXML, not sure if it useful -- but really small and self-contained if we wish to remove libXML from the equation. https://github.com/m-a-d-n-e-s-s/madness/tree/master/src/madness/external/tinyxml

[alcfwl120:~] naromero% spack spec qmcpack~mpi
Input spec
--------------------------------
qmcpack~mpi

Normalized
--------------------------------
qmcpack~mpi
    ^blas
    ^boost
    ^cmake@3.4.3:
    ^espresso@5.3.0~elpa~mpi~scalapack
        ^fftw~mpi
        ^lapack
    ^hdf5~mpi
        ^zlib@1.1.2:
    ^libxml2
        ^pkg-config@0.9.0:
        ^xz
jngkim commented 6 years ago

Common guys. What is the point of parallelizing XML? It was not meant to be big and you are moving to h5 for them anyway.

It was chosen because it can explain the content and neutral so that other codes can communicate with qmcpack. None of that happened so far. Let's see how Molssi people can do better.

On Wed, Oct 11, 2017 at 2:39 PM, Nichols A. Romero <notifications@github.com

wrote:

You mean each rank reads the XML input file? Not good.

The right way to do this is read on master and then broadcast to rest of the ranks.

Two additional pieces of information that may be of interest

1.

Some of our libraries append lead to a DAG in Spack that is fragile. libXML is one of those libraries (see below). pkconfig and xz are pain to compile, mostly due to autoconf failures. 2.

MADNESS uses TinyXML, not sure if it useful -- but really small and self-contained if we wish to remove libXML from the equation. https://github.com/m-a-d-n-e-s-s/madness/tree/master/src/ madness/external/tinyxml

[alcfwl120:] naromero% spack spec qmcpackmpi Input spec

qmcpack~mpi Normalized

qmcpackmpi ^blas ^boost ^cmake@3.4.3: ^espresso@5.3.0elpampiscalapack ^fftwmpi ^lapack ^hdf5mpi ^zlib@1.1.2: ^libxml2 ^pkg-config@0.9.0: ^xz

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/QMCPACK/qmcpack/issues/407#issuecomment-335956824, or mute the thread https://github.com/notifications/unsubscribe-auth/AJCzBgNIfe7nhK2V3KG0KCL9MpURj9Coks5srTWSgaJpZM4PwkA- .

ye-luo commented 6 years ago

As long as the heavy stuff are in h5 and loaded in proper ways, XML should be small and can be handled anyway we like. For convenience, replicated parsing on every node is good. Parsing by rank 0 and bcasting every control flag are slow and need to code many lines. If reading from disk is a problem, we can just load the whole small XML file in memory and bcast it.

ye-luo commented 6 years ago

@naromero77 Can we just put a stopper saying that don't build the libxml2 and take the one provided by the system?

prckent commented 6 years ago

Handling of case sensitivity is inconsistent.

Sometimes but not always do we check different CapItAliZations. Var vs var etc.

naromero77 commented 6 years ago

@ye-luo yes, you can point Spack to particular standard library installations. Yes, replicated parsing is convenient until it causes a problem -- Low priority compared to other issues.

grahamlopez commented 6 years ago

I suggest making everything case insensitive for better usability. Parameters with different meanings that differ only by case is confusing.

jtkrogel commented 6 years ago

Revamp of the input system should include at least three improvements, IMO, and anticipate the later addition of a fourth:

1) Enforce case insensitivity of xml element/attribute/parameter names. 2) Exit with an error if unknown elements/attributes/parameters are encountered. 3) Exit with an error if there are any elements/attributes/parameters which are unread. [4) Exit with an error if inputted values for element body text, attributes, or parameters are incorrect.]

Goal should be to do this will as minimal modification (and thus effort) as possible. Current obstacles:

To (1): XML element names are matched only in nested if statements, so case insensitivity cannot be enforced by a small/modular change of a single class/function as it can be for attributes (OhmmsAttributeSet) and parameters (ParameterSet).

To (2): Knowledge of all element/attribute/parameter names must stored and checked somewhere. (2) could possibly be handled by a single (centralized) XML schema + check. Alternatively, the QMCPACK class that begins the read of an XML element could be responsible for knowing just the names of the allowed XML elements/attributes/parameters and check them (decentralized, possible for attributes/parameters through modifications to OhmmsAttributeSet/ParameterSet as in (1)).

To (3): Cannot be achieved via schema, as what is read or not is local to each class, and this changes with time. Possible minimal modification (no change to class level code already present) would be to rewrap a libxml2 xmlNode in a (derived?) class, allowing for metadata (map(s) tracking whether elements/attributes/parameters have been read) along w/ modifications to OhmmAttributeSet/ParameterSet to mark in the metadata which reads have occurred.

All three in any event require more extensive modifications in the case of XML elements themselves (add ElementSet?) and I'm unsure if a minimally invasive route exists for these.

lshulen commented 6 years ago

One quick point,

I do not think aborting on an unused tag should be default. This could break numerous legacy input files. Rather, I think this should be something that can be turned on via a flag or attribute or command line argument.

Luke

Sent from my iPhone

On Oct 23, 2017, at 8:16 AM, jtkrogel notifications@github.com<mailto:notifications@github.com> wrote:

Revamp of the input system should include at least three improvements, IMO, and anticipate the later addition of a fourth:

  1. Enforce case insensitivity of xml element/attribute/parameter names.
  2. Exit with an error if unknown elements/attributes/parameters are encountered.
  3. Exit with an error if there are any elements/attributes/parameters which are unread. [4) Exit with an error if inputted values for parameters are incorrect.]

Goal should be to do this will as minimal modification (and thus effort) as possible. Current obstacles:

To (1): xml element names are matched only in nested if statements, so case insensitivity cannot be enforced by a small/modular change of a single class/function as it can be for attributes (OhmmsAttributeSet) and parameters (ParameterSet).

To (2): Knowledge of all element/attribute/parameter names must stored and checked somewhere. (2) could possibly be handled by a single (centralized) XML schema + check. Alternatively, the QMCPACK class that begins the read of an xml element could be responsible for knowing just the names of the allowed xml elements/attributes/parameters and check them (decentralized, possible for attributes/parameters through modifications to OhmmsAttributeSet/ParameterSet as in (1)).

To (3): Cannot be achieved via schema, as what is read or not is local to each class, and this changes with time. Possible minimal modification (no change to class level code already present) would be to rewrap a libxml2 xmlNode in a (derived?) class, allowing for metadata (map(s) tracking whether elements/attributes/parameters have been read) along w/ modifications to OhmmAttributeSet/ParameterSet to mark in the metadata which reads have occurred.

All three in any event require more extensive modifications in the case of XML elements themselves (add ElementSet?) and I'm unsure if a minimally invasive route exists for these.

- You are receiving this because you were assigned. Reply to this email directly, view it on GitHubhttps://github.com/QMCPACK/qmcpack/issues/407#issuecomment-338673062, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AXMsFe4fpITGuliQkFY0ZBNlPtTQkX4Kks5svJ-BgaJpZM4PwkA-.

prckent commented 6 years ago

General comment: Our initial goal is to work out what is most desirable and optimal, totally independent of the amount of effort required to get there. When we have established both the requirements and "nice to have" features, we will work out how close we can get. We definitely don't want to have to revisit this topic.

rcclay commented 6 years ago

Wishlist: Depending on how invasive the XML overhaul is going to be, I would like to create an abstraction layer between the initialization of code objects and the file I/O. Something like (<-- denoting derived from) QMCIOStream<--XMLParser. Makes it significantly easier to initialize QMCPACK from things other than the XML file... from another code for instance.

jtkrogel commented 6 years ago

On ideals. Any part of the below is an improvement (IMHO):

1) User facing input file is no longer XML. XML is confusing to read/write and has too many extraneous delimiting characters for simple, error free composability (e.g. via text editor). XML additionally invites putting simple name/value pairs as attributes, with increasing pressure to move them to the body (e.g. as parameters) when the count gets too high. Parameters and attributes are essentially the same, and therefore should be represented as such in non-XML (e.g. timestep = 0.01, not <parameter name="timestep"> 0.01 </parameter>). XML was used initially to allow structured input w/ easy initial parse. Still easy to represent structured input in non-XML and easy to write parsers for such input. Even if we keep XML internally, the user does not need to be exposed to it.

2) Flatten structured input to the extent possible. Widely used codes have flatter input profiles because they are easier to use IMHO. This also forces more effort into providing only the minimum number of user-facing parameter/structures with the underlying code being a little smarter (e.g. how much info/text do we really need to provide to request a reasonable initial Jastrow w/ 1,2,3 body terms?).

3) Initial document read handled centrally, with full input file spec. provided (as in Nexus). This makes the full input file transparent, and additions easy to make and very visible to all other developers. Unrecognized/unread names and invalid parameter values caught at this point. Can also easily enforce required vs. optional parameters and default values in this central space. Centralized input spec. also makes it easier to verify that the manual and in-code input spec. are currently synced (if someone tries to answer this question a year from now with the current "in place" XML parsing, they will end up diving through lots of files and code). Support for older XML input provided via a mapping function onto the new input.

4) Initial read parses document into nested set of objects (map-like). Builder classes use these instead of XML to initialize objects. Cleans up any initialization outside of the "read the input file" path such as tests and the code integration that Ray mentions above.

5) Local objects also catch errors in inputted parameters as input can come from the input file and elsewhere. The specification classes used in the general input check can simply be reused to avoid duplication of checking code (i.e. class receiving/being built makes a function call, but otherwise takes the input as is; no duplication).

ye-luo commented 6 years ago

Thought we complain about the XML, we take the benefit of it as granted. When we consider another input system, all the benefit of XML may be gone and we have to implement these features by ourselves. As step zero, we can augment OhmmsAttributeSet, ParameterSet classes to study and survey the validation scheme.

lshulen commented 6 years ago

I respectfully disagree with a good deal of this.

  1. I personally think XML is a nice heirarchical way to define input. It is clear, easy for both humans and machines to parse and has the benefit of a huge amount of infrastructure. It simply enables things like translation between different codes through a common standardization language. It also allows referencing of other documents for free (for example the way that pw2qmcpack's input files look). I agree that the way we write xml is a bit clunky, but simple fixes that make the file much less verbose are easy. For example changing from: ``

    0.01 to: 0.01 ``
  2. I also disagree here. In my opinion the input file should only be flattened when parameters and values are not logically nested. For an example of where I think input files are too nested, I look to quantum espresso. They lack a clear hierarchy and because of this it is difficult to figure out where to put things like nosym=.true.

  3. While I appreciate there are benefits to this and I would like to enable them as possible, I believe that there is a lot of beauty to the current architecture of the code where objects themselves know how to handle their attributes and parameters. I believe it is not in our best interest to separate the specification and implementation of the objects.

  4. This is think is a good idea. I don't think the XML DOM is the most natural way to pass all of this infrastructure around.

  5. Not sure I follow this one, I would be interested in hearing more what you mean.

jtkrogel commented 6 years ago

No problem. My second post is just putting forward what I think the ideal looks like; it would take quite a bit of effort to do what I listed. My first post is more like what I think we may end up doing.

I don't know what kind of process we want to use to come to actionable conclusions about the various ideas/ideals we all express here.

lshulen commented 6 years ago

I think there are a number of things that we all already agree on. As far as the next iteration of qmcpack, I suggest we target the following:

  1. Create an intermediate data structure that would be generated by the parser, at this stage things like taking all names to lower case could be done.
  2. Instrument the intermediate data structure so that it knows when tags / values have been used so that it can report (or if desired exit) when tags are unused.
  3. Update all put methods in the code to use the new intermediate data structure.

I think this would put us on a firmer footing to discuss larger changes to the input format such as what Jaron is suggesting in the future and it would protect the current code architecture.

What does everyone else think? Too conservative of an approach?

ye-luo commented 6 years ago

For the example given by @lshulen <parameter name="timestep">0.01</parameter> to: <timestep>0.01</timestep> I don't like the mixing:

<estimator name="LocalEnergy" hdf5="no"/>
<timestep>0.01</timestep>

but prefer the old way

<estimator name="LocalEnergy" hdf5="no"/>
<parameter name="timestep">0.01</parameter>

Keep the possibility of adding estimators in later VMC/DMC blocks is still preferred.

lshulen commented 6 years ago

@ye-luo I see your point. Another possibility to keep that separation working would be to allow multiple parameters that only take a single argument to be handled together:

``

0.01 64 10

Could become: ``

That suggestion aside, I think adding a layer of indirection through a new data structure could ease such transitions as we discuss further at the level of individual tags as necessary.

ye-luo commented 6 years ago

Creating an intermediate data structure still needs a significant amount of effort. In our OhmmsAttributeSet, ParameterSet classes designed for picking up attributes and parameters. We can already do the lowercase conversion and some level of validation.

prckent commented 6 years ago

Again, General comment: Our initial goal is to work out what is most desirable and optimal, totally independent of the amount of effort required to get there. When we have established both the requirements and "nice to have" features, we will work out how close we can get. We definitely don't want to have to revisit this topic.

lshulen commented 6 years ago

Not to ignore Paul's remark, but I have a quick question for @ye-luo

To me it sees that you can think of using OhmmsAttributeSet and ParameterSet as the beginnings of what I was meaning when I suggested making an intermediate data structure. Does that make sense to you?

grahamlopez commented 6 years ago

Has anyone asked any QMCPACK users (especially newer ones) what they think could be better about the interface?

ye-luo commented 6 years ago

@lshulen I think OhmmsAttributeSet and ParameterSet are the starting places to enforce cases and validation. As @jtkrogel mentioned, this doesn't cover all the places but largely.

Overall, I just feel the problem is actually not XML but how we handle the input information (how we design the input). I think the physics tells us how to speak the XML language. It is clearly in hierarchy not flat.

ye-luo commented 6 years ago

For a flatten structure or intermediate data structure. Basically we change from 1) A(XML), B(something other input source) -> D (builder) -> E (classes) to 2) A(XML), B(something other input source) -> C (intermediate representation) -> D (builder) -> E (classes) We have extra maintenance pain (C(IR)) and reduce flexibility (A,B not directly reaching D). C needs to cover all the needs of A and B.

For example, Ray's needs of transferring QE orbitals directly via memory to QMCPACK, having C is more restrictive than directly have a B->D builder routine.

jtkrogel commented 6 years ago

I'll list a few generalities that I don't think are specific to QMCPACK here (all IMHO):

Input files should:

  1. Be easy to read and understand a. Use as few characters as possible without losing information (low markup, etc.) b. Use a low level of nesting, perhaps departing from internal code structure for the sake of usability c. Have a low level of implicitness (e.g. few unnamed structured inputs, array data, etc.) d. Require as few input variables as possible for correct operation
  2. Be easy to write correctly a. Be case insensitive b. Be white space insensitive, within reason (e.g. C may have gone too far here) c. Be insensitive to input section/variable order to the extent possible d. Follow principles in 1 a-d, as they all help minimize composition error (An input file fails this test if the following advice applies: "Users must start from another input file that worked")

Code reading an input file should:

  1. Be able recognize and guard against unrecognized names
  2. Ensure all the input was read and used (allowing for deprecation/backwards compatibility)
  3. Ensure input values are valid
  4. Write instructive messages (including input file line numbers) when user mistakes are made
grahamlopez commented 6 years ago

enthusiastic "+1 " to Jaron's latest comment above this one

markdewing commented 6 years ago

Looking through the code, some comments:

The driver sections (with tags) are currently parsed and immediately executed. Instead, all the driver sections should be parsed, loaded into a data structure, and executed from that data structure. This will ensure all the sections are validated before execution starts. Are there any dependencies that would prevent parsing every section at the beginning (ie, do any of the drivers depend on the results of a previous driver for setup?)

Case handling. There are two functions to convert to lower case - tolower (in OhmmsElementBase.h) that converts a string to lower case in place, and getNodeName (in libxmldefs.h) which reads the name member from an xmlNodePtr and converts it to lower case. The tag handling loops use a few idioms . Mostly the name element is converted to a string and no case conversion is done. In a few places getNodeName or tolower is called, making those tag names case-insensitive. For parameter handling, Parameter set calls tolower so parameter names are case insensitive. OhmmsAttributeSet does not, so attribute names are case sensitive.

For tags, there should probably be another class to handle these and make it uniform. It could handle the case-insensitivity, and checking for unused or extra tags. I'm not sure how the interface would look.

bkclark commented 6 years ago

My two cents: I've used QMCPack, PIMC++, and other input styles and after various iterations (including abandoning my own PIMC++ input style), here is what I've converged to as the best approach for my most recent codes including my lattice VMC/DMC code (your mileage may vary)

Input have sections and are in the form of keyword=value. An example input for my code looks something like this:

RunType=OPT
Optimization:
  OptType=GRADIENT
  StepSize=0.01
  NumSteps=1000
  EquilSweeps=100
  SampleSweeps=1000
END
NumWalkers=1
doping=0.875
MoveType=HOP
WaveFunction:
     name=RVB
         PairingFunction=PF.dat
END
Hamiltonian:
     name=hopping
     bondFile=bond.dat
     t=1.0
END
Hamiltonian:
     name=U
     U=4.0
END
ReadParams=False

Then inside the code, the entire input gets read into a InputClass:

InputClass myInput;
myInput.Read(infile);
infile.close();

and the InputClass gets pushed around to objects which can read them as

while (myInput.OpenSection("Hamiltonian",check)){
     string hamiltonian_Name=myInput.GetVariable("name");
     cout<<"Found Hamiltonian number "<<check<<"  called "<<hamiltonian_Name<<endl;
     assert(myInput.IsVariable("U"));
     double U = myInput.toDouble(myInput.GetVariable("U"))
     myInput.CloseSection()

My experience is that this format has been very easy for myself and students to read, make inputs for, and use within codes. In fact, I require my undergraduate students to use it as part of the programs they develop for homework and they have found it straightforward to use.

Within my codes, my general approach has been (roughly) that individual objects are often given an InputClass with a section already open so they can read the variables from that section. These objects can then check that required variables are there, etc.

The major downside of this approach I've found is that I've mainly kept long arrays in separate files (which has some upsides but also some downsides). I've considered adding an "array" type but haven't done it yet.

I'm not sure if this is the right route to go for QMCPack but it's the approach I like most so far.

Thanks Bryan Clark

jtkrogel commented 6 years ago

Bryan, thanks for your comments. This is pretty similar to what I would do.

prckent commented 6 years ago

Thanks Bryan. I think a key trick for QMCPACK will be to separate the input reading/parsing from the use of the data via an intermediate set of classes. This would actually allow us to change from XML inputs in future (if we wanted/dared/needed to) without changing the internals of QMCPACK.

bkclark commented 6 years ago

Yes. I agree that's an important first step. Even if you were going to leave the input alone, I think it's still worth consider approaching that goal of separating input from use of data by defining an InputClass with an interface something like:

OpenSection("mySectionName") //(returns true or false if it's there)
OpenSection("mySectionName",i) // (returns true or false if the i'th "mySectionName" is there)
CloseSection()
GetVariable("myVariableName")
IsVariable("myVariableName") //(returns true or false if it's there)

Thanks Bryan Clark

prckent commented 6 years ago

Another motivation for improved input validation #619 . When the input processing is reworked we should be alert for error traps / consistency checks to add.

jtkrogel commented 4 years ago

On the topic of wrapper classes to enable validation, I propose we (eventually) do something similar to the following, in which I introduce a single class for an XML element that will actually allow for validation later (i.e. knowing at least what was read and what was not).

Our current XML reading looks like this (and note that even with our current wrapper classes, we still interact directly with libxml2 in what is below):

xmlNodePtr cur(curXml);

std::string type;
std::string source;
int         param1;
RealType    param2;

// read attributes w/ dedicated class, no record of what has been read
OhmmsAttributeSet attrib;
attrib.add(type  ,"type"  );
attrib.add(source,"source");
attrib.put(cur);

// read parameters w/ dedicated class, no record of what has been read
ParameterSet params;
params.add(param1,"param1","int");
params.add(param2,"param2","real");
params.put(cur);

// read child elements
// interact directly w/ libxml2, no wrapper
cur = cur->children;
while (cur != NULL)
{
  // interact directly w/ libxml2, no wrapper
  std::string cname((const char*)cur->name);

  // read attributes w/ dedicated class, no record of what has been read
  std::string pot_type = "none";
  OhmmsAttributeSet attrib;
  attrib.add(pot_type, "type");

  if (cname == "pairpot")
  {
    if (pot_type == "coulomb")
    {
      // process coulomb pair potential
      //  no record of read is made
    }
    else if(pot_type == "mpc")
    {
      // process mpc pair potential
      //  no record of read is made
    }
    else
    {
      // elements not read are silently ignored
    }
  else if(cname == "estimator")
  {
    // process estimators similarly
  }
}

I propose moving to a single class that allows validation, but preserves the current read structure (small change to existing code):

InputElement elem(curXml);

std::string type;
std::string source;
int         param1;
RealType    param2;

// read attributes and document internally what has been read
elem.readAttribute(type  ,"type");
elem.readAttribute(source,"source");

// read parameters and document internally what has been read
elem.readParameter(param1,"param1");
elem.readParameter(param2,"param2");

// read child elements
for(auto pair : elem.elements)
{
  std::string& name = pair.first;
  auto& viewer      = pair.second; // view of the element
  if(name=="pairpot")
  {
    // return attribute, mark as viewed 
    std::string pot_type = "none";
    viewer.viewAttribute(pot_type,"type"); 
    if(pot_type=="coulomb")
    {
      // return child element, mark as read 
      //   also mark viewed attributes/parameters as having been read
      InputElement& elem = viewer.read(); 
      // process coulomb pair potential
    }
    else if(pot_type=="mpc")
    {
      // make record of element read
      InputElement& elem = viewer.read();
      // process mpc pair potential
    }
    else
    {
      // in this case nothing has been read or marked as having been read
      //   can check later if there are unparsed elements
    }
  }
  elif(name=="estimator")
  {
    // process estimators similarly
  }
}
prckent commented 4 years ago

An change in usage patterns to consider: historically all aspects of the wavefunction could be specified in the "input file" XML. However we have been trending towards HDF5 for just about everything since the number of basis functions, determinants etc. is often large even for calculations with low numbers of electrons.

PDoakORNL commented 4 years ago

2024

2007

I think the XML will be "easy" to change to yaml or key value with blocks once the main design issue with the input is dealt with. So specifics of the XML could be improved but from a application architecture standpoint are a separate task that should consider usability within the constraints that a sane application architecture require.

So what are those constraints. Here another stab at that.

  1. The underlying structure of the input is a tree.
  2. Each node of the tree has one and only one consumer.
  3. That consumer needs to convert every leaf of any node it consumes to elements of a native data structure. Any leaf it cannot parse constitutes a fatal error.
  4. An input consumer may pass child nodes to other input consumers. A node it cannot parse or pass to another input consumer constitutes a fatal error.
  5. Whenever possible input structures should be static and descriptive, not dynamic maps or trees themselves.

Lets call this set of datastructures the QMCPACK input structures (QIS). Going forward new objects will never take xmlnode pointers as constructor or put arguments, only QIS or other native types.

How does this effect the input level?

PDoakORNL commented 3 years ago

In light of #2902 I'd like to restate that one goal I have for the input classes is that it be possible to contstruct more QMC input's in C++ without parsing xml.

The initialization of its variables to default or "zero" values should take place in one spot and not require parsing to occur. For use of static analysis tools and even for humans to read the code its better that this occurs using c++ language features and not requiring actual runtime execution. I.e. ParameterSet doing validation (fine) ParameterSet being the method of default intitialization (sketchy).

see QMCDriverInput.h

prckent commented 3 years ago

@PDoakORNL agreed.

jtkrogel commented 2 years ago

For the input parsing/storing classes, I think code similar to the following should be used in the base class (map/hash table of arbitrary type) in order to minimize boiler plate code (C++17):

    std::map<std::string, std::any> notebook;

    std::string name{ "Pluto" };
    int year = 2015;

    notebook["PetName"] = name;
    notebook["Born"] = year;

    std::string name2 = std::any_cast<std::string>(notebook["PetName"]); // = "Pluto"
    int year2 = std::any_cast<int>(notebook["Born"]); // = 2015

A good time to revisit this will be following the eventual deletion of legacy drivers.