Cantera / enhancements

Repository for proposed and ongoing enhancements to Cantera
11 stars 5 forks source link

What should we do with all our input files? #22

Closed bryanwweber closed 7 months ago

bryanwweber commented 4 years ago

Abstract

We have many input files that are just committed to the repository and others that are distributed with Cantera. We should discuss which of these should be deprecated and removed, which should be removed without deprecation, and where the files that are staying should live.

Motivation

In Cantera/cantera#768, we determined that several input files should be deprecated and/or removed. However, due to the size of that PR, specific discussion was deferred. This issue is where I propose to have the relevant discussion.

At present, the following input files are committed to the repository:

The list of files in the data folder
data
├── inputs
│   ├── KOH.cti
│   ├── air.cti
│   ├── air.inp
│   ├── airNASA9.cti
│   ├── airNASA9.inp
│   ├── argon.cti
│   ├── argon.inp
│   ├── critProperties.xml
│   ├── diamond.cti
│   ├── elements.xml
│   ├── graphite.cti
│   ├── gri30.cti
│   ├── gri30.inp
│   ├── gri30_highT.cti
│   ├── gri30_ion.cti
│   ├── h2o2.cti
│   ├── h2o2.inp
│   ├── liquidvapor.cti
│   ├── lithium_ion_battery.cti
│   ├── methane_pox_on_pt.cti
│   ├── nDodecane_Reitz.cti
│   ├── nasa.cti
│   ├── nasa_condensed.cti
│   ├── nasa_gas.cti
│   ├── ohn.cti
│   ├── ptcombust.cti
│   ├── silane.cti
│   ├── silane.inp
│   ├── silicon.cti
│   ├── silicon_carbide.cti
│   ├── sofc.cti
│   └── water.cti
├── thermo
│   ├── airDataNASA9.dat
│   ├── gri30_thermo.dat
│   └── nasathermo.dat
├── transport
│   ├── gri30_tran.dat
│   └── misc_tran.dat
├── KOH.yaml
├── air.yaml
├── airNASA9.yaml
├── argon.yaml
├── diamond.yaml
├── graphite.yaml
├── gri30.yaml
├── gri30_highT.yaml
├── gri30_ion.yaml
├── h2o2.yaml
├── liquidvapor.yaml
├── lithium_ion_battery.yaml
├── methane_pox_on_pt.yaml
├── nDodecane_Reitz.yaml
├── nasa.yaml
├── nasa_condensed.yaml
├── nasa_gas.yaml
├── ohn.yaml
├── ptcombust.yaml
├── silane.yaml
├── silicon.yaml
├── silicon_carbide.yaml
├── sofc.yaml
└── water.yaml

There are several "classes" of files here:

Some input files might be cross some of these lines (e.g., liquidvapor.yaml), so the distinctions are a little fuzzy.

Possible Solutions

To get the discussion started, I'll propose the following changes:

  1. Remove all CHEMKIN input files from the data folder. They are (or will shortly be) no longer needed at build time, so they will become clutter. Select files should be moved to test/data for tests of the conversion scripts.
  2. Move input files that are used in one example to the folder with that example. IMO, the word data implies that the files therein are more important somehow, and moving the input files to an example-specific folder removes some of the implicit endorsement that we are giving to these files by distributing them. a. This is mildly complicated because several files are used in several examples across a few interfaces
  3. Data files elements.xml, critProperties.xml, and liquidvapor.yaml remain in the data folder
  4. All other files that are not used in an example are deprecated and removed

I'm not particularly attached to this arrangement, just throwing it out there to get something started. I'm also not particularly attached to this happening on any particular timeline, so discussion is more than welcome.

speth commented 4 years ago
  • Remove all CHEMKIN input files from the data folder. They are (or will shortly be) no longer needed at build time, so they will become clutter. Select files should be moved to test/data for tests of the conversion scripts.

Agreed.

  • Move input files that are used in one example to the folder with that example. IMO, the word data implies that the files therein are more important somehow, and moving the input files to an example-specific folder removes some of the implicit endorsement that we are giving to these files by distributing them. a. This is mildly complicated because several files are used in several examples across a few interfaces

I also worry about the implicit endorsement that we give to certain input files. However, I think splitting the data files up among the examples would lead to a mess of duplicate and near-duplicate files. The test_problems directory might serve as a cautionary warning in this regard, since many of those tests use input files in the same directory as the test, and there are a ton of duplicates. Besides duplication across similar examples in different language interfaces, there are a lot of examples that could (and often already do) use the same input files, for which the shared data directory is undeniably convenient.

  • Data files elements.xml, critProperties.xml, and liquidvapor.yaml remain in the data folder

Agreed. elements.xml is destined for removal with the end of the XML format, with no need for a YAML replacement.

  • All other files that are not used in an example are deprecated and removed

Agreed. I assume that the plan would be to add deprecation warnings to the affected .yaml input files, and just let all of the .cti and .xml files go silently when those formats are removed in Cantera 3.0?

ischoegl commented 4 years ago

For CHEMKIN input, I do think that there is a place in examples (rather than removing them): see proposed input example in Cantera/cantera#777

ischoegl commented 4 years ago

Regarding sample input files, one option would be to be explicit in the description fields that they are not meant for production. I drafted a description for Cantera/cantera#789 to provide input to this discussion.

bryanwweber commented 4 years ago

For example/sample input files, what about adding a folder: samples/input-files that is copied into the appropriate build directories.

For example, instead of putting the example input files in build/python/cantera/data, they could go in build/python/cantera/data/samples or build/python/cantera/examples/input-files. The trouble with the latter of these is that that folder wouldn't be on the search path; I imagine we could either add it in every example file (noise, don't like it) or add that folder to the search path in __init__.py (then it is always available, but that seems fine).

I'm not sure how this would work for the other interfaces, though. Maybe just adding a samples subdirectory below the top-level data folder? Is the search for input files recursive in the data directory?

I recognize this is basically a semantic difference, so I don't want to draw out a long discussion. Just trying to think of any simple alternatives, in addition to or in place of editing the files themselves with more context.

I assume that the plan would be to add deprecation warnings to the affected .yaml input files, and just let all of the .cti and .xml files go silently when those formats are removed in Cantera 3.0?

Yes, this was what I was thinking.

ischoegl commented 4 years ago

Not having input files on the search path would likely be confusing to many people. An alternative 'samples' folder could be added to the search path (in case it's not recursive already), but there wouldn't be any difference in user handling.

I honestly only see two alternatives:

  1. Implement SConstruct in a way that it automatically distributes input files to various locations (e.g. use place-holder files that are replaced as needed during the build process)
  2. Move everything to the data folder, and assume that users will read the description block. Long-term, it would be great if the description field is accessible from the Python front end.

Option (1) has the advantage that input files won't be available outside of the folder a sample resides in, whereas (2) is simple and easy to maintain (while assuming some user awareness).

PS: adding a relative path to input files is a third option (3), which however reduces code legibility

PPS: Option (1) also has the disadvantage to clutter folders containing multiple examples (e.g. in Python).

bryanwweber commented 4 years ago

@ischoegl

Not having input files on the search path would likely be confusing to many people.

Yes, I agree, and I think we should avoid doing that 😄

Let me clarify my suggestion (as it has become clearer to me in the last few hours). I am suggesting that we commit files like nDodecane_Reitz.yaml to a new folder samples/input-files (alternatively, data/samples but I prefer the former). At build time, the files in this folder are copied to build/data/samples and then get installed into the proper folders (e.g., for Python they would go to cantera/data/samples). The only thing that might need to change within Cantera is that the data directory is searched recursively, a change that should be transparent to most users. IMO, this has the (my) intended effect of clarifying that these input files are included for convenience and not because we endorse them as fundamentally correct. But like I said, this is arguing semantics and is quite subtle (especially to an international user base), so I won't pursue this idea further if we think it doesn't actually clarify the distinction I'm trying to make.

As another point, I think we should try to reduce the overall number of example-related input files in the repository. For instance, we have two different versions of Deutshmann's methane-on-Pt mechanism. I don't see a reason why the example using the older ptcombust.yaml shouldn't be updated to use methane_pox.yaml, and the former file deprecated and removed. Other examples can likely have consolidated input files (perhaps using GRI-3.0 or another file we plan to keep) so we can reduce the total number of example input files.

Last, for the record, the case that best represents what I really want from this change is that these types of input files go into build/python/cantera/examples/input-files, which is then added to the search path in __init__.py. The problem with this approach is that it doesn't work for the other interfaces, I don't think, so it is kind of a dead-end.

Regarding Cantera/cantera#789, it is likely that the best place for that input file to go will be this new structure that we're discussing here, or else the sample should be converted to use a different input file (I don't recall which phases/thermo are specified in that example, so I'm not sure if we have an existing data file that would suffice).

ischoegl commented 4 years ago

@bryanwweber ... I see what you’re trying to accomplish, but I am unsure that the differentiation won’t be lost on the users. If the search path is augmented, then there won’t be a difference in usage, while it will be more difficult for users to locate the actual data folder. I.e. I’m not sure that people will pick up on this suggested change in the intended way.

Since your suggestion is closer to (~2~1), let me provide additional detail for a potential alternative implementation: (a) collect all sample input files in a central location (nDodecane_Reitz.yaml, etc.) as you suggest, (b) create empty placeholder files, e.g. nDodecane_Reitz.input, in sample folders where they are used, and (c) useSConstruct to replace the placeholder files with the actual input files during the build process. That way there will be a guarantee that all copies are identical.

Overall, I don’t think that the storage location will make a huge difference, as I ultimately believe that a clarification in the yaml description should suffice to steer users away from using input files that are not intended for production.

Further: As was mentioned elsewhere, even GRI30 is not up to par any longer. I guess adding a clarification to aREADME.md file in the data folder may be another part of the solution. Edit: honestly I think (2) is the easier way to handle this.

speth commented 4 years ago

I agree with @ischoegl that moving the files to a new directory but still including that directory on the default search path wouldn't be something most users notice.

As an alternative, what if we moved sample data to the data/samples directory as @bryanwweber suggested, but modified the path search algorithm slightly so that you could load input files in that folder by writing ct.Solution('samples/ptcombust.yaml')? That way, it would be more clear to the user that they are loading sample data rather than some officially-blessed input files. Also, I think it would then be reasonable to expect that those samples could change or be removed in the future without warning.

I think there's a lot of utility in having at least a few input files that can be used right out of the box for reasonable calculations, if not necessarily cutting-edge research, especially if they are from reputable sources, have descriptions of their origins, and are named appropriately (e.g. I think gri30 and nDodecane_Reitz are decent names, while ptcombust is too generic).

ischoegl commented 4 years ago

@speth ... now that you're writing it out, the relative path doesn't look as awkward as I anticipated. I'd still suggest loading things as ct.Solution('sampledata/ptcombust.yaml') or similar as users won't see the root folder name data, and samples is not sufficiently differentiated.

As a likely controversial comment, I am wondering about the rationale between the split of sample locations for different platforms (i.e. C++ samples, Python samples, etc.)? Long-term, it may be simpler to just have one root folder holding all sample variants? In this case, there may be an obvious location for sample data that is shared for all sample versions (which may deviate from the current discussion).

bryanwweber commented 4 years ago

@speth

As an alternative, what if we moved sample data to the data/samples directory as @bryanwweber suggested, but modified the path search algorithm slightly so that you could load input files in that folder by writing ct.Solution('samples/ptcombust.yaml')?

I very much like that solution (no pun intended). We'd just need to make sure that this doesn't hide a local samples folder that the user might have (which I think is being handled elsewhere).

I think there's a lot of utility in having at least a few input files that can be used right out of the box for reasonable calculations

I agree, and I think @ischoegl suggestion to include information about that in the description field is appropriate too. Nonetheless, I think we should make an effort to "deduplicate" some of the files that we have.

@ischoegl

...the rationale between the split of sample locations for different platforms... In this case, there may be an obvious location for sample data that is shared for all sample versions (which may deviate from the current discussion)

I'm not sure what the rationale was at the time it was done 🤷‍♂ I think this does deviate significantly enough from the current discussion that it's worth a separate issue.

ischoegl commented 4 years ago

@bryanwweber ... per your suggestion, I added (edit: and closed) a new issue #27 for discussion.

I think we should make an effort to "deduplicate" some of the files that we have.

I think this is a very good idea.

speth commented 4 years ago

@speth ... now that you're writing it out, the relative path doesn't look as awkward as I anticipated. I'd still suggest loading things as ct.Solution('sampledata/ptcombust.yaml') or similar as users won't see the root folder name data, and samples is not sufficiently differentiated.

Sure, I was mainly attempting to describe the mechanics of the approach, not settle on the specific directory name. But I agree that sampledata would be better, or perhaps even sample-data. Also, sorry for missing that you had mentioned this option.

I very much like that solution (no pun intended). We'd just need to make sure that this doesn't hide a local samples folder that the user might have (which I think is being handled elsewhere).

I don't think there's any risk of that. As I envision it, if your Cantera search path is:

Nonetheless, I think we should make an effort to "deduplicate" some of the files that we have.

If that wasn't sufficiently clear, I am in no way opposed to deduplication.

One implementation issue, then, is how to handle moving sample data files from their current location to the sample-data subdirectory. The best I can come up with at the moment would be to make a check in Application::findInputFile for any of the file names that have moved and issue a deprecation warning.

ischoegl commented 4 years ago

@speth ... looks like an excellent solution (I assume you're not suggesting a hard-coded folder name, i.e. sample-data/foobar.yaml is simply a path relative to search path folders).

bryanwweber commented 4 years ago

@speth I agree with everything you've got there.

@ischoegl I'm not sure what you mean by "hard-coded", but I think the folder will be /path/to/Cantera/install/data/sample-data and lib/python3.X/site-packages/cantera/data/sample-data. All of the examples will need to be edited to use this new path.

In terms of a tactical approach to this, I'd suggest having two pull requests: 1) Remove/move CK input files and deprecate YAML files that will be removed and 2) Move example input files and implement the necessary checks and changes to the examples. Hopefully that will limit the scope of each and make them easier to review...

AdityaSavara commented 3 years ago

Some input of a regular user:

I was trying to follow this example: https://cantera.org/examples/python/reactors/surf_pfr.py.html And I was confused about where to find the yaml file (because it has been a couple years from the last time I tried to do a cantera example). I would find it less confusing if the github repository had a directory called "examples". There could be a readme in that subdirectory indicating that the examples are for syntax and formatting demonstration and not an endorsement of being physically meaningful or physically complete examples.

Somewhere in these help pages, there should also be one (or several) links to where to find the yaml files. https://cantera.org/examples/index.html

speth commented 7 months ago

Given the goal of Cantera/cantera#1681 of converting the Jupyter notebook examples into Python examples in the main repo, there is the need to deal with the input files used by a couple of these examples. Specifically, stirred_reactor.ipynb is a nice example that makes comparisons between experimental and numerical simulation data using a mechanism from the literature which happens to be fairly large. The mechanism file, galway.yaml, has 1268 species and 5336 reactions, and weighs in 1.2 MB, more than all the input files in the main cantera data directory combined.

Beyond just this one case, I think there's value in having more examples that make use of more up-to-date mechanisms than GRI 3.0, but I'd like to keep what's included in the base data directory limited.

One option that comes to mind is to create a separate cantera-sample-data repository and configure it as a submodule in the main repository. By using a submodule, some of the bloat can be avoided by either not fetching the submodule, or by doing just a shallow clone.

We can then make whether and how to install this data a SCons option, and configure it as we like for different packaging routes. For example, if we want to keep the PyPI package light, we can exclude this data (since the examples that need it aren't part of the PyPI package anyway). But for Conda or Ubuntu where there are finer-grained packaging controls, we could build an optional cantera-sample-data package.

ischoegl commented 7 months ago

One option that comes to mind is to create a separate cantera-sample-data repository and configure it as a submodule in the main repository. By using a submodule, some of the bloat can be avoided by either not fetching the submodule, or by doing just a shallow clone.

I like this suggestion quite a bit. Hopefully, we can convince researchers to contribute their own mechanisms so this will become 'community-maintained', although I'm not necessarily having high hopes for that to happen. In an ideal world ... .

There are some community members who have created collections of mechanisms, but I'm not sure that this is an objective as we do not want to (implicitly) recommend some mechanisms over others?

speth commented 7 months ago

I've definitely had some concerns about endorsing the use of specific mechanisms, but the existing CollectionOfMechanisms repository that @jiweiqi created has been around for a while now and doesn't seem to have generated any controversy that I'm aware of. It has also seen at least a few contributions from other researchers.

The suggestion I'm making here, though, is focused on data used in our examples, not a broader repository of potentially-useful mechanisms. However, building an easy-to-install package that would make a larger set of mechanisms available on the Cantera data path could be interesting as well.

bryanwweber commented 7 months ago

and configure it as a submodule in the main repository.

Another option is to use git LFS, although I see you've already set up the new repo so this may be a bit late 😊

speth commented 7 months ago

I wouldn't say it's too late -- there's nothing stopping us from just deleting that repo.

I did consider git LFS, but hesitated due to lack of experience with it and not having a good idea of whether there were any pitfalls to be aware of. If you've used it and found it easy to work with, that would be a good enough reason to try it out here.

rwest commented 7 months ago

I have not found git LFS to be a good solution. More like a necessary evil when you have gigabyte-sized files. The 1.2mb yaml file probably compresses reasonably well (git zips all blobs) and very rarely changes so won't add much bloat to the repo. But a separate repo of models is worth exploring, especially if we want to collect many.

ischoegl commented 7 months ago

I would still be :+1: with the submodule (despite having used git LFS in the past without problems). My question would be whether some of the mechanisms in data should be moved to the submodule? Also, should it be just everything in the root folder (as in the current draft) or should there be some internal structure with comments, etc.?

speth commented 7 months ago

Thanks for the feedback on git LFS, @rwest. While adding this one file certainly wouldn't make a significant difference, what I'm hoping to get out of doing this is to relieve the tension between using a wider variety of reasonably up-to-date mechanisms in the examples and bloating the main repo.

My question would be whether some of the mechanisms in data should be moved to the submodule?

Potentially, but most of the mechanisms in the data folder are also used in some of the unit tests, and I think the test suite should still run without needing anything from this separate repository (which I realize affects the portion of the test suite that consists of running the examples).

Also, should it be just everything in the root folder (as in the current draft) or should there be some internal structure with comments, etc.?

My thought with keeping the structure flat is that it would then let users load the mechanisms as ct.Solution('example_data/whatever_mech.yaml'). Of course the mechanisms should have their description fields filled out. Other than that, I'm not sure how I'd want to organize the files.

ischoegl commented 7 months ago

Potentially, but most of the mechanisms in the data folder are also used in some of the unit tests, and I think the test suite should still run without needing anything from this separate repository (which I realize affects the portion of the test suite that consists of running the examples).

I am not necessarily concerned about unit tests requiring a submodule (which I believe makes sense). The bigger concern I have is packaging: as an example, conda packages distribute Python samples. At the same time, it's pretty simple to copy things over during the installation/packaging process, which is probably the obvious solution here.

speth commented 7 months ago

I am not necessarily concerned about unit tests requiring a submodule (which I believe makes sense).

I see that as a fairly clean place to draw a dividing line. The main repo should contain everything (save external dependencies) needed to compile and test Cantera. This is important for packaging systems such as Ubuntu/Launchpad where the expectation is to build from a tarball, not a Git repository.

If we really want to clarify that some of the input files currently in the main data directory are just there for example purposes, we could move them to test/data (which doesn't get installed) and also copy them into the cantera-example-data repository, where they would then be loaded as example_data/whatever.yaml.

The bigger concern I have is packaging: as an example, conda packages distribute Python samples. At the same time, it's pretty simple to copy things over during the installation/packaging process, which is probably the obvious solution here.

Yes, my original suggestion was to have SCons handle this as an option during installation, though in some cases it may end up falling to the scripts responsible for building each distinct package.

ischoegl commented 7 months ago

If we really want to clarify that some of the input files currently in the main data directory are just there for example purposes, we could move them to test/data

Arguably, it may make sense to create a new samples/data folder, which could be a sub repo if you want to keep it flat. test/data really should be reserved for unit tests. There are several existing mechanisms that fall into this 'examples' category, see for example nDodecane_Reitz.yaml, oxygen-plasma-itikawa.yaml, etc.

AdityaSavara commented 7 months ago

I'm not sure I caught all the details, I've had the flu but wanted to add the following for consideration: (1) If using github, one option would be to use a git ignore. Though a separate repository is probably a better practice. (2) What I've been doing (for other packages) is using a script that is like a wrapper for installation, so that when people run the default script (let's call it "complete") it downloads the examples, too. Then there is an installation script called "minimal" (which the complete one calls upon). (3) The separate repository -- for something as big as cantera, and with the type of usage of cantera -- could follow the model that plumed NEST follows (which is for molecular dynamics). Contributions there require a publication (though apparently arxiv is fine) and metadata https://www.plumed-nest.org/browse.html https://www.plumed-nest.org/eggs/24/004/

So the plumed-nest model is what I would advocate for. In fact, I wanted to make something like that for this field of work even before plumed existed. A separate repository with something simpler than plumed-nest [a simpler contribution form without more than regex validations] could be a foundation for something like plumed-nest.