Unidata / UDUNITS-2

API and utility for arithmetic manipulation of units of physical quantities
http://www.unidata.ucar.edu/software/udunits

Feature Request: Option to not need external xml #71

Open schwehr opened 6 years ago

schwehr commented 6 years ago

(sorry for the flood of new issues... trying to get them all doc'ed as I work through the code)

I don't have local filesystem available when running in a production container, so I won't have access to the xml config files. I hacked my build to always pull from an in-memory copy of the file data, but it's sadly proprietary code. It would be great to have a public version that anyone can use. An alternative approach that might be even better would be to create compilable data structures that contain the results of parsing the xml. That would allow folks who don't need/don't want/or can't use the xml data files to completely drop the need for expat and the code in xml.c.
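For illustration, here is a minimal sketch of what such an in-memory catalog could look like. It is not the proprietary implementation: the `EmbeddedFile` struct, the `kCatalog` table, and the placeholder XML string are assumptions, and only the `UdunitsGetFileContents` name and call shape are taken from the patch in this comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One embedded file: its name and raw bytes (hypothetical layout). */
typedef struct {
    const char *name;
    const char *data;
    size_t      size;
} EmbeddedFile;

/* Placeholder contents. A real build would generate these arrays from the
 * shipped XML files at build time. */
static const char kPrefixesXml[] =
    "<?xml version=\"1.0\" encoding=\"US-ASCII\"?>\n"
    "<unit-system/>\n";

static const EmbeddedFile kCatalog[] = {
    { "udunits2-prefixes.xml", kPrefixesXml, sizeof(kPrefixesXml) - 1 },
};

/* Returns the embedded bytes for `path` and stores their length in *size.
 * Returns NULL and sets *size to 0 if the catalog has no such entry. */
const char *UdunitsGetFileContents(const char *path, size_t *size)
{
    assert(path != NULL && size != NULL);
    for (size_t i = 0; i < sizeof(kCatalog) / sizeof(kCatalog[0]); ++i) {
        if (strcmp(kCatalog[i].name, path) == 0) {
            *size = kCatalog[i].size;
            return kCatalog[i].data;
        }
    }
    *size = 0;
    return NULL;
}
```

A linear scan is enough here because the catalog holds only a handful of files.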

My alterations to xml.c:

static ut_status readXmlWithParser(XML_Parser parser, const char* const  path)
{
    // Always pull the XML files out of an embedded data catalog.
    size_t size = 0;
    const char *buf = UdunitsGetFileContents(path, &size);
    if (buf == NULL) {
      const ut_status status = UT_PARSE;
      ut_set_status(status);
      ut_handle_error_message("file not found");  // Should do a better error message
      return status;
    }

    File* const prevFile = currFile;

    File file;
    fileInit(&file);
    file.path = path;
    file.parser = parser;
    currFile = &file;

    const int is_final = 1;
    ut_status status = UT_SUCCESS;
    if (XML_Parse(parser, buf, (int)size, is_final) != XML_STATUS_OK) {
      status = UT_PARSE;
      ut_set_status(status);
      ut_handle_error_message(XML_ErrorString(XML_GetErrorCode(parser)));
    }

    // Restore the previous file on both paths: `file` lives on this stack
    // frame, so currFile must not keep pointing at it after we return.
    currFile = prevFile;

    return status;
}

and

static const char*
default_udunits2_xml_path()
{
  static const char kPath[] = "udunits2.xml";
  return kPath;
}
semmerson commented 6 years ago

@schwehr Interesting! We created an external file because our clients wanted to be able to easily modify the database. Now you want the opposite.

You must be doing embedded stuff.

I'm open to suggestions on how to turn the internal, binary unit database into a ".o" file, preferably without having to learn and understand a ".o" format (actually, that's pretty much a requirement, as I just don't have the time).

schwehr commented 6 years ago

Not embedded stuff. Containers (à la Docker) don't always have any local writable disk. I'm working on the Google (non-public) cloud. I didn't mean anything about .o formats. I was thinking that there could be a program that uses xml.c to parse the XML and then emits C code for the resulting structure. Since udunits already has code to read the XML format and build the data structure, we just need something that can walk the data structure and write out an initializer in C.
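A rough sketch of the "emit an initializer" half of that idea, under stated assumptions: `PrefixEntry` is a hypothetical flattened record, not a udunits2 type, and walking the real udunits2 structures to fill such an array is not shown.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flattened record for one prefix; not a udunits2 type. */
typedef struct {
    const char *symbol;
    double      value;
} PrefixEntry;

/* Writes a C array initializer for `entries` to `out`.
 * Returns 0 on success, -1 on a write error or on a symbol this sketch
 * cannot emit safely. Output written before a failure is left in place. */
int emit_prefix_table(FILE *out, const PrefixEntry *entries, size_t count)
{
    assert(out != NULL && (entries != NULL || count == 0));
    if (fprintf(out, "static const PrefixEntry kPrefixes[] = {\n") < 0)
        return -1;
    for (size_t i = 0; i < count; ++i) {
        /* No string escaping is implemented, so reject quotes and
         * backslashes rather than emit broken C. */
        if (strpbrk(entries[i].symbol, "\"\\") != NULL)
            return -1;
        /* 17 significant digits are enough to round-trip a double. */
        if (fprintf(out, "    { \"%s\", %.17g },\n",
                    entries[i].symbol, entries[i].value) < 0)
            return -1;
    }
    return fprintf(out, "};\n") < 0 ? -1 : 0;
}
```

The generated file would then be compiled into the library in place of the XML-reading path.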

semmerson commented 6 years ago

@schwehr I see. I'm afraid I'll have to leave that as an exercise for the reader as I just don't have the time.

schwehr commented 6 years ago

Please don't close the bug unless you are against having this option. I was going to try to see if I could pull off this sometime soon.

semmerson commented 6 years ago

@schwehr Does closing it affect your development environment?

Re-opened.

schwehr commented 6 years ago

It doesn't change my development env, but closing a bug is a signal that the feature is done and merged into master or a strong signal that a feature has been rejected by the project.

schwehr commented 6 years ago

I tried to take a look at how I might do this, but I ran out of time. Here are some notes. I'm not sure I really follow what's going on. Please correct anything that I have wrong.

It looks like the core data structure of udunits is a set of binary search trees, based on algorithms T and D from Knuth (6.2.2) as implemented in:

tdelete, tfind, tsearch, twalk -- manipulate binary search trees

I believe these are stored in static variables in unitToIdMap.c:

static SystemMap*   systemToUnitToName = NULL;
static SystemMap*   systemToUnitToSymbol = NULL;

tfind and friends use void*. The binary tree nodes consist of ut_unit from unitcore.c:

union ut_unit {
  Common common;
  BasicUnit basic;
  ProductUnit product;
  GalileanUnit galilean;
  TimestampUnit timestamp;
  LogUnit log;
};

If I can make a series of static lists of Common, BasicUnit, ProductUnit, GalileanUnit, TimestampUnit, and LogUnit, I then have to load those into the binary tree.
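Loading a static list into a tsearch tree could look roughly like the sketch below. `UnitRecord`, `kUnits`, `load_units`, and `find_unit` are hypothetical stand-ins for the real udunits2 types and maps; only `tsearch` and `tfind` from `<search.h>` are real calls.

```c
#include <assert.h>
#include <search.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record standing in for the ut_unit variants. */
typedef struct {
    const char *name;
    double      scale;
} UnitRecord;

static const UnitRecord kUnits[] = {
    { "meter",     1.0    },
    { "kilometer", 1000.0 },
    { "gram",      0.001  },
};

static int compare_by_name(const void *a, const void *b)
{
    return strcmp(((const UnitRecord *)a)->name,
                  ((const UnitRecord *)b)->name);
}

/* Inserts every kUnits entry into the tree rooted at *root.
 * Returns 0 on success, -1 if tsearch could not allocate a node. */
int load_units(void **root)
{
    assert(root != NULL);
    for (size_t i = 0; i < sizeof(kUnits) / sizeof(kUnits[0]); ++i) {
        if (tsearch(&kUnits[i], root, compare_by_name) == NULL)
            return -1;
    }
    return 0;
}

/* Looks a record up by name; returns NULL if it is not in the tree. */
const UnitRecord *find_unit(void *const *root, const char *name)
{
    UnitRecord key = { name, 0.0 };
    void *node = tfind(&key, root, compare_by_name);
    /* tfind returns a pointer to the stored key pointer, not the key. */
    return node != NULL ? *(const UnitRecord *const *)node : NULL;
}
```

The tree stores pointers into the static array, so nothing is copied; the sketch does not free the tree nodes.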

I'm having trouble seeing in xml.c where/how all of the units are inserted into the tree.

Next, I tried looking at the coverage from this googletest case:

#include "testing/base/public/gunit.h"

#include "third_party/udunits/lib/udunits2.h"

namespace {

TEST(XmlTest, UtReadXml) {
  ut_system *unit_system = ut_read_xml(nullptr);
  ASSERT_NE(nullptr, unit_system);
  ut_free_system(unit_system);
}

}  // namespace

That covers 44% of the udunits code base, so it's not particularly useful for narrowing things down. Modifying udunits2.xml to narrow the scope helps a bit. For example, ut_add_symbol_prefix is the key for catching which prefixes are added:

static void endSymbol(void *data) {
    if (currFile->context == PREFIX) {
        if (ut_add_symbol_prefix(unitSystem, text, currFile->value)

<?xml version="1.0" encoding="US-ASCII"?>
<unit-system>
    <import>udunits2-prefixes.xml</import>
    <!-- <import>udunits2-base.xml</import> -->
    <!-- <import>udunits2-derived.xml</import> -->
    <!-- <import>udunits2-accepted.xml</import> -->
    <!-- <import>udunits2-common.xml</import> -->
</unit-system>

Then I instrumented the code to emit the calls it would make, like this. I wouldn't really want to do it this way. It could be wrapped in a #ifdef UDUNITS_EMIT_CODE_FROM_XML, or use a global flag and do if(emitCodeFromXml) ..., but that seems ugly.

ut_status ut_add_symbol_prefix(ut_system* const system, const char* const symbol, const double  value) {
  fprintf(stderr, "ut_add_symbol_prefix(system, \"%s\", %lg);\n", symbol, value);
  ut_set_status(addPrefix(system, symbol, value, &systemToSymbolToValue, pseSensitiveCompare));
  return ut_get_status();
}

Yields:

ut_add_symbol_prefix(system, "Y", 1e+24);
ut_add_symbol_prefix(system, "Z", 1e+21);
ut_add_symbol_prefix(system, "E", 1e+18);
ut_add_symbol_prefix(system, "P", 1e+15);
ut_add_symbol_prefix(system, "T", 1e+12);
ut_add_symbol_prefix(system, "G", 1e+09);
ut_add_symbol_prefix(system, "M", 1e+06);
ut_add_symbol_prefix(system, "k", 1000);
ut_add_symbol_prefix(system, "h", 100);
ut_add_symbol_prefix(system, "da", 10);
ut_add_symbol_prefix(system, "d", 0.1);
ut_add_symbol_prefix(system, "c", 0.01);
ut_add_symbol_prefix(system, "m", 0.001);
ut_add_symbol_prefix(system, "µ", 1e-06);
ut_add_symbol_prefix(system, "μ", 1e-06);
ut_add_symbol_prefix(system, "u", 1e-06);
ut_add_symbol_prefix(system, "n", 1e-09);
ut_add_symbol_prefix(system, "p", 1e-12);
ut_add_symbol_prefix(system, "f", 1e-15);
ut_add_symbol_prefix(system, "a", 1e-18);
ut_add_symbol_prefix(system, "z", 1e-21);
ut_add_symbol_prefix(system, "y", 1e-24);
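The logged calls above could be turned into a static table and replayed at startup instead of parsing XML. A minimal sketch, assuming a hypothetical `SymbolPrefix` table and an `AddPrefixFn` callback so the example compiles without udunits2; in a real build the callback would wrap `ut_add_symbol_prefix(system, symbol, value)` and return nonzero unless the result is `UT_SUCCESS`. The table is an excerpt of the values logged above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table row; values copied from the log output. */
typedef struct {
    const char *symbol;
    double      value;
} SymbolPrefix;

static const SymbolPrefix kSymbolPrefixes[] = {
    { "k", 1000 }, { "h", 100 },  { "da", 10 },
    { "d", 0.1 },  { "c", 0.01 }, { "m", 0.001 },
};

/* Callback that registers one prefix; returns 0 on success. */
typedef int (*AddPrefixFn)(void *ctx, const char *symbol, double value);

/* Replays the table through `add`, stopping at the first failure.
 * Returns the number of prefixes registered successfully. */
size_t register_symbol_prefixes(AddPrefixFn add, void *ctx)
{
    assert(add != NULL);
    const size_t count = sizeof(kSymbolPrefixes) / sizeof(kSymbolPrefixes[0]);
    size_t done = 0;
    while (done < count &&
           add(ctx, kSymbolPrefixes[done].symbol,
               kSymbolPrefixes[done].value) == 0) {
        ++done;
    }
    return done;
}

/* Stand-in callback for demonstration: counts calls instead of touching
 * a real unit system. */
typedef struct { size_t count; } PrefixLog;

int record_prefix(void *ctx, const char *symbol, double value)
{
    (void)symbol;
    (void)value;
    ((PrefixLog *)ctx)->count++;
    return 0;
}
```

Calling `register_symbol_prefixes(record_prefix, &log)` with a zeroed `PrefixLog` walks the whole table; swapping in a udunits2-backed callback would populate a real system the same way.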

My next thought came from looking at the test below. I think I can just log all the calls to functions in idToUnitMap.c that match the pattern ut_map*:

static void test_utNewBaseUnit(void)
{
    kilogram = ut_new_base_unit(unitSystem);
    CU_ASSERT_PTR_NOT_NULL(kilogram);
    CU_ASSERT_EQUAL(ut_map_unit_to_name(kilogram, "kilogram", UT_ASCII), UT_SUCCESS);
    CU_ASSERT_EQUAL(ut_map_unit_to_symbol(kilogram, "kg", UT_ASCII), UT_SUCCESS);
    CU_ASSERT_EQUAL(ut_map_symbol_to_unit("kg", UT_ASCII, kilogram), UT_SUCCESS);
cesss commented 5 years ago

I would also be very interested in the possibility of building the UDUNITS2 library in a monolithic way, without XML files, for two reasons:

  1. Depending on expat is a drawback (I'd rather use a simpler, single-file parser).
  2. Using UDUNITS2 from an application is easier if you just need to link with the library and don't have to worry about including the XML files in the distribution.

I understand the wish to extend the database, but in some cases it's not needed (or even not desired: an application could break if end users remove critical units that the app depends on, or if they add units the app doesn't expect).

I also understand the choice of XML for its power and flexibility, but it's a heavy format, and expat is a "beast". I'd prefer a very simple syntax, JSON or similar, that can be read with a single-file parser.
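To give a sense of scale, a reader for a very simple line-based format fits in a few lines of C. This is only a sketch of a hypothetical "symbol value" syntax (for example `k 1e3`); neither the format nor `parse_prefix_line` exists in UDUNITS2.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parses one line of a hypothetical "symbol value" format, e.g. "k 1e3".
 * Copies the symbol into `symbol` (capacity `cap`, including the
 * terminator) and stores the number in *value.
 * Returns 0 on success, -1 on malformed input. */
int parse_prefix_line(const char *line, char *symbol, size_t cap,
                      double *value)
{
    assert(line != NULL && symbol != NULL && value != NULL && cap > 0);

    /* The symbol is everything up to the first blank. */
    size_t len = strcspn(line, " \t");
    if (len == 0 || len >= cap || line[len] == '\0')
        return -1;
    memcpy(symbol, line, len);
    symbol[len] = '\0';

    /* strtod skips leading blanks and reports where it stopped. */
    const char *num = line + len;
    char *end = NULL;
    *value = strtod(num, &end);
    if (end == num)
        return -1;              /* nothing numeric was consumed */

    /* Only trailing whitespace may follow the number. */
    end += strspn(end, " \t\r\n");
    return *end == '\0' ? 0 : -1;
}
```

A format this small cannot express everything the XML database does (aliases, definitions in terms of other units), so it would cover only the simplest tables.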