In quda_work branch, parse_quda_mg_dbl_par_array is unable to parse scientific notation

kostrzewa commented 4 years ago

Wanted to enlist some help with solving a particularly annoying issue that has come up with the development of more parsing functionality for the tmLQCD QUDA interface.

Per MG-level parameters, which are specified as something like

MGCoarseSolverTolerance = 0.25, 0.25, 0.45

are now set using parse_quda_mg_XXX_par_array utility functions which nicely encapsulate the tokenization of the comma-separated list. However, for parsing lists of doubles (or integers), I've encountered the problem that the tokenizer does not want to work for scientific notation, splitting something like 1e-4 into 1e and 4.

See code and utility functions below. Any idea on how this could be resolved? This becomes quite annoying when specifying the setup solver tolerance, which is between 1e-8 and 1e-6, but also for the MGEigSolver functionality, small tolerances are required.

static inline char * strlist_tokenize(char const * const input, 
                                      char * const paramname,
                                      const int paramname_length)
{
  char error_message[200];
  char * token = (char*)NULL;

  const int input_length = strlen(input) + 1;
  if( input_length > 20000 ){
    yy_fatal_error("Likely overflow bug in 'strlist_tokenize' of 'read_input.l'.\n"
                   "Encountered string of length exceeding 20k characters!\n");
  }
  char * input_copy = malloc(input_length);
  if( input_copy == NULL ){
    yy_fatal_error("Could not allocate memory for 'input_copy' in 'strlist_tokenize' of 'read_input.l'!\n");
  }
  strncpy(input_copy, input, input_length);

  // the first token is the parameter name, we will return it, 
  // but first check if tokenization was successful
  token = strtok(input_copy, "\n\t =,\\");
  if( (void*)token == NULL ){
    snprintf(error_message, 200, "Unable to parse '%s' in strlist_tokenize!\n", paramname);
    yy_fatal_error(error_message);
  } else {
    // return what we parsed
    snprintf(paramname, paramname_length, "%s", token);
  }
  return input_copy;
}

/* obtain the next token in the list of comma-separated strings
   'list_end' is a parameter which is set to 1 if the end of the list is reached
   'token' will hold the value of the token
   'token_length' is the length of the array to which the token should be written */

static inline void strlist_next_token(int * const list_end, 
                                      char * const token_out,
                                      const int token_length)
{
  char * token = (char*)NULL;
  token = strtok(NULL," =,\t");
  if( (void*)token == NULL ){
    *list_end = 1;
  } else {
    *list_end = 0;
    snprintf(token_out, token_length, "%s", token);
  }
}

static inline double fltlist_next_token(int * const list_end){
  double retval;
  char * token = (char*)NULL;
  token = strtok(NULL," ,=\t");
  if( (void*)token == NULL ){
    *list_end = 1;
    return(0.0);
  } else {
    *list_end = 0;
    sscanf(token,"%lf",&retval);
    return(retval);
  }
}

static inline void parse_quda_mg_dbl_par_array(char const * const input,
                                               double * par_array)
{
  char paramname[100];
  char error_message[200];
  int list_end = 0; 
  int level = 0; 

  char * input_copy = strlist_tokenize(input, paramname, 100);
  double parval = fltlist_next_token(&list_end);
  while( list_end != 1 ){ 
    if( level >= QUDA_MAX_MG_LEVEL ){
      snprintf(error_message, 200, 
               "Exceeded maximum number of levels (%d)"
                 " parsing %s!\n", QUDA_MAX_MG_LEVEL,
              paramname);
      yy_fatal_error(error_message);
    }

    par_array[level] = parval;
    if(myverbose){ 
      printf("  %s, level %d set to %f line %d\n", paramname,
              level, par_array[level], line_of_file);
    }

    level++;
    parval = fltlist_next_token(&list_end);
  }
  free(input_copy);
}

kostrzewa commented 4 years ago

@marcuspetschlies I just noticed that we had you in the ETMC organisation with the wrong user name...

martin-ueding commented 4 years ago

So you basically want to have this in C?

def parse(string):
    return list(map(float, string.split(', ')))

parse <- function (string) {
    as.numeric(str_split(string, ', '))
}

kostrzewa commented 4 years ago

It's already in there, the question is why strtok tokenizes on the 'e'...

martin-ueding commented 4 years ago

What does token = strtok(NULL, " =,\t"); do? This looks rather fishy.

martin-ueding commented 4 years ago

Oh. Documentation

Alternativelly, a null pointer may be specified, in which case the function continues scanning where a previous successful call to the function ended.

What a nice API …

kostrzewa commented 4 years ago

What a nice API …

Sure, C API is nasty for historical/efficiency reasons, but it's what we're stuck with in this case.

martin-ueding commented 4 years ago

I would think that the error is something else. Using a minimal example I can happily split such an input string.

#include <stdio.h>
#include <string.h>

// The `strtok` function will insert `\0` terminators into the given string.
// This function will still print the whole string as we know the length,
// emitting a `\0` when it finds a `\0`.
void force_print(char const *const str,
                 int const len,
                 char const *const message) {
  printf("\n%s\n  ", message);
  for (int i = 0; i < len; ++i) {
    if (str[i] == '\0') {
      printf("\\0");
    } else {
      printf("%c", str[i]);
    }
  }
  printf("\n");
}

int main() {
  // We must have a mutable string for `strtok` to work.
  char input[] = "residual = 1e-3, 3.2e-4, 9.8e-12";
  int const len = strlen(input);
  force_print(input, len, "Our original string:");

  // We initialize the tokenization here with a copied pointer such that we can
  // still print out the whole original string.
  char *token;
  char *it = input;
  token = strtok(it, " =,\t");
  force_print(input, len, "After initial strtok:");
  printf("Token: `%s`\n", token);

  do {
    token = strtok(NULL, " =,\t");
    force_print(input, len, "After follow-up strtok:");
    printf("Token: `%s`\n", token);
  } while (token != NULL);

  return 0;
}

The output seems to be just what we want:

$ gcc -Wall -Wpedantic -fsanitize=address toc_test.c; and ./a.out

Our original string:
  residual = 1e-3, 3.2e-4, 9.8e-12

After initial strtok:
  residual\0= 1e-3, 3.2e-4, 9.8e-12
Token: `residual`

After follow-up strtok:
  residual\0= 1e-3\0 3.2e-4, 9.8e-12
Token: `1e-3`

After follow-up strtok:
  residual\0= 1e-3\0 3.2e-4\0 9.8e-12
Token: `3.2e-4`

After follow-up strtok:
  residual\0= 1e-3\0 3.2e-4\0 9.8e-12
Token: `9.8e-12`

After follow-up strtok:
  residual\0= 1e-3\0 3.2e-4\0 9.8e-12
Token: `(null)`

So I would say that the strtok function works as it should and therefore there must be something wrong somewhere else in the logic.

kostrzewa commented 4 years ago

Thanks for the MWE, I agree that the problem must lie somewhere else then, although I also printed the tokens and it was splitting after the 'e'. Perhaps flex is mangling the input string somehow.

kostrzewa commented 4 years ago

Alright, I got the bugger. The pattern for STRLIST was not defined correctly, causing flex to break the input string at the e for reasons that I don't fully understand. Thanks @martin-ueding

etmc / tmLQCD

In quda_work branch, parse_quda_mg_dbl_par_array is unable to parse scientific notation #471