ShabbyX / libpandoc

C bindings to Pandoc, a markup converter library written in Haskell.
88 stars 16 forks source link

Multiple consecutive `pandoc_init` and `pandoc_exit` #9

Closed Phyks closed 7 years ago

Phyks commented 7 years ago

Hi,

I may have missed something, but I would have expected something like this to be working

    for (int i = 0; i < 2; ++i) {
        pandoc_init();
        pandoc(BUFFER_LENGTH, "markdown", "plain", NULL, &reader, &writer, NULL);
        pandoc_exit();
    }

That is, to convert multiple strings, I could just pandoc_init, convert, and pandoc_exit and then do all of this again.

But it does not seem to work, and I get a segfault. I am not sure exactly why this is happening, it seems to be coming from the haskell platform.

Thanks!

ShabbyX commented 7 years ago

I would have expected this to work as well! I'll give it a try in the weekend and let you know.

ShabbyX commented 7 years ago

In the meantime, why not do it like this?

pandoc_init();

for (int i = 0; i < 2; ++i)
    pandoc(BUFFER_LENGTH, "markdown", "plain", NULL, &reader, &writer, NULL);

pandoc_exit();

There shouldn't be any reason why you would need to re-initialize the Haskell runtime.

Phyks commented 7 years ago

Actually, the problem is a bit more subtle than this :)

I am willing to use Pandoc to convert LaTeX output to plain text. The overhead of calling Pandoc through a subprocess for every string to be converted is too much (there are around 20M such strings to be converted).

I was wanting to wrap the libpandoc in Python, to call it directly without spawning a subprocess for Pandoc. Hence, I was trying to evaluate the speed of the conversion (hence the for loop), but my strings will not be all accessible at the same time.

I gave a try to your example but it does not work, I guess because once the reader returned 0 it will never be called again (your code would work if I could make a super string by concatenating all the strings I have to convert I guess).

ShabbyX commented 7 years ago

Initializing and cleaning up the Haskell runtime 20M times is definitely too much too, keep that in mind.

The problem you mentioned, with the reader never getting called again, sounds strange though. Do you perhaps mean that once your reader returns 0, it always returns 0 in the future calls? If so, then the problem is the fact that you rely on global variables and you don't reset them. Preferably, you would want to avoid global variables altogether and send a context.

pandoc_init();

for (int i = 0; i < 2; ++i)
{
    struct your_context context = {
        .src = your_str,
        .src_len = your_len,
        .dst = destination,
        .dst_len = length_to_avoid_overflow,
        .read_so_far = 0,
        .written_so_far = 0,
    };

    pandoc(BUFFER_LENGTH, "markdown", "plain", NULL, &reader, &writer, &context);
}

pandoc_exit();

You would need to modify the reader and writer functions to use the context. Essentially, they will keep their state in the context instead of global variables. The context is recreated or reset before each pandoc call (in this case, recreated), so the state is reset and the reader will give input to pandoc again.

In my example above, read_so_far tells how far the reader has consumed src. Once read_so_far reaches src_len, the reader is done and returns 0. After the call to pandoc returns, read_so_far is reset to 0 and the reader would produce data for the next pandoc call.

If this still doesn't work, then it's definitely a bug in libpandoc. After all, this is Haskell and the pandoc call should not be keeping any global state!

Phyks commented 7 years ago

Indeed, I should only init it once. I'll have a deeper look at your example, I guess this should be working.

By the way, concerning the original issue, I just came accross this:

hs_init() not allowed after hs_exit()

The FFI spec requires the implementation to support re-initialising itself after being shut down with hs_exit(), but GHC does not currently support that.

(from 14.1.1.8 https://downloads.haskell.org/~ghc/7.8.2/docs/html/users_guide/bugs-and-infelicities.html#infelicities-ffi)

EDIT: Ok, everything is working fine. Current code is:

#include <stdlib.h>
#include <string.h>

#include <pandoc.h>

int done = 0;
const char test[] = "$\\textit{K}$ -trivial, $\\textit{K}$ -low and ${{\\mathrm{\\textit{MLR}}}}$ -low Sequences: A Tutorial";
const int BUFFER_LENGTH = 1024;

int reader(char *buf, void *user_data) {
    if (done) {
        return 0;
    }
    strncpy(buf, test, BUFFER_LENGTH);
    done = 1;
    return strlen(test);
}

void writer(const char *buf, int len, void *user_data) {
    fwrite(buf, 1, len, stdout);
}

int main() {
    pandoc_init();

    for (int i = 0; i < 1000; ++i)
    {
        pandoc(BUFFER_LENGTH, "latex", "plain", NULL, &reader, &writer, NULL);
        printf("\n");
        done = 0;
        printf("%d\n", i);
    }

    pandoc_exit();
}

I should now test performances for my use case. Thanks a lot for the help! Mentioning the init / exit subtlety in the README would be really useful I guess.

EDIT2: Just gave a try, 100k conversions with the above script takes about 100s. So one should count on 1ms per conversion (for a string as in my previous example). Sounds reasonable :)

ShabbyX commented 7 years ago

@Phyks, I didn't see your edit, good to know you got it working. I also came up with an example (which uses a context instead of global variables):

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <pandoc.h>

#define BUF_SIZE 1024

struct data
{
    const char *src;
    size_t src_len;
    size_t src_cur;

    char *dst;
    size_t dst_len;
    size_t dst_cur;
};

int reader(char *buf, void *user_data)
{
    struct data *d = user_data;

    /* See how much is left to read */
    size_t to_write = d->src_len - d->src_cur;

    /* If finished, tell that to pandoc */
    if (to_write == 0)
        return 0;

    /* Make sure not to overflow the buffer */
    if (to_write > BUF_SIZE)
        to_write = BUF_SIZE;

    /* Copy this chunk of data */
    memcpy(buf, d->src + d->src_cur, to_write * sizeof *buf);
    d->src_cur += to_write;

    return to_write; 
}

void writer(const char *buf, int len, void *user_data)
{
    struct data *d = user_data;

    /* If current buffer is too small, increase its size */
    if (d->dst_cur + len > d->dst_len)
    {
        d->dst_len = d->dst_len == 0 ? 16 : d->dst_len * 2;
        if (d->dst_len < len)
            d->dst_len = len;

        char *new_dst = realloc(d->dst, d->dst_len * sizeof *d->dst);
        assert(new_dst != NULL);

        d->dst = new_dst;
    }

    /* Append what pandoc has produced */
    memcpy(d->dst + d->dst_cur, buf, len * sizeof *buf);
    d->dst_cur += len;
}

int main()
{
    const char *settings =
        "{\n"
            "\"writerOptions\": {\n"
                "\"writerStandalone\": false\n"
            "}\n"
        "}";
    const char *strings[] = {
        "Some _text_ here",
        "More stuff to **convert** with `pandoc`",
        "See [here](https://github.com/ShabbyX/libpandoc) for info",
    };
    const char *targets[] = {
        "html",
        "latex",
    };

    pandoc_init();

    for (size_t s = 0; s < sizeof strings / sizeof *strings; ++s)
    {
        printf("Converting: %s\n", strings[s]);

        for (size_t t = 0; t < sizeof targets / sizeof *targets; ++t)
        {
            struct data d = {
                .src = strings[s],
                .src_len = strlen(strings[s]),

                /* Note: everything else is 0 initalized */
            };

            printf(" To %s:\n", targets[t]);

            char *error = pandoc(BUF_SIZE,
                    "markdown",
                    targets[t],
                    settings,
                    reader,
                    writer,
                    &d);

            if (error)
                printf("  Error: %s\n", error);
            else
            {
                printf("  ");
                fwrite(d.dst, sizeof *d.dst, d.dst_cur, stdout);
                printf("\n");
            }

            free(d.dst);
            free(error);
        }

        printf("\n");
    }

    pandoc_exit();

    return 0;
}

The output is:

Converting: Some _text_ here
 To html:
  <p>Some <em>text</em> here</p>
 To latex:
  Some \emph{text} here

Converting: More stuff to **convert** with `pandoc`
 To html:
  <p>More stuff to <strong>convert</strong> with <code>pandoc</code></p>
 To latex:
  More stuff to \textbf{convert} with \texttt{pandoc}

Converting: See [here](https://github.com/ShabbyX/libpandoc) for info
 To html:
  <p>See <a href="https://github.com/ShabbyX/libpandoc">here</a> for info</p>
 To latex:
  See \href{https://github.com/ShabbyX/libpandoc}{here} for info

By the way, if you want to measure performance, make sure to exclude pandoc_init() and pandoc_exit() from it, as that's quite slow.

Phyks commented 7 years ago

Thanks! -- Phyks

Le 1 avril 2017 23:08:32 UTC+02:00, Shahbaz Youssefi notifications@github.com a écrit :

@Phyks, I didn't see your edit, good to know you got it working. I also came up with an example (which uses a context instead of global variables):

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <pandoc.h>

#define BUF_SIZE 1024

struct data
{
  const char *src;
  size_t src_len;
  size_t src_cur;

  char *dst;
  size_t dst_len;
  size_t dst_cur;
};

int reader(char *buf, void *user_data)
{
  struct data *d = user_data;

  /* See how much is left to read */
  size_t to_write = d->src_len - d->src_cur;

  /* If finished, tell that to pandoc */
  if (to_write == 0)
      return 0;

  /* Make sure not to overflow the buffer */
  if (to_write > BUF_SIZE)
      to_write = BUF_SIZE;

  /* Copy this chunk of data */
  memcpy(buf, d->src + d->src_cur, to_write * sizeof *buf);
  d->src_cur += to_write;

  return to_write; 
}

void writer(const char *buf, int len, void *user_data)
{
  struct data *d = user_data;

  /* If current buffer is too small, increase its size */
  if (d->dst_cur + len > d->dst_len)
  {
      d->dst_len = d->dst_len == 0 ? 16 : d->dst_len * 2;

      char *new_dst = realloc(d->dst, d->dst_len * sizeof *d->dst);
      assert(new_dst != NULL);

      d->dst = new_dst;
  }

  /* Append what pandoc has produced */
  memcpy(d->dst + d->dst_cur, buf, len * sizeof *buf);
  d->dst_cur += len;
}

int main()
{
  const char *settings =
      "{\n"
          "\"writerOptions\": {\n"
              "\"writerStandalone\": false\n"
          "}\n"
      "}";
  const char *strings[] = {
      "Some _text_ here",
      "More stuff to **convert** with `pandoc`",
      "See [here](https://github.com/ShabbyX/libpandoc) for info",
  };
  const char *targets[] = {
      "html",
      "latex",
  };

  pandoc_init();

  for (size_t s = 0; s < sizeof strings / sizeof *strings; ++s)
  {
      printf("Converting: %s\n", strings[s]);

      for (size_t t = 0; t < sizeof targets / sizeof *targets; ++t)
      {
          struct data d = {
              .src = strings[s],
              .src_len = strlen(strings[s]),

              /* Note: everything else is 0 initalized */
          };

          printf(" To %s:\n", targets[t]);

          char *error = pandoc(BUF_SIZE,
                  "markdown",
                  targets[t],
                  settings,
                  reader,
                  writer,
                  &d);

          if (error)
              printf("  Error: %s\n", error);
          else
          {
              printf("  ");
              fwrite(d.dst, sizeof *d.dst, d.dst_cur, stdout);
              printf("\n");
          }

          free(error);
      }

      printf("\n");
  }

  pandoc_exit();

  return 0;
}

By the way, if you want to measure performance, make sure to exclude pandoc_init() and pandoc_exit() from it, as that's quite slow.