Closed Phyks closed 7 years ago
I would have expected this to work as well! I'll give it a try in the weekend and let you know.
In the meantime, why not do it like this?
pandoc_init();
for (int i = 0; i < 2; ++i)
pandoc(BUFFER_LENGTH, "markdown", "plain", NULL, &reader, &writer, NULL);
pandoc_exit();
There shouldn't be any reason why you would need to re-initialize the Haskell runtime.
Actually, the problem is a bit more subtle than this :)
I am willing to use Pandoc to convert LaTeX output to plain text. The overhead of calling Pandoc through a subprocess for every string to be converted is too much (there are around 20M such strings to be converted).
I was wanting to wrap the libpandoc
in Python, to call it directly without spawning a subprocess for Pandoc. Hence, I was trying to evaluate the speed of the conversion (hence the for
loop), but my strings will not be all accessible at the same time.
I gave a try to your example but it does not work, I guess because once the reader
returned 0
it will never be called again (your code would work if I could make a super string by concatenating all the strings I have to convert I guess).
Initializing and cleaning up the Haskell runtime 20M times is definitely too much too, keep that in mind.
The problem you mentioned, with the reader
never getting called again, sounds strange though. Do you perhaps mean that once your reader
returns 0, it always returns 0 in the future calls? If so, then the problem is the fact that you rely on global variables and you don't reset them. Preferably, you would want to avoid global variables altogether and send a context.
pandoc_init();
for (int i = 0; i < 2; ++i)
{
struct your_context context = {
.src = your_str,
.src_len = your_len,
.dst = destination,
.dst_len = length_to_avoid_overflow,
.read_so_far = 0,
.written_so_far = 0,
};
pandoc(BUFFER_LENGTH, "markdown", "plain", NULL, &reader, &writer, &context);
}
pandoc_exit();
You would need to modify the reader
and writer
functions to use the context. Essentially, they will keep their state in the context instead of global variables. The context is recreated or reset before each pandoc
call (in this case, recreated), so the state is reset and the reader
will give input to pandoc again.
In my example above, read_so_far
tells how far the reader
has consumed src
. Once read_so_far
reaches src_len
, the reader
is done and returns 0. After the call to pandoc
returns, read_so_far
is reset to 0 and the reader
would produce data for the next pandoc
call.
If this still doesn't work, then it's definitely a bug in libpandoc. After all, this is Haskell and the pandoc
call should not be keeping any global state!
Indeed, I should only init it once. I'll have a deeper look at your example, I guess this should be working.
By the way, concerning the original issue, I just came accross this:
hs_init() not allowed after hs_exit()
The FFI spec requires the implementation to support re-initialising itself after being shut down with hs_exit(), but GHC does not currently support that.
(from 14.1.1.8 https://downloads.haskell.org/~ghc/7.8.2/docs/html/users_guide/bugs-and-infelicities.html#infelicities-ffi)
EDIT: Ok, everything is working fine. Current code is:
#include <stdlib.h>
#include <string.h>
#include <pandoc.h>
int done = 0;
const char test[] = "$\\textit{K}$ -trivial, $\\textit{K}$ -low and ${{\\mathrm{\\textit{MLR}}}}$ -low Sequences: A Tutorial";
const int BUFFER_LENGTH = 1024;
int reader(char *buf, void *user_data) {
if (done) {
return 0;
}
strncpy(buf, test, BUFFER_LENGTH);
done = 1;
return strlen(test);
}
void writer(const char *buf, int len, void *user_data) {
fwrite(buf, 1, len, stdout);
}
int main() {
pandoc_init();
for (int i = 0; i < 1000; ++i)
{
pandoc(BUFFER_LENGTH, "latex", "plain", NULL, &reader, &writer, NULL);
printf("\n");
done = 0;
printf("%d\n", i);
}
pandoc_exit();
}
I should now test performances for my use case. Thanks a lot for the help! Mentioning the init
/ exit
subtlety in the README
would be really useful I guess.
EDIT2: Just gave a try, 100k conversions with the above script takes about 100s. So one should count on 1ms per conversion (for a string as in my previous example). Sounds reasonable :)
@Phyks, I didn't see your edit, good to know you got it working. I also came up with an example (which uses a context instead of global variables):
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <pandoc.h>
#define BUF_SIZE 1024
struct data
{
const char *src;
size_t src_len;
size_t src_cur;
char *dst;
size_t dst_len;
size_t dst_cur;
};
int reader(char *buf, void *user_data)
{
struct data *d = user_data;
/* See how much is left to read */
size_t to_write = d->src_len - d->src_cur;
/* If finished, tell that to pandoc */
if (to_write == 0)
return 0;
/* Make sure not to overflow the buffer */
if (to_write > BUF_SIZE)
to_write = BUF_SIZE;
/* Copy this chunk of data */
memcpy(buf, d->src + d->src_cur, to_write * sizeof *buf);
d->src_cur += to_write;
return to_write;
}
void writer(const char *buf, int len, void *user_data)
{
struct data *d = user_data;
/* If current buffer is too small, increase its size */
if (d->dst_cur + len > d->dst_len)
{
d->dst_len = d->dst_len == 0 ? 16 : d->dst_len * 2;
if (d->dst_len < len)
d->dst_len = len;
char *new_dst = realloc(d->dst, d->dst_len * sizeof *d->dst);
assert(new_dst != NULL);
d->dst = new_dst;
}
/* Append what pandoc has produced */
memcpy(d->dst + d->dst_cur, buf, len * sizeof *buf);
d->dst_cur += len;
}
int main()
{
const char *settings =
"{\n"
"\"writerOptions\": {\n"
"\"writerStandalone\": false\n"
"}\n"
"}";
const char *strings[] = {
"Some _text_ here",
"More stuff to **convert** with `pandoc`",
"See [here](https://github.com/ShabbyX/libpandoc) for info",
};
const char *targets[] = {
"html",
"latex",
};
pandoc_init();
for (size_t s = 0; s < sizeof strings / sizeof *strings; ++s)
{
printf("Converting: %s\n", strings[s]);
for (size_t t = 0; t < sizeof targets / sizeof *targets; ++t)
{
struct data d = {
.src = strings[s],
.src_len = strlen(strings[s]),
/* Note: everything else is 0 initalized */
};
printf(" To %s:\n", targets[t]);
char *error = pandoc(BUF_SIZE,
"markdown",
targets[t],
settings,
reader,
writer,
&d);
if (error)
printf(" Error: %s\n", error);
else
{
printf(" ");
fwrite(d.dst, sizeof *d.dst, d.dst_cur, stdout);
printf("\n");
}
free(d.dst);
free(error);
}
printf("\n");
}
pandoc_exit();
return 0;
}
The output is:
Converting: Some _text_ here
To html:
<p>Some <em>text</em> here</p>
To latex:
Some \emph{text} here
Converting: More stuff to **convert** with `pandoc`
To html:
<p>More stuff to <strong>convert</strong> with <code>pandoc</code></p>
To latex:
More stuff to \textbf{convert} with \texttt{pandoc}
Converting: See [here](https://github.com/ShabbyX/libpandoc) for info
To html:
<p>See <a href="https://github.com/ShabbyX/libpandoc">here</a> for info</p>
To latex:
See \href{https://github.com/ShabbyX/libpandoc}{here} for info
By the way, if you want to measure performance, make sure to exclude pandoc_init()
and pandoc_exit()
from it, as that's quite slow.
Thanks! -- Phyks
Le 1 avril 2017 23:08:32 UTC+02:00, Shahbaz Youssefi notifications@github.com a écrit :
@Phyks, I didn't see your edit, good to know you got it working. I also came up with an example (which uses a context instead of global variables):
#include <stdio.h> #include <string.h> #include <stdlib.h> #include <assert.h> #include <pandoc.h> #define BUF_SIZE 1024 struct data { const char *src; size_t src_len; size_t src_cur; char *dst; size_t dst_len; size_t dst_cur; }; int reader(char *buf, void *user_data) { struct data *d = user_data; /* See how much is left to read */ size_t to_write = d->src_len - d->src_cur; /* If finished, tell that to pandoc */ if (to_write == 0) return 0; /* Make sure not to overflow the buffer */ if (to_write > BUF_SIZE) to_write = BUF_SIZE; /* Copy this chunk of data */ memcpy(buf, d->src + d->src_cur, to_write * sizeof *buf); d->src_cur += to_write; return to_write; } void writer(const char *buf, int len, void *user_data) { struct data *d = user_data; /* If current buffer is too small, increase its size */ if (d->dst_cur + len > d->dst_len) { d->dst_len = d->dst_len == 0 ? 16 : d->dst_len * 2; char *new_dst = realloc(d->dst, d->dst_len * sizeof *d->dst); assert(new_dst != NULL); d->dst = new_dst; } /* Append what pandoc has produced */ memcpy(d->dst + d->dst_cur, buf, len * sizeof *buf); d->dst_cur += len; } int main() { const char *settings = "{\n" "\"writerOptions\": {\n" "\"writerStandalone\": false\n" "}\n" "}"; const char *strings[] = { "Some _text_ here", "More stuff to **convert** with `pandoc`", "See [here](https://github.com/ShabbyX/libpandoc) for info", }; const char *targets[] = { "html", "latex", }; pandoc_init(); for (size_t s = 0; s < sizeof strings / sizeof *strings; ++s) { printf("Converting: %s\n", strings[s]); for (size_t t = 0; t < sizeof targets / sizeof *targets; ++t) { struct data d = { .src = strings[s], .src_len = strlen(strings[s]), /* Note: everything else is 0 initalized */ }; printf(" To %s:\n", targets[t]); char *error = pandoc(BUF_SIZE, "markdown", targets[t], settings, reader, writer, &d); if (error) printf(" Error: %s\n", error); else { printf(" "); fwrite(d.dst, sizeof *d.dst, d.dst_cur, stdout); printf("\n"); } free(error); } printf("\n"); } pandoc_exit(); return 0; }
By the way, if you want to measure performance, make sure to exclude
pandoc_init()
andpandoc_exit()
from it, as that's quite slow.
Hi,
I may have missed something, but I would have expected something like this to be working
That is, to convert multiple strings, I could just
pandoc_init
, convert, andpandoc_exit
and then do all of this again.But it does not seem to work, and I get a segfault. I am not sure exactly why this is happening, it seems to be coming from the haskell platform.
Thanks!