Why fork guile-json? - Githubissues

rekado commented 4 years ago

I see that in the Dockerfile a forked variant of guile-json is installed. Why is the fork necessary? I see that in the fork a bunch of fields are made mutable. Mutation is pretty rare in idiomatic Scheme code, so I wonder why this is needed.

Since the fork is pretty young it would be great if we could undo it, perhaps by avoiding mutation in annotation-scheme.

What do you think?

Habush commented 4 years ago

Mutation is pretty rare in idiomatic Scheme code, so I wonder why this is needed.

It fork is needed to add custom records for parsing the scheme result from the annotation functions into json. It is mainly used in parser.scm. Please take a look if you have any suggestions, I can send a PR based on your suggestions

rekado commented 4 years ago

I wonder if we could forgo the parser and use read instead to read the atomese expressions into S-expressions directly.

Habush commented 4 years ago

I wonder if we could forgo the parser and use read instead to read the atomese expressions into S-expressions directly.

@linas do you think this is feasible?

linas commented 4 years ago

do you think this is feasible?

I do not understand what is being discussed here. I have not yet seen anything in the code that reads, writes or uses json in any way; so I can't comment on the json library. (I made a big mistake the other night; I thought the call/cc stuff was calling into json but this seems not to be the case.)

rekado commented 4 years ago

The atomese parser in parser.scm appears to do a lot of busy work that seems to be unnecessary. It attempts to build JSON directly from a string representation of atomese. That seems a bit convoluted because AFAIU atomese is in S-expression format (judging by the looks of the parser) --- so we can simply use Guile's read to turn an atomese string into a proper S-expression.

Now if the preferred format is JSON that's fine --- Guile JSON provides procedures to convert S-expressions to JSON. By replacing the parser with a mere invocation of read, followed by a conversion of the S-expr to JSON via the vanilla Guile JSON the custom fork of Guile JSON could simply be dropped, making the installation of annotation-scheme simpler and ridding the world of an unnecessary fork of an upstream library.

@Habush is there a need to use JSON internally? Or could annotation-scheme simply operate on S-expressions and only use JSON for external representations when requested (e.g. by a web server)?

rekado commented 4 years ago

After looking at the diff I think we can make this work without a parser and without a fork of Guile JSON. Nothing about the graph data structure that's used in the fork of Guile JSON really depends on JSON, so the graph records might as well be built from S-expression inputs.

My recommendation for a first step would be to move the graph records from the Guile JSON fork to annotation-scheme and to make them process and return S-expressions, i.e. json-build-object needn't be used.

I haven't really looked for places where JSON is used in annotation-scheme, but it would be trivial to replace all users of JSON in this code base with users of S-expressions and to provide S-expr-->JSON procedures where needed.

@Habush This seems like a fun simplification. I can't promise anything, but if I can free some time I'd be happy to make these changes.

linas commented 4 years ago

@rekado The following comment may or may not be relevant and maybe you already know this, but, just to clarify things a bit:

"Atomese" aren't just "s-expressions"; each BlahNode or FooLink is typically backed by a C++ class, usually with a non-trivial constructor that can be cpu-hungry (and RAM-hungry, too). If the goal is to merely convert s-expressions into json, then it would be better to do this treating everything as strings, rather than actually loading (i.e. executing) the s-expr. This is always possible because Atomese is side-effect free, there's no hidden state to trip you up, so you can treat it as a string and do rewrites on that string.
- Here's an example of how cpu/ram-hungry the atoms are: on my machine, the gene-database files, consisting of pure Atomese s-exprs, and no guile at all, load in approx 750 seconds (single-threaded), create 6.8 million atoms, for a rate of about 9K Atoms/sec. So, this is a significant. overhead. I was about ready to type "almost all of this time is lost in the C++ code, and not guile", but I just now looked at (gc-stats) and see more than 600 seconds in gc, so there is a small wtf that I do not currently understand, and needs investigation.
- Yikes! If I am not wrong/confused about this gc number, then it might be faster to read in the Atomese s-exprs as a string, (in C++) and then looping, in C++, to create the atoms, completely skipping guile. That is, currently, in the normal course of events, for every atomese atom, a guile smob is created; but then, shortly after, that smob is gc'ed. Perhaps this is a large source of overhead in file-loading.
- The point here being that each atom is an entry in a graph db, so there is no need to hold on to the corresponding smob. It can always be trivially/easily recreated from the graph db. The smobs are there only so that scheme code can manipulate specific atoms in specific ways, as needed. But otherwise, those smobs can be safely gc'ed as needed; the corresponding entry in the graph db is NOT erased.
- Thus, typically (i.e. in other apps), the graph DB can be many gigabytes in size, hold tens or hundreds of millions of atoms, while at the same time, the guile heap is some handful of megabytes.

I hope the above made sense to you, let me know if something was unclear.

linas commented 4 years ago

Again -- perhaps this is fairly off-topic, perhaps its relevant -- but I just re-measured file loading. I have 15 files, from the mozi wget location, each one consisting of "pure" atomese s-expressions. I am loading these simply by saying (load "/path/to/file/foo.scm") and it takes 831 seconds to load these (single-threaded, on a system that is also busy doing other things). Upon finishing loading, (gc-stats) reports:

$2 = ((gc-time-taken . 500744061959) (heap-size . 7921664) (heap-free-size . 565248) (heap-total-allocated . 52105911440) (heap-allocated-since-gc . 2132608) (protected-objects . 1) (gc-times . 18151))

So if I read this correctly, there were 500 seconds spent in gc. The heap is modest: 8MB, but a total of 52GB of heap was swept through! Wow! The gc had to run more than twice a second to keep up.

What's happening here is that during the file load, lots of SCM strings and symbols are getting created, as well as SCM smobs; these are being used to feed the graph data into the C++ code, and then these are being gc'ed shortly thereafter. Appearntly, 500/831 = 60% of the time is spent in GC. Also, since 6.8M atoms are created, I get 52105911440/6757334 = 7.8KBytes of stuff gc'ed, per atom created. That's really quite hefty. Every atom requires 500/6757334 = 74 usec spent in gc.

If the goal is faster file-loading, then it would be effective to write a C++ wrapper, to read the file as a string, walk the string as an ordinary s-expr, and then directly create the corresponding atoms, in C++, without ever-once touching guile. (Once loaded, these are, of course, as always, accessible to guile) We could call this utility (load-atomese) or something like that, and if cleanly written, it would be a suitable candidate for inclusion in the AtomSpace git repo.

rekado commented 4 years ago

If the goal is to merely convert s-expressions into json, then it would be better to do this treating everything as strings, rather than actually loading (i.e. executing) the s-expr. This is always possible because Atomese is side-effect free, there's no hidden state to trip you up, so you can treat it as a string and do rewrites on that string.

Yes, I understand. Currently, the S-expression isn't executed but read in a round-about way via a parser. I'm not suggesting to evaluate Atomese but to read the expression as data using read. It's a lost opportunity to read something that has an S-expression representation as a plain string, when we could instead operate at a higher level.

rekado commented 4 years ago

This has been solved with commit https://github.com/MOZI-AI/annotation-scheme/pull/121/commits/40194a0cd1013398d027fd17bb106a1fb41172a0 , which removes the need for the Gulie JSON fork.

MOZI-AI / annotation-scheme

Why fork guile-json? #93