dajobe / raptor

Redland Raptor RDF syntax library
https://librdf.org/raptor/

Provide turtle chunk parser #26

Closed hroptatyr closed 9 years ago

hroptatyr commented 9 years ago
This changeset makes it possible to parse huge Turtle, TriG and N3 files.
"Huge" here means files larger than main memory.
It has been tested on the GND dataset of the Deutsche
Nationalbibliothek (121 MTriples), the DBpedia dataset (583 MTriples)
and a private production dataset (1225 MTriples).

Previously, the turtle parser tried to stack up all input in a huge
buffer which it then proceeded to process at once.

This changeset introduces a parser that attempts to parse each given
chunk immediately.  Syntax errors that arise because the buffer ends
in the middle of a grammar rule are handled by the special `error'
rule, whose error recovery copies the remainder of the buffer to the
beginning so that the next chunk can be appended to it.
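
The remainder handling described above can be illustrated with a small standalone sketch. This is not the code from the changeset: `chunk_feeder` and its functions are hypothetical names, and the toy "parser" simply treats every `.` byte as the end of a statement, which real Turtle does not allow (dots also occur inside IRIs and literals). It only shows the bookkeeping: parse what is complete, move the incomplete tail to the front, append the next chunk behind it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical chunk feeder; not part of raptor's API. */
typedef struct {
  char  *buf;         /* unconsumed remainder carried between chunks */
  size_t len;         /* number of bytes currently held in buf */
  size_t statements;  /* complete statements seen so far */
} chunk_feeder;

/* Feed one chunk. Returns 0 on success, -1 on allocation failure. */
static int chunk_feeder_feed(chunk_feeder *f, const char *chunk, size_t n)
{
  size_t i, consumed = 0;
  char *grown;

  assert(f != NULL && (chunk != NULL || n == 0));

  /* Append the new chunk behind whatever was left over last time. */
  grown = realloc(f->buf, f->len + n + 1);
  if(!grown)
    return -1;
  f->buf = grown;
  if(n)
    memcpy(f->buf + f->len, chunk, n);
  f->len += n;

  /* Toy assumption: every '.' closes exactly one statement. */
  for(i = 0; i < f->len; i++) {
    if(f->buf[i] == '.') {
      f->statements++;
      consumed = i + 1;
    }
  }

  /* Move the incomplete tail to the beginning of the buffer so the
   * next chunk can be appended to it. */
  memmove(f->buf, f->buf + consumed, f->len - consumed);
  f->len -= consumed;
  f->buf[f->len] = '\0';
  return 0;
}

/* Release the buffer and reset the length. */
static void chunk_feeder_free(chunk_feeder *f)
{
  free(f->buf);
  f->buf = NULL;
  f->len = 0;
}
```

Feeding `"<a> <b> <c> . <d> <e"` followed by `"> <f> ."` yields one statement after the first call, with `" <d> <e"` kept as the remainder, and a second statement after the next call. Memory use is bounded by the chunk size plus the longest incomplete statement rather than by the file size.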

Full turtle statements (the ones ending in DOT) will never be part
of the remainder.  However, because of blank nodes and collections,
statements can no longer be issued immediately.  Instead, the
emission of a statement is deferred.  This avoids dangling (bnodeid)
statements in cases where a turtle SPO statement isn't DOT ended yet
but its blank node property list or collection has already been read.
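
The deferral idea can be sketched as a queue that is emitted only once the enclosing statement is complete. Again this is an illustration under assumptions, not the changeset's code: plain strings stand in for raptor statements, all names are hypothetical, and discarding the queue when the statement is cut off is my reading of why deferral prevents dangling statements.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handler type; strings stand in for raptor statements. */
typedef void (*statement_handler)(void *user_data, const char *statement);

typedef struct deferred_node {
  char *statement;
  struct deferred_node *next;
} deferred_node;

typedef struct {
  deferred_node *head;
  deferred_node *tail;
  statement_handler handler;
  void *user_data;
} deferred_list;

/* Copy a statement and queue it instead of calling the handler now.
 * Returns 0 on success, -1 on allocation failure. */
static int deferred_push(deferred_list *list, const char *statement)
{
  size_t n;
  deferred_node *node;

  assert(list != NULL && statement != NULL);
  n = strlen(statement) + 1;
  node = malloc(sizeof(*node));
  if(!node)
    return -1;
  node->statement = malloc(n);
  if(!node->statement) {
    free(node);
    return -1;
  }
  memcpy(node->statement, statement, n);
  node->next = NULL;
  if(list->tail)
    list->tail->next = node;
  else
    list->head = node;
  list->tail = node;
  return 0;
}

/* Free every queued statement, optionally handing it to the handler first. */
static void deferred_release(deferred_list *list, int emit)
{
  deferred_node *node = list->head;
  while(node) {
    deferred_node *next = node->next;
    if(emit && list->handler)
      list->handler(list->user_data, node->statement);
    free(node->statement);
    free(node);
    node = next;
  }
  list->head = list->tail = NULL;
}

/* Statement ended in DOT: emit the queued statements in order. */
static void deferred_flush(deferred_list *list) { deferred_release(list, 1); }

/* Statement was cut off (assumed error path): drop the queue unseen. */
static void deferred_discard(deferred_list *list) { deferred_release(list, 0); }

/* Example handler: counts statements via a size_t passed as user data. */
static void count_handler(void *user_data, const char *statement)
{
  (void)statement;
  (*(size_t *)user_data)++;
}
```

With `count_handler` installed, pushing two statements leaves the counter at zero until `deferred_flush()` is called, while `deferred_discard()` frees the queue without ever invoking the handler.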

* struct raptor_turtle_parser_s: introduce slots for buffer bookkeeping
* turtle_lexer.l: use YY_USER_ACTION to keep track of buffer consumption
* turtle_parser.y:
  raptor_turtle_generate_statement(): split in two, see following
  raptor_turtle_clone_statement(): prepare statement for handling
  raptor_turtle_handle_statement(): call a parser's statement handler
  raptor_turtle_defer_statement(): like raptor_turtle_generate_statement()
    but instead of calling the statement handler immediately put
    it on a list of deferred statements, called (handled) only if
    the statement rule path has been taken (triples DOT)
  raptor_turtle_parse_chunk(): start parsing the chunk on every
    call; only stack things up in the buffer if the remainder of a
    chunk has been resolved through the `error' rule.

artob commented 9 years ago

:+1: Nice job. Will try and make time to test this on some of our larger datasets.

dajobe commented 9 years ago

Thanks a lot! I've been trying to figure out how to do this for some time. Will check it out!

dajobe commented 9 years ago

Thanks for the update. I did some basic tests but I'm curious how it deals with a large file that ends in \r or no final newline. I'm not clear on the termination logic.

dajobe commented 9 years ago

OK, looks good to me. I just want to check it myself on a large (>memory, >4G at least) file

dajobe commented 9 years ago

I checked it with a large file, looks great. Thanks for the contribution.

artob commented 9 years ago

For the changelog for the next release, it sounds as though this patch resolves issue 0000512.

AlexanderWillner commented 8 years ago

Nice. When is the next release planned that includes this patch?

dajobe commented 8 years ago

@AlexanderWillner I'll try to get it out before the end of this year