Open vikramsubramanian opened 4 months ago
Summary: The `COPY FROM` statement runs into an infinite loop when reading Turtle files whose last line is malformed.
To address the infinite loop when importing Turtle files with malformed last lines using `COPY FROM`, follow these steps:

1. Review the `readChunk` function in `rdf_reader.cpp` within the `rdf-timer` branch.
2. Ensure the `readChunk` function properly checks for end-of-file (EOF) conditions and does not rely on reading a specific terminating character or line that may be missing in malformed files.
3. In the `readChunk` function, add a condition to break out of the loop if the end of the file is reached or if the Serd parser returns an error status indicating that it cannot proceed further.
4. Check `rdf_reader.cpp` for proper error handling. The Serd parser should signal an error or EOF status that can be used to exit the loop.
5. Ensure `serd_reader_end_stream` is called after the loop to properly clean up and signal the end of the stream.
6. Review the `third_party/serd/src/serdi.c` and `third_party/serd/src/n3.c` files to ensure they correctly handle malformed input and EOF.
7. Test the `COPY FROM` command with both `latest-lexemes-400KB.ttl` and `latest-lexemes-400KB-last-line-removed.ttl` to confirm that the infinite loop issue is resolved.

Remember to test the changes thoroughly with various Turtle files, both well-formed and malformed, to ensure that the `COPY FROM` command behaves correctly in all scenarios.
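The loop-exit conditions suggested in the steps above can be illustrated with a small, self-contained sketch. It does not use Serd or the project's actual `readChunk`; `ToyParser`, `ChunkStatus`, `ReadResult`, and `readAll` are hypothetical stand-ins, and the "statement ends at `.`" rule is a deliberate simplification of Turtle. The point is only the shape of the loop: stop on EOF, stop on error, and stop if a call made no progress.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-ins: these are NOT Serd's or the project's real types.
enum class ChunkStatus { Ok, EndOfInput, Error };

// Toy parser: a "statement" is complete only when terminated by '.'.
struct ToyParser {
    std::string input;
    std::size_t pos = 0;

    explicit ToyParser(std::string text) : input(std::move(text)) {}

    ChunkStatus readChunk() {
        // Skip whitespace between statements.
        while (pos < input.size() &&
               std::isspace(static_cast<unsigned char>(input[pos]))) {
            ++pos;
        }
        if (pos >= input.size()) {
            return ChunkStatus::EndOfInput;
        }
        std::size_t dot = input.find('.', pos);
        if (dot == std::string::npos) {
            // Unterminated trailing statement: report an error instead of
            // returning Ok without consuming anything.
            return ChunkStatus::Error;
        }
        pos = dot + 1;
        return ChunkStatus::Ok;
    }
};

struct ReadResult {
    std::size_t statements;
    ChunkStatus last;
};

// Reads until EOF or error. The no-progress check is a defensive guard
// against a parser that reports Ok without advancing.
ReadResult readAll(ToyParser& parser) {
    std::size_t count = 0;
    while (true) {
        std::size_t before = parser.pos;
        ChunkStatus status = parser.readChunk();
        if (status != ChunkStatus::Ok) {
            return {count, status};
        }
        if (parser.pos == before) {
            return {count, ChunkStatus::Error};
        }
        ++count;
    }
}
```

With a well-formed input such as `"a b c .\nd e f .\n"`, `readAll` returns two statements and `EndOfInput`; with a truncated input such as `"a b c .\nd e"`, it returns one statement and `Error` rather than spinning. Whether the real reader needs the no-progress guard, the status check, or both depends on what the Serd reader actually returns for the attached file, which this sketch does not establish.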
This file contains the logic for reading and parsing Turtle files, which is directly related to the issue of an infinite loop when reading malformed Turtle files.
This file contains functions for reading Turtle and Trig documents, which may contain the loop causing the infinite read issue.
I was playing around with generating different subsets of a Turtle file. I was generating the files with the `head` command.

I cannot import any of the files I generate this way. The `COPY FROM` statement seems to be running into an infinite loop. Here is how you can reproduce this:

1. Rename the attached file: `mv latest-lexemes-400KB.ttl.txt latest-lexemes-400KB.ttl`.
2. Run `create rdfgraph testbug;` followed by `copy testbug from "latest-lexemes-400KB.ttl" (in_memory=true);`.

I say this is an infinite loop because on the `rdf-timer` branch, where there is some progress report printing, I see that some of the lines print continuously and the reader keeps reading hundreds of millions of triples from a file with only thousands of lines. So either the handlers given to the Serd parser functions are getting called over and over again, or the `readChunk` code in `rdf_reader.cpp` keeps trying to pull data over and over again. I don't understand the logic.

If you open the file in Emacs and remove the last chunk, which I am attaching as another file, things work fine. What is surprising is that I tried to produce a minimal example that reproduces the bug but could not. For example, I tried a small example that copy-pastes the last chunk of the `latest-lexemes-400KB.ttl.txt` file into a simple example from our documentation, but I can't make `COPY FROM` get into an infinite loop. It may be something unrelated to the last line, or I may not be copy-pasting exactly the same characters somehow.
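One way to check the copy-paste hypothesis is to compare the raw trailing bytes of the failing file with those of the hand-made example. The helpers below are a minimal sketch for that purpose; `tailHex` and `readFile` are names introduced here for illustration and are not part of the project.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Returns the last n bytes of data as space-separated lowercase hex pairs.
// If data is shorter than n, the whole string is encoded.
std::string tailHex(const std::string& data, std::size_t n) {
    std::size_t start = data.size() > n ? data.size() - n : 0;
    std::string out;
    char buf[4];
    for (std::size_t i = start; i < data.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%02x ",
                      static_cast<unsigned char>(data[i]));
        out += buf;
    }
    if (!out.empty()) {
        out.pop_back();  // drop the trailing space
    }
    return out;
}

// Reads a whole file in binary mode so no bytes are translated.
// Returns an empty string if the file cannot be opened.
std::string readFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Printing `tailHex(readFile("latest-lexemes-400KB.ttl"), 64)` next to the same call on the small example would show whether the two files really end in the same bytes, including any missing final newline or a multi-byte character cut in half.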
Attachments: latest-lexemes-400KB.ttl.txt, latest-lexemes-400KB-last-line-removed.ttl.txt