Mayil-AI-Sandbox / kuzudb_jan15

MIT License

Infinite loop when reading Turtle files with malformed last lines (hashtag2821) #14

Open vikramsubramanian opened 4 months ago

vikramsubramanian commented 4 months ago

I was playing around with generating subsets of a Turtle file of different sizes, using the following head command:

 head -c 400000 latest-lexemes.ttl > latest-lexemes-400KB.ttl

I cannot import any of the files I generate this way. The COPY FROM statement seems to be running into an infinite loop. Here is how you can reproduce this.

  1. Download the attached latest-lexemes-400KB.ttl.txt file and do mv latest-lexemes-400KB.ttl.txt latest-lexemes-400KB.ttl.
  2. Open the Kuzu CLI and run: create rdfgraph testbug; copy testbug from "latest-lexemes-400KB.ttl" (in_memory=true);.
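For anyone who wants to script the truncation, here is a minimal Python sketch of what `head -c` does, plus a check for whether the cut landed in the middle of a line. The helper names `head_bytes` and `ends_mid_line` are made up for illustration. Note that ending on a newline does not guarantee the Turtle is well formed; it only rules out the most obvious kind of cut.

```python
def head_bytes(src, dst, n):
    """Copy the first n bytes of src to dst, like `head -c n src > dst`."""
    with open(src, "rb") as f:
        data = f.read(n)
    with open(dst, "wb") as f:
        f.write(data)
    return data


def ends_mid_line(data):
    """Return True if non-empty data does not end with a newline,
    i.e. the byte cut landed inside the last line."""
    return len(data) > 0 and not data.endswith(b"\n")
```

Calling `ends_mid_line(head_bytes("latest-lexemes.ttl", "latest-lexemes-400KB.ttl", 400000))` reports whether the generated file ends in the middle of a line.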

I say this is an infinite loop because on the rdf-timer branch, which prints some progress reports, I see some of the lines printing continuously, and the reader keeps reporting hundreds of millions of triples for a file with only thousands of lines. So the handles given to the Serd parser functions seem to be getting called over and over again, or maybe the readChunk code in rdf_reader.cpp keeps trying to pull data over and over again. I don't understand the logic.
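To make the second guess concrete, here is a toy Python model of a chunked read loop. It is not the actual rdf_reader.cpp or Serd code, and `parse_statements` and `read_all` are hypothetical names. The point is only that a loop which keeps going while unparsed bytes remain will never exit if the tail of the input can never be parsed, whereas stopping at end of input avoids that.

```python
import io


def parse_statements(buf):
    """Toy parser: consume everything up to and including the last b'.\\n'.
    Returns the number of bytes consumed (0 if no complete statement)."""
    end = buf.rfind(b".\n")
    return 0 if end < 0 else end + 2


def read_all(stream, chunk_size=16):
    """Feed the stream to the toy parser chunk by chunk.
    Returns (bytes consumed, unparsed leftover bytes)."""
    buf = b""
    consumed = 0
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        used = parse_statements(buf)
        consumed += used
        buf = buf[used:]
        # A loop condition like `while buf:` would spin forever here when the
        # input ends in an incomplete statement, because no more data arrives
        # and the parser never consumes the leftover. Stop at end of input.
        if chunk == b"":
            break
    return consumed, buf


print(read_all(io.BytesIO(b"a b c .\nd e <")))
```

In this model the truncated tail `d e <` is returned as leftover, so a caller can report it as an error or drop it.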

If you open the file in an editor (I used emacs) and remove the last chunk, which I am attaching as another file, things work fine. What is surprising is that I tried to produce a minimal example that reproduces the bug but could not. For example, I tried the small example below, which copies the last chunk of the latest-lexemes-400KB.ttl.txt file onto the end of a simple example from our documentation, but COPY FROM does not go into an infinite loop on it.

 kz: < .
PREFIX rdf: < .
PREFIX rdfs: < .

kz:Waterloo a kz:City ;
            kz:name "Waterloo" ;
            kz:population 150000 .

kz:Adam a kz:student ;
        kz:livesIn kz:Waterloo ;
        kz:name "Adam" ;
        kz:age  30 .

kz:student rdfs:subClassOf kz:person .

kz:Karissa a kz:student ;
           kz:bornIn kz:Waterloo ;
           kz:name "Karissa" .

kz:Zhang a kz:faculty ;
         kz:name "Zhang" .

kz:L504-F28 a kz:Form ;
        kz:P443 <

It may be something unrelated to the last line, or I may somehow not be copy-pasting exactly the same characters.
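One way to rule out the copy-paste theory is to compare the raw trailing bytes of the two files instead of what the editor shows. This is a small Python sketch; `tail_bytes` and `odd_bytes` are hypothetical helper names, and the choice of 80 bytes is arbitrary.

```python
import os


def tail_bytes(path, n=80):
    """Return the last n bytes of the file (or the whole file if shorter)."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - n))
        return f.read()


def odd_bytes(data):
    """Return the bytes that are neither printable ASCII nor tab/LF/CR."""
    return [b for b in data if b > 0x7E or (b < 0x20 and b not in (9, 10, 13))]
```

Printing `tail_bytes(path)` for both files shows escapes such as `\r` or multi-byte UTF-8 sequences that are easy to lose when copy-pasting, and `odd_bytes` lists them directly.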

Attachments: latest-lexemes-400KB.ttl.txt, latest-lexemes-400KB-last-line-removed.ttl.txt

mayil-ai[bot] commented 4 months ago

Summary: The COPY FROM statement enters an infinite loop when reading a Turtle file whose last line is malformed.

Possible Solution

To address the infinite loop issue when importing Turtle files with malformed last lines using COPY FROM, follow these steps:

Remember to test the changes thoroughly with various Turtle files, both well-formed and malformed, to ensure that the COPY FROM command behaves correctly in all scenarios.
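Since the failure mode is a hang rather than an error, such tests need a time limit. Below is a minimal Python sketch of a timeout check using the standard `subprocess` module; `finishes_within` is a hypothetical helper name, and the command that actually drives the COPY FROM import is left as a placeholder for whatever script or CLI invocation the test suite uses.

```python
import subprocess
import sys


def finishes_within(cmd, seconds):
    """Run cmd and return True if it exits within the time limit.
    On timeout, subprocess.run kills the child and we return False."""
    try:
        subprocess.run(cmd, timeout=seconds, check=False, capture_output=True)
    except subprocess.TimeoutExpired:
        return False
    return True


# Placeholder commands: replace with the real import command for each test file.
print(finishes_within([sys.executable, "-c", "pass"], 30))
print(finishes_within([sys.executable, "-c", "while True: pass"], 1))
```

A test for this issue would assert that the import of a truncated file finishes within the limit, whether it succeeds or reports a parse error.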

Code snippets to check