Closed TBoonX closed 1 year ago
Can you run
rpt sansa analyze csv your-data.csv --out-file report.ttl
(See also https://sansa-stack.github.io/SANSA-Stack/cli/tarql.html#inspecting-csv-files)
and check whether it can parse the CSV file correctly? It should output an RDF document with parsing information about each split of the CSV file.
Output:
@prefix eg: <http://www.example.org/> .
@prefix xds: <http://www.w3.org/2001/XMLSchema#> .
_:b0 eg:totalDuration "0.012"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b1 eg:regionEndProbeResult _:b0 ;
eg:totalElementCount "860636"^^xds:long ;
eg:totalBytesRead "33554513"^^xds:long ;
eg:totalTime "0.7905820520000001"^^xds:double ;
eg:splitStart "0"^^xds:long ;
eg:tailElementCount "1"^^xds:int ;
eg:regionStartProbeResult _:b2 ;
eg:splitSize "33554432"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:totalRecordCount "860637"^^xds:long ;
eg:regionStartSearchReadOverRegionEnd false .
_:b2 eg:totalDuration "0.0"^^xds:double ;
eg:probeCount "0"^^xds:long ;
eg:candidatePos "0"^^xds:long .
_:b3 eg:regionEndProbeResult _:b4 ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:tailElementCount "1"^^xds:int ;
eg:totalBytesRead "33554469"^^xds:long ;
eg:splitStart "33554432"^^xds:long ;
eg:totalRecordCount "816606"^^xds:long ;
eg:totalElementCount "816606"^^xds:long ;
eg:totalTime "0.41145543700000003"^^xds:double ;
eg:splitSize "33554432"^^xds:long ;
eg:regionStartProbeResult _:b5 .
_:b4 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "37"^^xds:long .
_:b5 eg:totalDuration "0.007"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b6 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "45"^^xds:long .
_:b7 eg:totalDuration "0.008"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "37"^^xds:long .
_:b8 eg:totalTime "0.423793693"^^xds:double ;
eg:totalRecordCount "799251"^^xds:long ;
eg:tailElementCount "1"^^xds:int ;
eg:totalElementCount "799251"^^xds:long ;
eg:splitSize "33554432"^^xds:long ;
eg:splitStart "67108864"^^xds:long ;
eg:totalBytesRead "33554477"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:regionStartProbeResult _:b7 ;
eg:regionEndProbeResult _:b6 .
_:b9 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "45"^^xds:long .
_:b10 eg:regionStartProbeResult _:b9 ;
eg:totalTime "0.34524198"^^xds:double ;
eg:tailElementCount "1"^^xds:int ;
eg:totalRecordCount "810578"^^xds:long ;
eg:totalElementCount "810578"^^xds:long ;
eg:splitStart "100663296"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:splitSize "33554432"^^xds:long ;
eg:totalBytesRead "33554513"^^xds:long ;
eg:regionEndProbeResult _:b11 ;
eg:regionStartSearchReadOverRegionEnd false .
_:b11 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b12 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b13 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b14 eg:splitStart "134217728"^^xds:long ;
eg:totalBytesRead "33554513"^^xds:long ;
eg:regionEndProbeResult _:b12 ;
eg:totalTime "0.33095979400000003"^^xds:double ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:totalRecordCount "803622"^^xds:long ;
eg:tailElementCount "1"^^xds:int ;
eg:splitSize "33554432"^^xds:long ;
eg:totalElementCount "803622"^^xds:long ;
eg:regionStartProbeResult _:b13 ;
eg:regionStartSearchReadOverRegionEnd false .
_:b15 eg:splitStart "167772160"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionStartProbeResult _:b16 ;
eg:regionEndProbeResult _:b17 ;
eg:tailElementCount "1"^^xds:int ;
eg:splitSize "33554432"^^xds:long ;
eg:totalTime "0.35778299"^^xds:double ;
eg:totalRecordCount "804767"^^xds:long ;
eg:totalBytesRead "33554497"^^xds:long ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalElementCount "804767"^^xds:long .
_:b16 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "81"^^xds:long .
_:b17 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "65"^^xds:long .
_:b18 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "65"^^xds:long .
_:b19 eg:regionStartSearchReadOverSplitEnd false ;
eg:splitStart "201326592"^^xds:long ;
eg:totalBytesRead "33554507"^^xds:long ;
eg:regionEndProbeResult _:b20 ;
eg:splitSize "33554432"^^xds:long ;
eg:totalRecordCount "812414"^^xds:long ;
eg:tailElementCount "1"^^xds:int ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalElementCount "812414"^^xds:long ;
eg:totalTime "0.334013382"^^xds:double ;
eg:regionStartProbeResult _:b18 .
_:b20 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "75"^^xds:long .
_:b21 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "82"^^xds:long .
_:b22 eg:totalRecordCount "809813"^^xds:long ;
eg:splitStart "234881024"^^xds:long ;
eg:regionStartProbeResult _:b23 ;
eg:totalBytesRead "33554514"^^xds:long ;
eg:totalTime "0.361883591"^^xds:double ;
eg:regionEndProbeResult _:b21 ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalElementCount "809813"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:splitSize "33554432"^^xds:long ;
eg:tailElementCount "1"^^xds:int .
_:b23 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "75"^^xds:long .
_:b24 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "82"^^xds:long .
_:b25 eg:totalElementCount "814884"^^xds:long ;
eg:splitStart "268435456"^^xds:long ;
eg:splitSize "33554432"^^xds:long ;
eg:tailElementCount "1"^^xds:int ;
eg:totalTime "0.34096873"^^xds:double ;
eg:totalBytesRead "33554494"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionStartProbeResult _:b24 ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalRecordCount "814884"^^xds:long ;
eg:regionEndProbeResult _:b26 .
_:b26 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "62"^^xds:long .
_:b27 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "70"^^xds:long .
_:b28 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "62"^^xds:long .
_:b29 eg:regionStartProbeResult _:b28 ;
eg:splitStart "301989888"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:tailElementCount "1"^^xds:int ;
eg:splitSize "33554432"^^xds:long ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalElementCount "812127"^^xds:long ;
eg:totalRecordCount "812127"^^xds:long ;
eg:totalTime "0.34634130300000004"^^xds:double ;
eg:totalBytesRead "33554502"^^xds:long ;
eg:regionEndProbeResult _:b27 .
_:b30 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "70"^^xds:long .
_:b31 eg:regionStartSearchReadOverRegionEnd false ;
eg:tailElementCount "1"^^xds:int ;
eg:totalTime "0.33287019900000003"^^xds:double ;
eg:regionStartProbeResult _:b30 ;
eg:splitStart "335544320"^^xds:long ;
eg:totalRecordCount "809327"^^xds:long ;
eg:splitSize "33554432"^^xds:long ;
eg:totalBytesRead "33554491"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:totalElementCount "809327"^^xds:long ;
eg:regionEndProbeResult _:b32 .
_:b32 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "59"^^xds:long .
_:b33 eg:totalTime "0.345964153"^^xds:double ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalElementCount "811930"^^xds:long ;
eg:regionEndProbeResult _:b34 ;
eg:tailElementCount "1"^^xds:int ;
eg:splitStart "369098752"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:totalBytesRead "33554494"^^xds:long ;
eg:splitSize "33554432"^^xds:long ;
eg:totalRecordCount "811930"^^xds:long ;
eg:regionStartProbeResult _:b35 .
_:b35 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "59"^^xds:long .
_:b34 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "62"^^xds:long .
_:b36 eg:totalTime "0.32924473800000004"^^xds:double ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionEndProbeResult _:b37 ;
eg:tailElementCount "1"^^xds:int ;
eg:splitSize "33554432"^^xds:long ;
eg:regionStartProbeResult _:b38 ;
eg:totalElementCount "797831"^^xds:long ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalRecordCount "797831"^^xds:long ;
eg:totalBytesRead "33554493"^^xds:long ;
eg:splitStart "402653184"^^xds:long .
_:b37 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "61"^^xds:long .
_:b38 eg:totalDuration "0.004"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "62"^^xds:long .
_:b39 eg:totalDuration "0.061000000000000006"^^xds:double ;
eg:probeCount "0"^^xds:long ;
eg:candidatePos "-1"^^xds:long .
_:b40 eg:splitStart "436207616"^^xds:long ;
eg:totalElementCount "570104"^^xds:long ;
eg:regionStartSearchReadOverRegionEnd false ;
eg:totalRecordCount "570103"^^xds:long ;
eg:regionStartSearchReadOverSplitEnd false ;
eg:regionStartProbeResult _:b41 ;
eg:splitSize "24421635"^^xds:long ;
eg:totalBytesRead "24421635"^^xds:long ;
eg:regionEndProbeResult _:b39 ;
eg:tailElementCount "0"^^xds:int ;
eg:totalTime "0.30285072300000004"^^xds:double .
_:b41 eg:totalDuration "0.005"^^xds:double ;
eg:probeCount "1"^^xds:long ;
eg:candidatePos "61"^^xds:long .
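A quick way to sanity-check a report like the one above is to confirm that the splits cover the file without gaps and to add up the per-split record counts. The following is only a minimal sketch: it assumes the report looks like the pasted text (one predicate/object pair per line, blank-node subjects) and uses a regular expression rather than a real Turtle parser. The function names are made up for illustration.

```python
import re

# One predicate/object pair as it appears in the pasted report, e.g.
#   _:b1 eg:regionEndProbeResult _:b0 ;
#   eg:splitStart "0"^^xds:long ;
# The leading blank-node subject is optional (only on the first line of a block).
PAIR = re.compile(
    r'(?:(_:b\d+)\s+)?eg:(\w+)\s+(?:"([^"]*)"\^\^\w+:\w+|(\S+))\s*[;.]\s*$'
)


def parse_report(text):
    """Group predicate/value pairs by blank-node subject."""
    nodes, current = {}, None
    for line in text.splitlines():
        m = PAIR.match(line.strip())
        if not m:
            continue  # @prefix declarations, empty lines, anything unexpected
        subject, predicate, literal, other = m.groups()
        if subject:
            current = subject
        if current is None:
            continue
        nodes.setdefault(current, {})[predicate] = (
            literal if literal is not None else other
        )
    return nodes


def summarize_splits(nodes):
    """Check that splits are contiguous and sum their record counts."""
    splits = sorted(
        (n for n in nodes.values() if "splitStart" in n),
        key=lambda n: int(n["splitStart"]),
    )
    expected_start, gaps = 0, []
    for split in splits:
        start, size = int(split["splitStart"]), int(split["splitSize"])
        if start != expected_start:
            gaps.append(start)  # this split does not begin where the last ended
        expected_start = start + size
    return {
        "splits": len(splits),
        "bytes_covered": expected_start,
        "records": sum(int(s["totalRecordCount"]) for s in splits),
        "gaps": gaps,
    }


# Usage (report.ttl is the file written by --out-file above):
# with open("report.ttl", encoding="utf-8") as f:
#     print(summarize_splits(parse_report(f.read())))
```

For the report pasted above, this yields 14 splits with no gaps, covering 460629251 bytes, which matches a CSV file of roughly 440 MB.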
Hm, so the CSV parsing looks OK. What happens when you increase the Kryo buffer size?
EXTRA_OPTS="-Dspark.kryoserializer.buffer.max=2000000000" rpt
java -D "-Dspark.kryoserializer.buffer.max=2000000000" -jar
Correction:
java -Dspark.kryoserializer.buffer.max=2048 -jar ...
The maximum buffer size has to be below 2048 MB, and the -D parameter has to be passed as a single -Dkey=value argument rather than as a separate quoted string after -D.
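To illustrate why a value such as 2000000000 is not usable here: the Spark configuration documentation describes spark.kryoserializer.buffer.max as being in MiB unless a unit is given, and says it must be less than 2048m. The sketch below mimics that rule. It is an assumption-laden illustration, not Spark's actual parsing code: it only handles a subset of suffixes (k, m, g), and the function names are hypothetical.

```python
import re

# Size of each supported suffix expressed in MiB (simplified subset).
_UNIT_IN_MIB = {"k": 1 / 1024, "m": 1, "g": 1024}

# Upper bound taken from the Spark docs wording "must be less than 2048m".
LIMIT_MIB = 2048


def kryo_buffer_max_mib(value):
    """Interpret a setting string; a bare number is read as MiB."""
    m = re.fullmatch(r"(\d+)([kmg])?", value.strip().lower())
    if not m:
        raise ValueError(f"unsupported size string: {value!r}")
    number, unit = m.groups()
    return int(number) * _UNIT_IN_MIB[unit or "m"]


def is_below_limit(value):
    """True if the value is strictly below the documented bound."""
    return kryo_buffer_max_mib(value) < LIMIT_MIB


print(kryo_buffer_max_mib("2000000000"))  # read as MiB, far above the bound
print(is_below_limit("2000000000"))
print(is_below_limit("1g"))
```

Under this reading, 2000000000 is taken as two billion MiB rather than two billion bytes, which is why the number has to be given in MiB (or with an explicit unit suffix).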
Does increasing the Kryo buffer size have any effect? Since the CSV parsing seems to work, this would indicate that a single partition of CSV data maps to a very large amount of RDF data (maybe the mapping produces many duplicates?).
A mapping where several thousand triples are attached to the same subject (e.g. due to an incorrect mapping) might also cause this issue. As far as I remember, it was related to very large Turtle blocks being formed, which exceed internal thresholds.
Maybe switching to N-Triples serialization makes the issue go away?
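One way to test the hypothesis that many triples end up on the same subject is to count how often each value occurs in the CSV column that the mapping uses to build the subject. The sketch below does that with the standard library. The column name is a placeholder, since the contents of mapping.rq are not shown in this thread, and the function name is made up for illustration.

```python
import csv
from collections import Counter


def rows_per_subject(csv_path, subject_column, top=10):
    """Return the most frequent values of the column used for the subject.

    A value that occurs thousands of times means thousands of rows would be
    mapped onto one subject, i.e. one very large block in Turtle output.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        counts = Counter(row[subject_column] for row in csv.DictReader(f))
    return counts.most_common(top)


# Usage ("id" is a hypothetical column name; use the one from your mapping):
# for value, n in rows_per_subject("your-data.csv", "id"):
#     print(n, value)
```

If the top counts are all small, the skewed-subject explanation becomes less likely and duplicates or the sheer partition size are the more plausible cause.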
It works with the buffer parameter, thanks!
I updated the SANSA CLI to use Kryo's max buffer size of 2048 by default. It can still be overridden with a smaller value, but I am not sure there is a good reason to do so.
https://github.com/SANSA-Stack/SANSA-Stack/commit/49483992c978f0f44d777abca91aae3dc2167103 (I realize I should have created a separate issue at sansa but oh well)
I started with:
sansa query mapping.rq
The mapping file is simple, and the linked CSV is simple but long (440 MB). For the first few minutes, jobs were using all my cores; then at most two jobs kept my cores busy for a while, and after 10 minutes or so the error was thrown. Here is the end of the log:
I have 64 GB of RAM, which was at 60% usage, and 16 logical cores.