Closed VladimirAlexiev closed 4 years ago
I changed the logs so that they only produce output every 10k triples.
Thanks! I see it for customer_datasets_stats.csv.
However, stdout+stderr can be improved:
E.g.:
```
Row 10000, PK <whatever1>, TM <person/(customer_id)!map>, total 21000 triples
Row 20000, PK <whatever2>, TM <person/(customer_id)!map>, total 42000 triples
...
Row 10000, PK <whatever1>, TM <person/(customer_id)/birth!map>, total 10020000 triples
Row 20000, PK <whatever2>, TM <person/(customer_id)/birth!map>, total 10040000 triples
```
@VladimirAlexiev these statistics are provided in this form because they are used to calculate dief@k and dief@t: https://github.com/maribelacosta/dief
I'm converting a moderately sized table of 1.4M rows and 33 fields (280 MB as CSV). It produces these files:
So the logs are about 30% of the output, and I expect a comparable slow-down. The stats file is especially wasteful: it prints one line per triple (or maybe per subject map):
Some progress indication is appreciated, but please print something every 10k rows, not for every triple.
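For illustration, here is a minimal sketch of the kind of throttled progress reporting requested above. The function and parameter names (`serialize_rows`, `emit_triples`, `log_every`) are hypothetical stand-ins, not the actual RDFizer API; the point is only that the progress line is printed once per `log_every` rows rather than once per triple:

```python
def serialize_rows(rows, emit_triples, log_every=10_000):
    """Emit triples for each row, logging progress only every `log_every` rows.

    `rows` is any iterable of source rows; `emit_triples(row)` is a
    hypothetical callback that materializes the triples for one row and
    returns how many were produced.
    """
    total_triples = 0
    for i, row in enumerate(rows, start=1):
        total_triples += emit_triples(row)
        # Throttled progress line: one print per 10k rows, not per triple.
        if i % log_every == 0:
            print(f"Row {i}, total {total_triples} triples")
    return total_triples
```

With 1.4M rows this prints about 140 progress lines in total, instead of one line per generated triple.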