SDM-TIB / SDM-RDFizer

An Efficient RML-Compliant Engine for Knowledge Graph Construction
https://doi.org/10.5281/zenodo.3872103
Apache License 2.0

control logging #23

Closed VladimirAlexiev closed 4 years ago

VladimirAlexiev commented 4 years ago

I'm converting a moderately sized table of 1.4M rows and 33 fields (it was a 280 MB CSV). It produces these files:

So the logs are about 30% of the output, and I expect this to cause a comparable slow-down. The stats file is especially wasteful: it prints one line per triple (or maybe per subject map):

"Dataset","Number of the triple","Time"
"customer","1","16.364003896713257"
"customer","2","16.364003896713257"
"customer","3","16.365002632141113"
"customer","4","16.365002632141113"
"customer","5","16.365999460220337"
"customer","6","16.3670015335083"
"customer","7","16.368016719818115"

Some progress indication is appreciated, but please print something every 10k rows, not for every triple.
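
To illustrate the request, here is a minimal sketch of a throttled stats writer that keeps the same three CSV columns but writes a row only at every 10k-th triple. The class name `ThrottledStatsWriter`, its `every` parameter, and the `demo` helper are hypothetical names chosen for illustration; they are not part of SDM-RDFizer.

```python
import csv
import io
import time


class ThrottledStatsWriter:
    """Write one stats row every `every` triples instead of one per triple.

    Hypothetical helper for illustration only; not SDM-RDFizer code.
    """

    def __init__(self, out, dataset, every=10_000, clock=time.time):
        # QUOTE_ALL reproduces the fully quoted style of the stats file above.
        self.writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
        self.writer.writerow(["Dataset", "Number of the triple", "Time"])
        self.dataset = dataset
        self.every = every
        self.clock = clock
        self.start = clock()
        self.count = 0

    def triple_emitted(self):
        """Call once per generated triple; writes only on every `every`-th call."""
        self.count += 1
        if self.count % self.every == 0:
            elapsed = self.clock() - self.start
            self.writer.writerow([self.dataset, self.count, elapsed])


def demo(n_triples=25_000, every=10_000):
    """Simulate `n_triples` triples and return the stats lines that were written."""
    buf = io.StringIO()
    stats = ThrottledStatsWriter(buf, "customer", every=every)
    for _ in range(n_triples):
        stats.triple_emitted()
    return buf.getvalue().splitlines()
```

With 25,000 simulated triples, `demo()` returns the header plus two rows (for triples 10000 and 20000) rather than 25,000 rows.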

eiglesias34 commented 4 years ago

I changed the logging so that a line is generated only after every 10k triples.

VladimirAlexiev commented 4 years ago

Thanks! I see it for customer_datasets_stats.csv.

However, stdout+stderr can be improved:

E.g.:

Row 10000, PK <whatever1>, TM <person/(customer_id)!map>, total 21000 triples
Row 20000, PK <whatever2>, TM <person/(customer_id)!map>, total 42000 triples
...
Row 10000, PK <whatever1>, TM <person/(customer_id)/birth!map>, total 10020000 triples
Row 20000, PK <whatever2>, TM <person/(customer_id)/birth!map>, total 10040000 triples
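
A rough sketch of how lines in the format above could be produced. It assumes the engine can supply, for each processed row, its primary key and the number of triples that row generated; `format_progress` and `progress_lines` are hypothetical names, not SDM-RDFizer functions.

```python
def format_progress(row_number, pk, triples_map, total_triples):
    """Render one progress line in the format proposed above."""
    return (
        f"Row {row_number}, PK <{pk}>, TM <{triples_map}>, "
        f"total {total_triples} triples"
    )


def progress_lines(rows, triples_map, every=10_000, start_total=0):
    """Yield a progress line at every `every`-th row.

    `rows` is an iterable of (primary_key, triples_generated_for_row) pairs.
    This input shape is an assumption made for illustration.
    `start_total` carries the running triple count over from earlier triples maps.
    """
    total = start_total
    for row_number, (pk, n_triples) in enumerate(rows, start=1):
        total += n_triples
        if row_number % every == 0:
            yield format_progress(row_number, pk, triples_map, total)
```

For example, feeding it 20,000 rows that each yield 2 triples gives two lines, the first being `Row 10000, PK <c10000>, TM <person/(customer_id)!map>, total 20000 triples` when the keys are `c1`, `c2`, and so on.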

dachafra commented 4 years ago

@VladimirAlexiev these statistics are provided in this form because they are used to calculate dief@k and dief@t: https://github.com/maribelacosta/dief
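
For context, a minimal sketch of how a dief@t-style value could be derived from such a trace, assuming dief@t is the area under the (time, number of triples) curve up to time t and approximating that area with the trapezoidal rule. The function `dief_at_t` below is a hypothetical illustration, not the API of the dief package. Under this assumption, a trace sampled every 10k triples still yields an approximation of the same area.

```python
def dief_at_t(trace, t):
    """Approximate the area under an answer trace up to time `t`.

    `trace` is a list of (time_in_seconds, triples_produced_so_far) pairs,
    sorted by time, i.e. the "Time" and "Number of the triple" columns.
    Trapezoidal integration is an assumption made for this sketch.
    """
    points = [(time, count) for time, count in trace if time <= t]
    area = 0.0
    # Sum the trapezoid between each pair of consecutive sample points.
    for (t0, n0), (t1, n1) in zip(points, points[1:]):
        area += (t1 - t0) * (n0 + n1) / 2.0
    return area


# Hypothetical trace sampled once every 10k triples.
sampled_trace = [(1.0, 10_000), (2.0, 20_000), (4.0, 30_000)]
```

Here `dief_at_t(sampled_trace, 4.0)` sums the trapezoids 1 × 15000 and 2 × 25000.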