control logging

VladimirAlexiev commented 4 years ago

I'm converting a moderately sized table of 1.4M rows and 33 fields (a 280 MB CSV). It produces these files:

customer.nt (I have a 4.7 GB ttl from another tool, so I expect 10 GB of ntriples)
customer_datasets_stats.csv: 27% of nt
log.txt (stdout+stderr): 3% of nt

So the logs are about 30% of the output, and I expect this to produce a comparable slow-down. The stats file is especially wasteful: it prints one line per triple (or maybe per subject map):

"Dataset","Number of the triple","Time"
"customer","1","16.364003896713257"
"customer","2","16.364003896713257"
"customer","3","16.365002632141113"
"customer","4","16.365002632141113"
"customer","5","16.365999460220337"
"customer","6","16.3670015335083"
"customer","7","16.368016719818115"

Some progress indication is appreciated, but please print something every 10k rows, not for every triple.
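A minimal sketch of the kind of sampling requested above, not the project's actual code. The function name `write_sampled_stats`, its parameters, and the choice to record elapsed seconds since the start are assumptions made for illustration; only the three column names come from the sample above.

```python
import csv
import io
import time


def write_sampled_stats(triples, out, dataset, every=10_000, clock=time.time):
    """Write one stats row per `every` triples instead of one per triple.

    `triples` is any iterable; `out` is a writable text stream.
    Returns the total number of triples seen.
    """
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(["Dataset", "Number of the triple", "Time"])
    start = clock()
    count = 0
    for count, _ in enumerate(triples, start=1):
        # Only every `every`-th triple produces a row.
        if count % every == 0:
            writer.writerow([dataset, count, clock() - start])
    return count


if __name__ == "__main__":
    buf = io.StringIO()
    total = write_sampled_stats(range(25_000), buf, "customer")
    # 25,000 triples yield a header plus rows for triples 10000 and 20000.
    print(total, len(buf.getvalue().splitlines()))
```

With `every=10_000`, a run that emits millions of triples writes a few hundred stats rows rather than millions.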

eiglesias34 commented 4 years ago

I changed the logs so that something is generated only after every 10k triples.

VladimirAlexiev commented 4 years ago

Thanks! I see it for customer_datasets_stats.csv.

However, stdout+stderr can be improved:

it prints the table's PK but not which TriplesMap is being processed (the same row is processed by many TriplesMaps: #24)
it prints every PK. Instead, print once every 10k rows, and also print a counter of triples and maybe rows

E.g.:

Row 10000, PK <whatever1>, TM <person/(customer_id)!map>, total 21000 triples
Row 20000, PK <whatever2>, TM <person/(customer_id)!map>, total 42000 triples
...
Row 10000, PK <whatever1>, TM <person/(customer_id)/birth!map>, total 10020000 triples
Row 20000, PK <whatever2>, TM <person/(customer_id)/birth!map>, total 10040000 triples
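The proposed format could be produced by a small counter object like the one below. This is a sketch under assumptions: `ProgressReporter`, `start_map`, `row_done`, and the logger name are hypothetical and are not part of SDM-RDFizer. The row counter restarts for each TriplesMap while the triple total keeps running, as in the example lines above.

```python
import logging

logger = logging.getLogger("rdfizer.progress")  # hypothetical logger name


class ProgressReporter:
    """Emit one progress line every `every` rows of the current TriplesMap."""

    def __init__(self, every=10_000, log=logger.info):
        self.every = every
        self.log = log              # any callable that accepts a string
        self.total_triples = 0      # running total across all TriplesMaps
        self.row = 0                # rows processed for the current TriplesMap
        self.triples_map = None

    def start_map(self, triples_map):
        """Restart the row counter when a new TriplesMap begins."""
        self.triples_map = triples_map
        self.row = 0

    def row_done(self, pk, triples_emitted):
        """Record one processed row and report if the threshold is reached."""
        self.row += 1
        self.total_triples += triples_emitted
        if self.row % self.every == 0:
            self.log(
                "Row %d, PK <%s>, TM <%s>, total %d triples"
                % (self.row, pk, self.triples_map, self.total_triples)
            )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    reporter = ProgressReporter(every=2)
    reporter.start_map("person/(customer_id)!map")
    for pk in ["a", "b", "c", "d"]:
        reporter.row_done(pk, triples_emitted=2)
```

Passing the output callable in as `log` keeps the counter independent of where the lines go (stdout, a file, or a test list).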

dachafra commented 4 years ago

@VladimirAlexiev these statistics are provided in this form because they are used to calculate dief@k and dief@t: https://github.com/maribelacosta/dief
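To illustrate why a (time, triple number) trace is useful for such metrics, here is a rough sketch that computes a dief@t-style value from the stats CSV. It assumes dief@t is the area under the trace curve up to time t and approximates it with the trapezoidal rule; `read_trace` and `area_under_trace` are hypothetical helpers, not functions from the dief package, and the sample values are made up.

```python
import csv
import io


def read_trace(csv_text):
    """Parse the stats CSV into a list of (time, triple_number) points."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(r["Time"]), int(r["Number of the triple"])) for r in reader]


def area_under_trace(points, t):
    """Trapezoidal area under the trace, using only points with time <= t."""
    pts = sorted(p for p in points if p[0] <= t)
    area = 0.0
    for (t0, n0), (t1, n1) in zip(pts, pts[1:]):
        area += (t1 - t0) * (n0 + n1) / 2.0
    return area


if __name__ == "__main__":
    sample = (
        '"Dataset","Number of the triple","Time"\n'
        '"customer","1","0.0"\n'
        '"customer","2","1.0"\n'
        '"customer","4","3.0"\n'
    )
    print(area_under_trace(read_trace(sample), t=3.0))
```

A larger area at the same t means more triples were produced earlier in the run.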

SDM-TIB / SDM-RDFizer

control logging #23