catalyst / moodle-tool_dataflows

A generic workflow and processing engine which can be configured to do a large variety of tasks.
GNU General Public License v3.0

CSV processing is slower than JSON #567

Open aspark21 opened 2 years ago

aspark21 commented 2 years ago

Just a side note from last week: I had set up a flow to dump an entire table to JSON and this worked well. However, when I had to change it to output CSV, it kept hitting the PHP timeout (30 secs), so I had to make tiny batches of 400 rows for it to succeed.

Not sure if there are any performance gains possible on the CSV output option, but that seemed noteworthy, since it makes JSON significantly more practical to use.

Back to using JSON now that we've figured out how with Snowflake, but I thought it was worth mentioning.

brendanheywood commented 2 years ago

@aspark21 was this when running via the web? This should sort you out:

https://github.com/catalyst/moodle-tool_dataflows/pull/568/files

brendanheywood commented 2 years ago

Note that that only addresses the timeouts. I've had a quick look and cannot see any reason why CSV would be significantly slower. If it were using the core CSV library it would be, but we are writing it manually.

keevan commented 2 years ago

@brendanheywood I've also had a quick look just then.

Confirming this happens during the writer_stream step. I haven't experimented enough with it to pin down the cause, but do you know if there was a particular reason Moodle core might have decided to roll its own encoding (which is where I think this implementation would have come from) instead of using PHP's native CSV helper functions?

These are known to be pretty quick and capable, so there is room for improvement if we use them here instead, assuming there's no good reason not to. e.g. fputcsv: https://www.php.net/manual/en/function.fputcsv.php
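For illustration, a minimal sketch (not the plugin's actual code) of what the fputcsv approach could look like, with `$rows` standing in for whatever the upstream step yields:

```php
<?php
// Hypothetical sketch: encode rows with PHP's native fputcsv() rather
// than assembling each line by hand. fputcsv() handles quoting and
// escaping of delimiters for us.
$handle = fopen('output.csv', 'w');
if ($handle === false) {
    throw new \RuntimeException('Could not open output.csv for writing');
}
fputcsv($handle, ['id', 'name', 'email']); // Header row.
foreach ($rows as $row) {                  // $rows: stand-in for the upstream iterator.
    fputcsv($handle, (array) $row);
}
fclose($handle);
```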

For reference, here are the classes for the two encoders:
https://github.com/catalyst/moodle-tool_dataflows/blob/MOODLE_35_STABLE/classes/local/formats/encoders/json.php
https://github.com/catalyst/moodle-tool_dataflows/blob/MOODLE_35_STABLE/classes/local/formats/encoders/csv.php

brendanheywood commented 2 years ago

Core has pulled in multiple CSV encoders on top of the native one, and they are all pretty rubbish at scale, which is what prompted https://tracker.moodle.org/browse/MDL-51603, where I pulled in the Spout library, a highly performant CSV encoder, and used it in the Dataformat API. Deprecating the crap ones in core is an epic task that no one wants to own.

https://docs.moodle.org/dev/Data_formats

The original issue in this repo was supposed to just have a 'dataformat' step, with an option inside that to use any of the enabled core dataformats:

https://github.com/catalyst/moodle-tool_dataflows/issues/23

But the plot thickens: when I originally wrote the dataformat API it was all about streaming data to the web efficiently, and only later was it retrofitted to be able to stream to a file on disk. I can't actually remember what the issue was, but @jaypha ran into trouble using that API. Ideally we should go in that direction. I can't find the comments around that.

From my sketchy memory, fputcsv is probably performant when streaming to disk, but it may not be ideal when streaming to the web.

jaypha commented 2 years ago

The problem was that it forced either writing to disk or outputting to the web. It wasn't capable of simply encoding data -> data.
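For what it's worth, data -> data encoding is possible with the native functions. A minimal sketch (not the Dataformat API, which lacks this mode) that encodes a single row to a string through an in-memory stream:

```php
<?php
// Hypothetical sketch: encode one row to a CSV string (data -> data)
// by routing fputcsv() through an in-memory stream.
function encode_csv_row(array $fields): string {
    $buffer = fopen('php://memory', 'r+');
    fputcsv($buffer, $fields);        // Write the encoded line into memory.
    rewind($buffer);                  // Seek back to read what was written.
    $line = stream_get_contents($buffer);
    fclose($buffer);
    return $line;
}
```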

brendanheywood commented 2 years ago

@jaypha I'm confused, the whole point of this step is to take the stream of data and encode it onto disk; why would we want data -> data?

jaypha commented 2 years ago

It is done one row at a time, not all at once, and the step is about passing the data on to the next step. It is not assumed that it is to be output to a file.

Here is the original conversation. https://github.com/catalyst/moodle-tool_dataflows/pull/138#discussion_r884337715

jaypha commented 2 years ago

I checked the code, and it is designed to be able to send to an arbitrary stream, as well as the next step downstream.

brendanheywood commented 2 years ago

I think we could have gotten that to work: we produce an iterator and the dataformats consume an iterator, so it should just be wiring code.

Even if and when we refactor this to be event-based instead of iterator-based, I think we can still make this work using a generator:

https://www.php.net/manual/en/language.generators.overview.php
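As a minimal sketch of that idea (hypothetical names, not the plugin's API): a generator consumed via send() behaves like a push-style sink, so an event-based flow could still feed an encoder one row at a time:

```php
<?php
// Hypothetical sketch: a push-style CSV sink built from a generator.
function csv_sink($handle): \Generator {
    while (true) {
        $row = yield;            // Receive the next row pushed via send().
        fputcsv($handle, $row);
    }
}

$sink = csv_sink(fopen('output.csv', 'w'));
$sink->current();                // Prime the generator to the first yield.
$sink->send(['id', 'name']);     // Each event pushes one row in.
$sink->send([1, 'Ada']);
```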

I think for now let's just see if there is any low-hanging fruit to speed this up, and then defer re-investigating a dataformat-based step until after the refactor (if that happens).

brendanheywood commented 2 years ago

> I checked the code, and it is designed to be able to send to an arbitrary stream, as well as the next step downstream.

I would not expect it to pass on the encoded data to the next step; it would pass on the data it was given unchanged and encode it to disk as a side effect.
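A minimal sketch of that shape (hypothetical, not the plugin's actual step API): the step yields each row downstream unchanged and encodes it as a side effect:

```php
<?php
// Hypothetical sketch: pass rows through unchanged while encoding
// them to a stream as a side effect.
function writer_step(iterable $rows, $handle): \Generator {
    foreach ($rows as $row) {
        fputcsv($handle, (array) $row); // Side effect: encode to the stream.
        yield $row;                     // Pass the original row on unchanged.
    }
}
```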