Open DavidUnderdown opened 7 years ago
@DavidUnderdown On the topic of performance, there are a couple of things that I started but did not have time to finish, which could help:
Schema parsing could be improved by switching to a faster parser. We started with Scala's Parser Combinators because they are very easy to use and to create from an EBNF, and we later integrated Packrat Parsing to improve performance. If a little profiling confirmed that the Scala Parser Combinators themselves are what makes the CSV Schema Parser slow, then we could look at switching to either:
Something that previously concerned me was how support for the different Schema versions was implemented: there is much duplicated code, and I think that could also have introduced performance issues. The performance of the CSV Validator when it supported only Schema 1.0 should be compared with that of the latest CSV Validator to determine whether there was a slowdown.
MetaDataValidator#countRows, which counts the number of rows in the CSV file, could probably be made much more efficient: it could read large blocks and then scan for <CR><LF> / <LF>.
At the very least, the ProgressFor class, which provides the status feedback, could easily be extended with a mode flag that tells the user which phase is being executed, e.g. Parsing Schema / Determining Row Count / Processing CSV.
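As a rough illustration of the mode flag idea, here is a minimal Java sketch. It is not the validator's actual `ProgressFor` API: the `PhaseProgress` class, the `Phase` enum, and the `statusText` and `percentOf` helpers are all hypothetical names invented for this example. It shows how a phase could travel with each progress update, and how the status text could fall back to a plain label when no percentage is known (as during schema parsing).

```java
import java.util.OptionalInt;

public class PhaseProgress {

    /** Hypothetical phases, named after the examples given in the comment above. */
    public enum Phase {
        PARSING_SCHEMA("Parsing Schema"),
        DETERMINING_ROW_COUNT("Determining Row Count"),
        PROCESSING_CSV("Processing CSV");

        private final String label;

        Phase(String label) {
            this.label = label;
        }

        public String label() {
            return label;
        }
    }

    /** Builds the status text; a percentage is shown only when one is known. */
    public static String statusText(Phase phase, OptionalInt percent) {
        return percent.isPresent()
                ? phase.label() + ": " + percent.getAsInt() + "%"
                : phase.label() + "...";
    }

    /** Percentage of rows processed, or empty when the total is not yet known. */
    public static OptionalInt percentOf(long rowsDone, long totalRows) {
        if (totalRows <= 0) {
            return OptionalInt.empty();
        }
        return OptionalInt.of((int) Math.min(100, rowsDone * 100 / totalRows));
    }
}
```

A UI could pair the empty-percentage case with an indeterminate progress bar, so the user sees activity even when the amount of remaining work cannot be estimated.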
I also started a branch to drastically speed up the processing of the CSV file itself by switching from OpenCSV to the Jackson CSV Parser. It is almost complete, but I never found a moment to go back and finish it. If someone wants to continue with it (before I get around to it), it is available here: https://github.com/adamretter/csv-validator/tree/csv-parser
Thanks Adam. Re point 3, we have to be a little careful: a line break is permitted within a quote-wrapped field, so the number of line breaks doesn't necessarily equal the number of CSV rows.
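To make the caveat concrete, here is a minimal Java sketch of a block-based row count that ignores line breaks inside quote-wrapped fields. It is not the existing `MetaDataValidator#countRows` implementation: the `QuoteAwareRowCounter` class and its methods are hypothetical. It assumes `"` is the quote character, that an embedded quote is escaped by doubling it (`""`), and that rows end with `<CR><LF>` or `<LF>`. It counts every record, including any header row.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class QuoteAwareRowCounter {

    /** Counts CSV records, reading the input in large blocks. */
    public static long countRows(Reader in) throws IOException {
        char[] buf = new char[64 * 1024];
        long rows = 0;
        boolean inQuotes = false;   // true while inside a quote-wrapped field
        boolean rowHasData = false; // true if the current row has any content
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                if (c == '"') {
                    // A doubled quote ("") toggles twice, leaving the state unchanged.
                    inQuotes = !inQuotes;
                    rowHasData = true;
                } else if (c == '\n' && !inQuotes) {
                    // Only an unquoted line feed terminates a row.
                    rows++;
                    rowHasData = false;
                } else if (c != '\r' || inQuotes) {
                    rowHasData = true;
                }
            }
        }
        // Count a final row that has no trailing line break.
        return rowHasData ? rows + 1 : rows;
    }

    /** Convenience overload for in-memory input. */
    public static long countRows(String csv) {
        try {
            return countRows(new StringReader(csv));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the quote state is carried across buffer refills, a quoted field that straddles two blocks is still handled correctly. The scan remains a single pass with no per-row allocation.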
The parsing side is already covered in #127. Taking this issue forward, we will look only at amendments to the status bar, so that it communicates more about what work the validator is doing.
It can take a significant length of time (over 5 minutes) to parse complex schemas and begin actual validation. This phase may also include an initial pass through the CSV file to calculate the number of lines to be processed (the longest times for this phase have been observed while processing a CSV file with more than 1.4 million lines). At this point the status bar does not show anything, and there is no indication that anything is happening other than a spinning cursor, so it can appear that the validator has hung. It would be useful if the status bar showed something, perhaps the text "parsing schema file", and if possible a moving progress bar, perhaps in a different colour from the one used for CSV file processing (this may be problematic, as it is probably difficult to work out what percentage of the task has been completed).