digital-preservation / csv-validator

CSV Validation Tool and API (CSV Schema RI)
http://digital-preservation.github.io/csv-validator
Mozilla Public License 2.0

Parsing the schema before beginning data validation can take a significant length of time with a complex schema; the status bar should give more info on that stage #141

Open DavidUnderdown opened 7 years ago

DavidUnderdown commented 7 years ago

It can take a significant length of time (over 5 minutes) to parse complex schemas and begin actual validation. This phase may also include an initial pass through the CSV file to calculate the number of lines to be processed (the longest times for this phase have been observed while processing a CSV file with >1.4 million lines). At this point the status bar does not show anything, and there is no indicator that anything is happening other than a spinning cursor, so it can appear that the validator has hung. It would be useful if the status bar showed something, perhaps the text "parsing schema file", and if possible a moving progress bar, maybe in a different colour to that used for CSV file processing (this may be problematic, as it is probably difficult to work out what percentage of the task has been completed).

adamretter commented 7 years ago

@DavidUnderdown On the topic of performance, there are a couple of things that I started but did not have time to finish which could help:

  1. Schema Parsing could be improved by switching to a faster parser. We started with Scala's Parser Combinators as they are very easy to use and to create from an EBNF. We later integrated Packrat Parsing to improve performance. If, with a little profiling, we determined that the speed issue in the CSV Schema Parser really is the Scala Parser Combinators, then we could look at switching to either:

    1. Parboiled, which is much faster than Scala's Parser Combinators, and whose DSL still allows you to remain close to the EBNF (see the parser sketch after this list). I actually started on a branch of CSV Validator which did just this, but never had time to complete it.
    2. Antlr 4 which is pretty much the leading parser framework for Java these days. It has its own unique syntax which is further away from (but not completely different to) EBNF.
  2. Something which previously concerned me was how the support for different Schema versions was implemented: there is much duplicated code, and I think that could also have introduced performance issues. The performance of the CSV Validator when it supported just Schema 1.0 should be compared with that of the latest CSV Validator to determine whether there has been a slow-down.

  3. MetaDataValidator#countRows, which counts the number of rows in the CSV file, could probably be optimised to be much more efficient: it could read large blocks and then scan them for <CR><LF> / <LF> (see the counting sketch after this list).

  4. At the very least, the ProgressFor class which provides the status feedback could easily be extended with a mode flag that tells the user which phase is being executed, e.g. Parsing Schema / Determining Row Count / Processing CSV (see the phase sketch after this list).

  5. I also started a branch to drastically speed up the processing of the CSV file itself by switching from OpenCSV to the Jackson CSV Parser; it is almost complete... but I never found a moment to go back and finish it (see the Jackson sketch after this list). If someone wants to continue with it (before I get around to it), it is available here: https://github.com/adamretter/csv-validator/tree/csv-parser
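
Regarding the Parboiled option in point 1, a minimal parboiled2 sketch of the kind of switch being suggested. The grammar fragment here (just a `version` declaration) and the class and rule names are made up for illustration and are not the real CSV Schema grammar:

```scala
import org.parboiled2._
import scala.util.{Failure, Success}

// Hypothetical grammar fragment: only a `version` declaration.
class MiniSchemaParser(val input: ParserInput) extends Parser {
  def VersionDecl = rule { "version " ~ capture(oneOrMore(anyOf("0123456789."))) ~ EOI }
}

object MiniSchemaParserDemo extends App {
  val parser = new MiniSchemaParser("version 1.1")
  parser.VersionDecl.run() match {
    case Success(version)       => println(s"parsed schema version: $version")
    case Failure(e: ParseError) => println(parser.formatError(e))
    case Failure(other)         => throw other
  }
}
```

The rules stay close to the EBNF, but parboiled2 compiles them (via macros) into much faster code than Scala's Parser Combinators produce.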
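
For point 3, a naive counting sketch; the object/method names and block size are illustrative only, and (as the next comment notes) counting raw line breaks over-counts when quoted fields contain embedded line breaks:

```scala
import java.io.{BufferedInputStream, FileInputStream}

object CountLineBreaks {
  // Reads the file in large blocks and counts <LF> bytes, which covers both
  // <CR><LF> and bare <LF> line endings.
  def countLineBreaks(path: String, blockSize: Int = 1 << 20): Long = {
    val in = new BufferedInputStream(new FileInputStream(path), blockSize)
    try {
      val buf = new Array[Byte](blockSize)
      var count = 0L
      var read = in.read(buf)
      while (read != -1) {
        var i = 0
        while (i < read) {
          if (buf(i) == '\n'.toByte) count += 1
          i += 1
        }
        read = in.read(buf)
      }
      count
    } finally in.close()
  }
}
```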
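
For point 4, a sketch of what a phase-aware progress callback might look like; the trait and type names are hypothetical and do not reflect the actual ProgressFor API:

```scala
// Hypothetical shape only; the real ProgressFor class may look quite different.
sealed trait ValidationPhase
case object ParsingSchema       extends ValidationPhase
case object DeterminingRowCount extends ValidationPhase
case object ProcessingCsv       extends ValidationPhase

trait PhaseAwareProgress {
  // percentComplete is None while a phase's total work is unknown (e.g. schema parsing)
  def update(phase: ValidationPhase, percentComplete: Option[Int]): Unit
}

class ConsoleProgress extends PhaseAwareProgress {
  override def update(phase: ValidationPhase, percentComplete: Option[Int]): Unit = {
    val label = phase match {
      case ParsingSchema       => "Parsing Schema"
      case DeterminingRowCount => "Determining Row Count"
      case ProcessingCsv       => "Processing CSV"
    }
    println(label + percentComplete.map(p => s": $p%").getOrElse(" ..."))
  }
}
```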
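
For point 5, a minimal sketch of reading rows with jackson-dataformat-csv; the file name is a placeholder, and a plain map per row stands in for however the validator would consume records:

```scala
import com.fasterxml.jackson.dataformat.csv.{CsvMapper, CsvSchema}
import java.io.File
import java.util.{Map => JMap}

object JacksonCsvSketch extends App {
  val mapper = new CsvMapper()
  // Take column names from the file's header row.
  val schema = CsvSchema.emptySchema().withHeader()

  val rows = mapper
    .readerFor(classOf[JMap[String, String]])
    .`with`(schema)
    .readValues[JMap[String, String]](new File("data.csv"))

  while (rows.hasNext) {
    val row = rows.next() // one record, as a column-name -> value map
    println(row)          // a real validator would apply its per-column rules here
  }
  rows.close()
}
```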

DavidUnderdown commented 7 years ago

Thanks Adam. Re point 3, we have to be a little careful, as it is permitted to have a line break within a quote-wrapped field, so the number of line breaks does not necessarily equate to the number of rows of CSV.
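
A quote-aware variant of the counting sketch above illustrates the point: a line feed only ends a row when it falls outside a double-quoted field (assuming RFC 4180 style quoting, where an escaped "" toggles the state twice and so nets out):

```scala
import java.io.{BufferedInputStream, FileInputStream}

object CountCsvRows {
  // A final row with no trailing line break is not counted here; a real
  // implementation would also need to handle that case.
  def countCsvRows(path: String): Long = {
    val in = new BufferedInputStream(new FileInputStream(path))
    try {
      var inQuotes = false
      var rows = 0L
      var b = in.read()
      while (b != -1) {
        if (b == '"') inQuotes = !inQuotes
        else if (b == '\n' && !inQuotes) rows += 1
        b = in.read()
      }
      rows
    } finally in.close()
  }
}
```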

alexgreenDP commented 7 years ago

The parsing side is already covered in #127. Taking this issue forward, we will look just at amendments to the status bar, to give a bit more communication about what work is being done by the validator.