chop-dbhi / data-models-validator

Set of tools for validating data that should conform to a data model.
1 stars 2 forks source link

Attempt to solve quote escape checking #16

Closed gracebrownecodes closed 8 years ago

gracebrownecodes commented 8 years ago

Add testing file from i2b2_pedsnet 2.0.0 table i2b2. Turn off LazyQuotes and remove EscapedQuotesValidator field validation. Catch csv.Read errors. These errors cannot be attributed to a specific field, since the line failed to parse, so invent a "parseError" field to associate errors with. Unfortunately, when you run on the test file, it reports 6 errors and gets the line number totally wrong...

I believe further inspection of the csv.Read errors and perhaps manual advancement of the underlying io.Reader might be necessary. The csv reader.go source code is here which should help us determine what needs to be done.

Signed-off-by: Aaron Browne aaron0browne@gmail.com

gracebrownecodes commented 8 years ago

Not done but wanted to push before leaving. Any help would be welcome.

gracebrownecodes commented 8 years ago

Improved a little by filtering out the csv.ErrFieldCounts, which obviously come up when there is a csv.ParseError. But I have no idea how to avoid the line count weirdness. Here's what the output looks like right now:

$ data-models-validator -model i2b2_pedsnet -version 2.0.0 i2b2.csv
Validating against model 'i2b2_pedsnet/2.0.0'
* Evaluating 'i2b2' table in 'i2b2.csv'...
* A few issues were found
+---------------------+------+--------------------------------+-------------+--------------------------------+
|        FIELD        | CODE |             ERROR              | OCCURRENCES |            SAMPLES             |
+---------------------+------+--------------------------------+-------------+--------------------------------+
| unknown: parseError |  203 | Value contains bare double     |           5 | 3:'':(line:3, column:11)       |
|                     |      | quotes (")                     |             | 4:'':(line:4, column:137)      |
|                     |      |                                |             | 4:'':(line:4, column:137)      |
|                     |      |                                |             | 6:'':(line:6, column:139)      |
|                     |      |                                |             | 3:'':(line:3, column:11)       |
+---------------------+------+--------------------------------+-------------+--------------------------------+

Could definitely use your help, @bruth.

bruth commented 8 years ago

Here is the new output for line-level errors:

% data-models-validator -model i2b2_pedsnet -version 2.0.0 i2b2.csv
Validating against model 'i2b2_pedsnet/2.0.0'
* Evaluating 'i2b2' table in 'i2b2.csv'...
* Row-level issues were found.
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
| CODE |             ERROR              | OCCURRENCES | LINES |                                                            EXAMPLE                                                             |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
|  203 | Value contains bare double     | 1           | 2     | 2:'"3","\PCORI\VITAL\TOBACCO\SMOKING\","Smoked                                                                                 |
|      | quotes (")                     |             |       | Tobacco","N","FAE",,,,"concept_cd","CONCEPT_DIMENSION","concept_path","T","like","\PCORI\VITAL\TOBACCO\SMOKING\","CDMv2","This |
|      |                                |             |       | field is new to v3.0. Indicator for any form of tobacco that is smoked.Per Meaningful Use guidance, smoking                    |
|      |                                |             |       | status includes any form of tobacco that is smoked, but not all tobacco use. "Light smoker" is interpreted to mean             |
|      |                                |             |       | less than 10 cigarettes per day, or an equivalent (but less concretely defined) quantity of cigar or pipe smoke.               |
|      |                                |             |       | "Heavy smoker" is interpreted to mean greater than 10 cigarettes per day or an equivalent (but less concretely                 |
|      |                                |             |       | defined) quantity of cigar or pipe smoke. ","@","2015-08-20 12:14:14.0","2015-08-20 12:14:14.0","2015-08-20                    |
|      |                                |             |       | 12:14:14.0","PCORNET_CDM",,,"\PCORI\VITAL\TOBACCO\","SMOKING"':(column:356)                                                    |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
gracebrownecodes commented 8 years ago

When run on the test file I just added (which adds a line with properly escaped quotes and a mis-formatted date), the output is wrong:

$ data-models-validator -model i2b2_pedsnet -version 2.0.0 i2b2.csv
Validating against model 'i2b2_pedsnet/2.0.0'
* Evaluating 'i2b2' table in 'i2b2.csv'...
* Row-level issues were found.
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
| CODE |             ERROR              | OCCURRENCES | LINES |                                                            EXAMPLE                                                             |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
|  203 | Value contains bare double     |           2 | 2-3   | 2:'"3","\PCORI\VITAL\TOBACCO\SMOKING\","Smoked                                                                                 |
|      | quotes (")                     |             |       | Tobacco","N","FAE",,,,"concept_cd","CONCEPT_DIMENSION","concept_path","T","like","\PCORI\VITAL\TOBACCO\SMOKING\","CDMv2","This |
|      |                                |             |       | field is new to v3.0. Indicator for any form of tobacco that is smoked.Per Meaningful Use guidance, smoking                    |
|      |                                |             |       | status includes any form of tobacco that is smoked, but not all tobacco use. "Light smoker" is interpreted to mean             |
|      |                                |             |       | less than 10 cigarettes per day, or an equivalent (but less concretely defined) quantity of cigar or pipe smoke.               |
|      |                                |             |       | "Heavy smoker" is interpreted to mean greater than 10 cigarettes per day or an equivalent (but less concretely                 |
|      |                                |             |       | defined) quantity of cigar or pipe smoke. ","@","2015-08-20 12:14:14.0","2015-08-20 12:14:14.0","2015-08-20                    |
|      |                                |             |       | 12:14:14.0","PCORNET_CDM",,,"\PCORI\VITAL\TOBACCO\","SMOKING"':{column = 356}                                                  |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+

It should only log one quote error and then an additional datetime type mismatch error on line 3.

Also, it seems the line error {column = 356} is an index that starts at 0, but I would suspect most users will expect a "column" to start at 1, no? I'm saying maybe we should ++ it before displaying to users.

gracebrownecodes commented 8 years ago
$ data-models-validator -model i2b2_pedsnet -version 2.0.0 i2b2.csv
Validating against model 'i2b2_pedsnet/2.0.0'
* Evaluating 'i2b2' table in 'i2b2.csv'...
* Row-level issues were found.
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
| CODE |             ERROR              | OCCURRENCES | LINES |                                                            EXAMPLE                                                             |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
|  203 | Value contains bare double     |           1 |     1 | 1:'"3","\PCORI\VITAL\TOBACCO\SMOKING\","Smoked                                                                                 |
|      | quotes (")                     |             |       | Tobacco","N","FAE",,,,"concept_cd","CONCEPT_DIMENSION","concept_path","T","like","\PCORI\VITAL\TOBACCO\SMOKING\","CDMv2","This |
|      |                                |             |       | field is new to v3.0. Indicator for any form of tobacco that is smoked.Per Meaningful Use guidance, smoking                    |
|      |                                |             |       | status includes any form of tobacco that is smoked, but not all tobacco use. "Light smoker" is interpreted to mean             |
|      |                                |             |       | less than 10 cigarettes per day, or an equivalent (but less concretely defined) quantity of cigar or pipe smoke.               |
|      |                                |             |       | "Heavy smoker" is interpreted to mean greater than 10 cigarettes per day or an equivalent (but less concretely                 |
|      |                                |             |       | defined) quantity of cigar or pipe smoke. ","@","2015-08-20 12:14:14.0","2015-08-20 12:14:14.0","2015-08-20                    |
|      |                                |             |       | 12:14:14.0","PCORNET_CDM",,,"\PCORI\VITAL\TOBACCO\","SMOKING"':{column = 356}                                                  |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
* Field-level issues were found.
+-------------+------+--------------------------------+-------------+--------------------------------+
|    FIELD    | CODE |             ERROR              | OCCURRENCES |            SAMPLES             |
+-------------+------+--------------------------------+-------------+--------------------------------+
| update_date |  308 | Value is not a datetime        |           1 | line 1: `2015-08-20            |
|             |      | (YYYY-MM-DD HH:MM:SS)          |             | 312:14:14.0`                   |
+-------------+------+--------------------------------+-------------+--------------------------------+
bruth commented 8 years ago

I move the parsing to a standalone function. My one concern is that initializing a reader millions of times may producer a lot of garbage, but it should be deallocated pretty quickly.

bruth commented 8 years ago

Other than trying it on a large file (and watching activity monitor to see memory usage), I am satisfied with the changes.

gracebrownecodes commented 8 years ago

OK, but the line numbers are still wrong...

gracebrownecodes commented 8 years ago

Running on CHOP's most recent measurement table, which is 25G. The %CPU was not always that high and I did not see it go higher. The VIRT, RES, SHR and %MEM stayed exactly where they are in the below snapshot as far as I could see. It's actually still running so I can get a better idea of timing for sites.

top - 16:19:12 up 4 days, 20:59,  3 users,  load average: 2.73, 2.25, 2.40
Tasks: 497 total,   2 running, 495 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.6 us,  1.0 sy,  0.0 ni, 91.3 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13181524+total,   433992 free,  3002848 used, 12837840+buff/cache
KiB Swap: 67108860 total, 66727364 free,   381496 used. 96120320 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13479 brownea   20   0   16704   9128   3032 S 166.9  0.0   0:57.61 data-models-val
46405 polkitd   20   0 30.960g 0.030t 0.030t R  93.7 24.5 279:35.92 postgres
42655 polkitd   20   0 30.793g 0.030t 0.030t S  11.3 24.2  26:32.39 postgres
42657 polkitd   20   0 30.793g  18240  17160 S   1.3  0.0   5:29.96 postgres
   58 root      20   0       0      0      0 S   0.3  0.0   5:30.17 rcu_sched
   64 root      20   0       0      0      0 S   0.3  0.0   0:20.03 rcuos/5
   73 root      20   0       0      0      0 S   0.3  0.0   0:23.45 rcuos/14
   79 root      20   0       0      0      0 S   0.3  0.0   0:16.16 rcuos/20
  279 root      20   0       0      0      0 S   0.3  0.0  83:48.88 kswapd0
 1513 root      20   0  985808  20456   2772 S   0.3  0.0   3:58.52 python
13472 brownea   20   0  156840   2520   1468 R   0.3  0.0   0:00.59 top
    1 root      20   0  193360   3748   2120 S   0.0  0.0   0:27.64 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.59 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:03.18 ksoftirqd/0
    4 root      20   0       0      0      0 S   0.0  0.0   0:10.43 kworker/0:0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    8 root      rt   0       0      0      0 S   0.0  0.0   0:00.62 migration/0
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
   10 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/0
   11 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/1
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/2
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/3
   14 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/4
   15 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/5
   16 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/6
   17 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/7
   18 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/8
   19 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/9
   20 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/10
   21 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/11
   22 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/12
   23 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/13
   24 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/14
   25 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/15
   26 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/16
   27 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/17
   28 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/18
   29 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/19
   30 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/20
   31 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/21
   32 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/22
   33 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/23