Closed gracebrownecodes closed 8 years ago
Not done but wanted to push before leaving. Any help would be welcome.
Improved a little by filtering out the csv.ErrFieldCount
s, which obviously come up when there is a csv.ParseError
. But I have no idea how to avoid the line count weirdness. Here's what the output looks like right now:
$ data-models-validator -model i2b2_pedsnet -version 2.0.0 i2b2.csv
Validating against model 'i2b2_pedsnet/2.0.0'
* Evaluating 'i2b2' table in 'i2b2.csv'...
* A few issues were found
+---------------------+------+--------------------------------+-------------+--------------------------------+
| FIELD | CODE | ERROR | OCCURRENCES | SAMPLES |
+---------------------+------+--------------------------------+-------------+--------------------------------+
| unknown: parseError | 203 | Value contains bare double | 5 | 3:'':(line:3, column:11) |
| | | quotes (") | | 4:'':(line:4, column:137) |
| | | | | 4:'':(line:4, column:137) |
| | | | | 6:'':(line:6, column:139) |
| | | | | 3:'':(line:3, column:11) |
+---------------------+------+--------------------------------+-------------+--------------------------------+
Could definitely use your help, @bruth.
Here is the new output for line-level errors:
% data-models-validator -model i2b2_pedsnet -version 2.0.0 i2b2.csv
Validating against model 'i2b2_pedsnet/2.0.0'
* Evaluating 'i2b2' table in 'i2b2.csv'...
* Row-level issues were found.
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
| CODE | ERROR | OCCURRENCES | LINES | EXAMPLE |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
| 203 | Value contains bare double | 1 | 2 | 2:'"3","\PCORI\VITAL\TOBACCO\SMOKING\","Smoked |
| | quotes (") | | | Tobacco","N","FAE",,,,"concept_cd","CONCEPT_DIMENSION","concept_path","T","like","\PCORI\VITAL\TOBACCO\SMOKING\","CDMv2","This |
| | | | | field is new to v3.0. Indicator for any form of tobacco that is smoked.Per Meaningful Use guidance, smoking |
| | | | | status includes any form of tobacco that is smoked, but not all tobacco use. "Light smoker" is interpreted to mean |
| | | | | less than 10 cigarettes per day, or an equivalent (but less concretely defined) quantity of cigar or pipe smoke. |
| | | | | "Heavy smoker" is interpreted to mean greater than 10 cigarettes per day or an equivalent (but less concretely |
| | | | | defined) quantity of cigar or pipe smoke. ","@","2015-08-20 12:14:14.0","2015-08-20 12:14:14.0","2015-08-20 |
| | | | | 12:14:14.0","PCORNET_CDM",,,"\PCORI\VITAL\TOBACCO\","SMOKING"':(column:356) |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
When run on the test file I just added (which adds a line with properly escaped quotes and a mis-formatted date), the output is wrong:
$ data-models-validator -model i2b2_pedsnet -version 2.0.0 i2b2.csv
Validating against model 'i2b2_pedsnet/2.0.0'
* Evaluating 'i2b2' table in 'i2b2.csv'...
* Row-level issues were found.
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
| CODE | ERROR | OCCURRENCES | LINES | EXAMPLE |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
| 203 | Value contains bare double | 2 | 2-3 | 2:'"3","\PCORI\VITAL\TOBACCO\SMOKING\","Smoked |
| | quotes (") | | | Tobacco","N","FAE",,,,"concept_cd","CONCEPT_DIMENSION","concept_path","T","like","\PCORI\VITAL\TOBACCO\SMOKING\","CDMv2","This |
| | | | | field is new to v3.0. Indicator for any form of tobacco that is smoked.Per Meaningful Use guidance, smoking |
| | | | | status includes any form of tobacco that is smoked, but not all tobacco use. "Light smoker" is interpreted to mean |
| | | | | less than 10 cigarettes per day, or an equivalent (but less concretely defined) quantity of cigar or pipe smoke. |
| | | | | "Heavy smoker" is interpreted to mean greater than 10 cigarettes per day or an equivalent (but less concretely |
| | | | | defined) quantity of cigar or pipe smoke. ","@","2015-08-20 12:14:14.0","2015-08-20 12:14:14.0","2015-08-20 |
| | | | | 12:14:14.0","PCORNET_CDM",,,"\PCORI\VITAL\TOBACCO\","SMOKING"':{column = 356} |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
It should only log one quote error and then an additional datetime type mismatch error on line 3.
Also, it seems the line error {column = 356}
is an index that starts at 0, but I would suspect most users will expect a "column" to start at 1, no? I'm saying maybe we should ++
it before displaying to users.
$ data-models-validator -model i2b2_pedsnet -version 2.0.0 i2b2.csv
Validating against model 'i2b2_pedsnet/2.0.0'
* Evaluating 'i2b2' table in 'i2b2.csv'...
* Row-level issues were found.
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
| CODE | ERROR | OCCURRENCES | LINES | EXAMPLE |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
| 203 | Value contains bare double | 1 | 1 | 1:'"3","\PCORI\VITAL\TOBACCO\SMOKING\","Smoked |
| | quotes (") | | | Tobacco","N","FAE",,,,"concept_cd","CONCEPT_DIMENSION","concept_path","T","like","\PCORI\VITAL\TOBACCO\SMOKING\","CDMv2","This |
| | | | | field is new to v3.0. Indicator for any form of tobacco that is smoked.Per Meaningful Use guidance, smoking |
| | | | | status includes any form of tobacco that is smoked, but not all tobacco use. "Light smoker" is interpreted to mean |
| | | | | less than 10 cigarettes per day, or an equivalent (but less concretely defined) quantity of cigar or pipe smoke. |
| | | | | "Heavy smoker" is interpreted to mean greater than 10 cigarettes per day or an equivalent (but less concretely |
| | | | | defined) quantity of cigar or pipe smoke. ","@","2015-08-20 12:14:14.0","2015-08-20 12:14:14.0","2015-08-20 |
| | | | | 12:14:14.0","PCORNET_CDM",,,"\PCORI\VITAL\TOBACCO\","SMOKING"':{column = 356} |
+------+--------------------------------+-------------+-------+--------------------------------------------------------------------------------------------------------------------------------+
* Field-level issues were found.
+-------------+------+--------------------------------+-------------+--------------------------------+
| FIELD | CODE | ERROR | OCCURRENCES | SAMPLES |
+-------------+------+--------------------------------+-------------+--------------------------------+
| update_date | 308 | Value is not a datetime | 1 | line 1: `2015-08-20 |
| | | (YYYY-MM-DD HH:MM:SS) | | 312:14:14.0` |
+-------------+------+--------------------------------+-------------+--------------------------------+
I move the parsing to a standalone function. My one concern is that initializing a reader millions of times may producer a lot of garbage, but it should be deallocated pretty quickly.
Other than trying it on a large file (and watching activity monitor to see memory usage), I am satisfied with the changes.
OK, but the line numbers are still wrong...
Running on CHOP's most recent measurement table, which is 25G. The %CPU was not always that high and I did not see it go higher. The VIRT, RES, SHR and %MEM stayed exactly where they are in the below snapshot as far as I could see. It's actually still running so I can get a better idea of timing for sites.
top - 16:19:12 up 4 days, 20:59, 3 users, load average: 2.73, 2.25, 2.40
Tasks: 497 total, 2 running, 495 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.6 us, 1.0 sy, 0.0 ni, 91.3 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13181524+total, 433992 free, 3002848 used, 12837840+buff/cache
KiB Swap: 67108860 total, 66727364 free, 381496 used. 96120320 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13479 brownea 20 0 16704 9128 3032 S 166.9 0.0 0:57.61 data-models-val
46405 polkitd 20 0 30.960g 0.030t 0.030t R 93.7 24.5 279:35.92 postgres
42655 polkitd 20 0 30.793g 0.030t 0.030t S 11.3 24.2 26:32.39 postgres
42657 polkitd 20 0 30.793g 18240 17160 S 1.3 0.0 5:29.96 postgres
58 root 20 0 0 0 0 S 0.3 0.0 5:30.17 rcu_sched
64 root 20 0 0 0 0 S 0.3 0.0 0:20.03 rcuos/5
73 root 20 0 0 0 0 S 0.3 0.0 0:23.45 rcuos/14
79 root 20 0 0 0 0 S 0.3 0.0 0:16.16 rcuos/20
279 root 20 0 0 0 0 S 0.3 0.0 83:48.88 kswapd0
1513 root 20 0 985808 20456 2772 S 0.3 0.0 3:58.52 python
13472 brownea 20 0 156840 2520 1468 R 0.3 0.0 0:00.59 top
1 root 20 0 193360 3748 2120 S 0.0 0.0 0:27.64 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.59 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:03.18 ksoftirqd/0
4 root 20 0 0 0 0 S 0.0 0.0 0:10.43 kworker/0:0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
8 root rt 0 0 0 0 S 0.0 0.0 0:00.62 migration/0
9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/0
11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/1
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/2
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/3
14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/4
15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/5
16 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/6
17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/7
18 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/8
19 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/9
20 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/10
21 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/11
22 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/12
23 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/13
24 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/14
25 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/15
26 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/16
27 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/17
28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/18
29 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/19
30 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/20
31 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/21
32 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/22
33 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/23
Add testing file from i2b2_pedsnet 2.0.0 table i2b2. Turn off LazyQuotes and remove EscapedQuotesValidator field validation. Catch csv.Read errors. These errors cannot be attributed to a specific field, since the line failed to parse, so invent a "parseError" field to associate errors with. Unfortunately, when you run on the test file, it reports 6 errors and gets the line number totally wrong...
I believe further inspection of the csv.Read errors and perhaps manual advancement of the underlying io.Reader might be necessary. The csv reader.go source code is here which should help us determine what needs to be done.
Signed-off-by: Aaron Browne aaron0browne@gmail.com