parser should be able to ignore quotation marks

esonderegger / fecfile

a python parser for the .fec file format

https://esonderegger.github.io/fecfile/

Apache License 2.0


parser should be able to ignore quotation marks #15

Closed esonderegger closed 4 years ago

esonderegger commented 5 years ago

In some filings, fields are enclosed in quotation marks even though they don't need to be. That means the parser sees values like "4247.66" and says "that doesn't look like a number to me".

I think if a value that is supposed to be numeric begins and ends with " after we call strip() on it, then we should try again with value[1:-1]
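A minimal sketch of that retry, assuming numeric fields should end up as floats. The helper name parse_number is hypothetical and used here only for illustration; it is not a function from fecfile.

```python
def parse_number(raw):
    """Convert a raw field to float, tolerating one pair of surrounding quotes.

    Hypothetical helper for illustration; not part of fecfile's API.
    """
    value = raw.strip()
    try:
        return float(value)
    except ValueError:
        # Retry only when the stripped value begins and ends with a double quote.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            return float(value[1:-1].strip())
        # Anything else is still not a number, so let the error propagate.
        raise


print(parse_number(' "4247.66" '))
print(parse_number('4247.66'))
```

Trying the plain conversion first keeps the common unquoted case on the fast path, and values that are not numeric even after unquoting still raise ValueError.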

hodgesmr commented 5 years ago

Another idea could be value extraction with literal_eval:

Safely evaluate an expression node or a string containing a Python literal or container display. The string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None.

This can be used for safely evaluating strings containing Python values from untrusted sources without the need to parse the values oneself. It is not capable of evaluating arbitrarily complex expressions, for example involving operators or indexing.

from ast import literal_eval

tests = ["-13.2", "15.4", "8", "9.0", "10.", "8.22"]

for test in tests:
    val = literal_eval(test)
    print('-----')
    print(val)
    print(type(val))

-----
-13.2
<class 'float'>
-----
15.4
<class 'float'>
-----
8
<class 'int'>
-----
9.0
<class 'float'>
-----
10.0
<class 'float'>
-----
8.22
<class 'float'>

hodgesmr commented 5 years ago

Is there an example file that produces this error? It looks like this commit attempts to implement the solution you mentioned?

esonderegger commented 4 years ago

I'm sorry that it's taken me so long to reply to you about this! I believe this was the filing that caused me to write the ticket: https://docquery.fec.gov/dcdev/posted/1157513.fec

Thank you for the tip about literal_eval! Unfortunately, in this case I don't think it would help much, because our initial value is a string that is almost never quoted but every now and then is.

So in this case, the test would be more like:

tests = ['123.45', '"56.78"', '"HDR"', 'HDR']

-----
123.45
<class 'float'>
-----
56.78
<class 'str'>
-----
HDR
<class 'str'>
Traceback (most recent call last):
...
ValueError: malformed node or string: <_ast.Name object at ...>
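The results above can be reproduced with a loop along these lines. This is only a sketch: the describe helper and the try/except are additions for illustration, so that the unquoted string reports its error instead of stopping the script.

```python
from ast import literal_eval

tests = ['123.45', '"56.78"', '"HDR"', 'HDR']


def describe(raw):
    """Return the type name literal_eval produces, or the name of the error it raises."""
    try:
        val = literal_eval(raw)
    except (ValueError, SyntaxError) as exc:
        return type(exc).__name__
    return type(val).__name__


results = {raw: describe(raw) for raw in tests}
for raw, kind in results.items():
    print(repr(raw), '->', kind)
```

The quoted number comes back as a str rather than a float, and the unquoted text raises, so neither of the awkward cases is handled without extra work.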

And you are right - it looks like I resolved this with the commit you linked to, so I should have closed it at the time. Sorry for the confusion!