UChicago-Coase-Sandor / pacer_lib

http://pacer-lib.readthedocs.org/

Description length causing errors #3

Closed synsypa closed 10 years ago

synsypa commented 10 years ago

Dockets with overly long descriptions produce the following error:

    Traceback (most recent call last):
      File "read_docket_0224.py", line 7, in <module>
        a.search_dir("attorney fee")
      File "/cygdrive/c/users/kjiang/projects/henderson/docs_022014/reader_debug.py", line 803, in search_dir
        exclude_term, case_sensitive, within)
      File "/cygdrive/c/users/kjiang/projects/henderson/docs_022014/reader_debug.py", line 762, in search_docket
        for num, row in enumerate(docket_reader):
    _csv.Error: field larger than field limit (131072)

We should truncate descriptions when parsing to prevent this.
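
For anyone who wants to reproduce this without a real docket, here is a minimal sketch. It assumes Python 3 and a stock `csv` module whose default limit matches the 131072 in the traceback; the row contents and the `read_rows` helper are made up for illustration and are not part of pacer_lib. It also shows `csv.field_size_limit()` as a possible reader-side workaround, which is a different approach from the truncation proposed here.

```python
import csv
import io

# Current per-field limit of the csv reader (131072 by default).
default_limit = csv.field_size_limit()

# Hypothetical docket row whose description is one character over the limit.
long_description = "x" * (default_limit + 1)
buffer = io.StringIO()
csv.writer(buffer).writerow(["1", "02/24/2014", long_description])


def read_rows(text):
    """Parse CSV text into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


# Reading the row back with the default limit raises csv.Error.
try:
    read_rows(buffer.getvalue())
    error_message = None
except csv.Error as exc:
    error_message = str(exc)

# Reader-side workaround: raise the limit temporarily, then restore it.
csv.field_size_limit(default_limit * 2)
try:
    rows = read_rows(buffer.getvalue())
finally:
    csv.field_size_limit(default_limit)
```

Raising the limit only helps the reader; truncating at parse time keeps the written files small as well, which is why it may still be the better fix.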

zhangchuck commented 10 years ago

We should truncate and then add a "(TRUNCATED)" tag at the end. The tag is 11 characters long and the field limit is 131,072, which suggests a max description length of 131,061.

Should we truncate to something even shorter? Assuming an average word length of 6 characters (a five-letter word plus whitespace), 131,061 / 6 ≈ 22,000 words. I think we can safely say that we only need a max of about 5,000 words.
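
The arithmetic behind those numbers can be checked with a quick sketch. The constant names are made up for illustration, and the 6-characters-per-word figure is only the rough assumption stated above.

```python
FIELD_LIMIT = 131072   # limit reported in the csv error
TAG = "(TRUNCATED)"    # 11 characters
CHARS_PER_WORD = 6     # assumption: five-letter word plus one space

# Longest description that still fits once the tag is appended.
max_description = FIELD_LIMIT - len(TAG)          # 131061

# Rough number of words that fit in that many characters.
approx_words = max_description // CHARS_PER_WORD  # 21843

# Character budget for a 5,000-word cap under the same assumption.
chars_for_5000_words = 5000 * CHARS_PER_WORD      # 30000
```

So a 5,000-word cap would correspond to roughly 30,000 characters, well below the csv limit.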

I think this should be implemented in docket_parser().

zhangchuck commented 10 years ago

Truncated in docket_parser.parse_data():

        # Truncate extremely long cells and mark them as truncated:
        for n, content in enumerate(row_contents):
            if len(content) > 20000:
                # Keep the first 20,000 characters, then append the tag.
                row_contents[n] = content[:20000] + "(TRUNCATED)"
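
For trying the same idea outside the parser, here is a standalone sketch. `truncate_cells`, `MAX_CELL_LENGTH` and `TRUNCATION_TAG` are hypothetical names chosen for illustration, not pacer_lib functions, and this version returns a new list rather than modifying the row in place. It keeps exactly `max_length` characters before the tag.

```python
TRUNCATION_TAG = "(TRUNCATED)"
MAX_CELL_LENGTH = 20000  # same threshold as the check in parse_data()


def truncate_cells(row_contents, max_length=MAX_CELL_LENGTH, tag=TRUNCATION_TAG):
    """Return a copy of the row with over-long cells cut to max_length and tagged."""
    return [
        cell[:max_length] + tag if len(cell) > max_length else cell
        for cell in row_contents
    ]


# Hypothetical docket row: entry number, date, and a very long description.
row = ["12", "02/24/2014", "A" * 25000]
truncated = truncate_cells(row)
```

Short cells pass through unchanged, and a truncated cell ends up `max_length + len(tag)` characters long, which stays far under the 131072 limit from the original error.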