USEPA / standardizedinventories

Standardized Release and Waste Inventories
MIT License

ParserError in RCRAInfo #151

Open · bl-young opened this issue 7 months ago

bl-young commented 7 months ago
          So I tried accessing other years of RCRAInfo data (2013, 2015, 2017, and 2019). All worked except for one (2017), which produced the errors below. I wasn't able to track down which CSV file it keeps crashing on; maybe there's a debug statement that could point to it.
INFO RCRAInfo_2017 not found in ~/stewi/flowbyfacility
INFO requested inventory does not exist in local directory, it will be generated...
INFO file extraction complete
INFO organizing data for BR_REPORTING from 2017...
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_0.csv
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_1.csv
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_2.csv
INFO saving to ~/stewi/RCRAInfo Data Files/RCRAInfo_by_year/br_reporting_2017.csv...
INFO generating inventory files for 2017
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[9], line 1
----> 1 stewi.getInventory('RCRAInfo', 2017)

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/__init__.py:82, in getInventory(inventory_acronym, year, stewiformat, filters, filter_for_LCI, US_States_Only, download_if_missing, keep_sec_cntx)
     66 """Return or generate an inventory in a standard output format.
     67 
     68 :param inventory_acronym: like 'TRI'
   (...)
     79 :return: dataframe with standard fields depending on output format
     80 """
     81 f = ensure_format(stewiformat)
---> 82 inventory = read_inventory(inventory_acronym, year, f,
     83                            download_if_missing)
     85 if (not keep_sec_cntx) and ('Compartment' in inventory):
     86     inventory['Compartment'] = (inventory['Compartment']
     87                                 .str.partition('/')[0])

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:268, in read_inventory(inventory_acronym, year, f, download_if_missing)
    265 else:
    266     log.info('requested inventory does not exist in local directory, '
    267              'it will be generated...')
--> 268     generate_inventory(inventory_acronym, year)
    269 inventory = load_preprocessed_output(meta, paths)
    270 if inventory is None:

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:313, in generate_inventory(inventory_acronym, year)
    309     RCRAInfo.main(Option = 'A', Year = [year],
    310                   Tables = ['BR_REPORTING', 'HD_LU_WASTE_CODE'])
    311     RCRAInfo.main(Option = 'B', Year = [year],
    312                   Tables = ['BR_REPORTING'])
--> 313     RCRAInfo.main(Option = 'C', Year = [year])
    314 elif inventory_acronym == 'TRI':
    315     import stewi.TRI as TRI

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:444, in main(**kwargs)
    441     organize_br_reporting_files_by_year(kwargs['Tables'], year)
    443 elif kwargs['Option'] == 'C':
--> 444     Generate_RCRAInfo_files_csv(year)
    446 elif kwargs['Option'] == 'D':
    447     """State totals are compiled from the Trends Analysis website
    448     and stored as csv. New years will be added as data becomes
    449     available"""

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:219, in Generate_RCRAInfo_files_csv(report_year)
    216 fieldstokeep = pd.read_csv(RCRA_DATA_PATH.joinpath('RCRA_required_fields.txt'),
    217                            header=None)
    218 # on_bad_lines requires pandas >= 1.3
--> 219 df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
    220                  low_memory=False, on_bad_lines='skip',
    221                  encoding='ISO-8859-1')
    223 log.info(f'completed reading {filepath}')
    224 # Checking the Waste Generation Data Health

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    944     dtype_backend=dtype_backend,
    945 )
    946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:617, in _read(filepath_or_buffer, kwds)
    614     return parser
    616 with parser:
--> 617     return parser.read(nrows)

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1748, in TextFileReader.read(self, nrows)
   1741 nrows = validate_integer("nrows", nrows)
   1742 try:
   1743     # error: "ParserBase" has no attribute "read"
   1744     (
   1745         index,
   1746         columns,
   1747         col_dict,
-> 1748     ) = self._engine.read(  # type: ignore[attr-defined]
   1749         nrows
   1750     )
   1751 except Exception:
   1752     self.close()

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py:239, in CParserWrapper.read(self, nrows)
    236         data = _concatenate_chunks(chunks)
    238     else:
--> 239         data = self._reader.read(nrows)
    240 except StopIteration:
    241     if self._first_chunk:

File parsers.pyx:825, in pandas._libs.parsers.TextReader.read()

File parsers.pyx:913, in pandas._libs.parsers.TextReader._read_rows()

File parsers.pyx:890, in pandas._libs.parsers.TextReader._check_tokenize_status()

File parsers.pyx:2058, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Originally posted by @dt-woods in https://github.com/USEPA/standardizedinventories/issues/146#issuecomment-1819942869

dt-woods commented 6 months ago

@bl-young, any updates on this front? I'm still getting the ParserError.

dt-woods commented 6 months ago

So I took a look at the CSV file that is generated. If you provide pandas.read_csv with nrows, it successfully reads the data up to a point. I tried counting the number of lines in the CSV using a basic approach:

>>> from stewi.RCRAInfo import DIR_RCRA_BY_YEAR
>>> report_year = 2017
>>> filepath = DIR_RCRA_BY_YEAR.joinpath(f'br_reporting_{str(report_year)}.csv')
>>> with open(filepath, 'r') as f:
...     count = sum(1 for _ in f)
>>> print(count)
2119285

I can open this in pandas.

>>> from stewi.RCRAInfo import RCRA_DATA_PATH
>>> fieldstokeep = pd.read_csv(RCRA_DATA_PATH.joinpath('RCRA_required_fields.txt'),
...                            header=None)
>>> df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
...     low_memory=False, on_bad_lines='skip', encoding='ISO-8859-1', sep=",",
...     nrows=2119285)
>>> df.head()
     Handler ID State  ... Generation Tons Waste Code Group
0  AK0000384040    AK  ...           12.25             K171
1  AK0000384040    AK  ...            0.20             K171
2  AK0000384040    AK  ...            0.40             K050
3  AK0000384040    AK  ...            1.50             K050
4  AK0000384040    AK  ...            0.05             K050
>>> df.tail(1).to_dict()
{'Handler ID': {2119284: 'IDD073114654'},
 'State': {2119284: 'ID'},
 'Handler Name': {2119284: 'US ECOLOGY IDAHO INC SITE B'},
 'Location Street Number': {2119284: '20400'},
 'Location Street 1': {2119284: 'LEMLEY RD'},
 'Location Street 2': {2119284: nan},
 'Location City': {2119284: 'GRAND VIEW'},
 'Location State': {2119284: 'ID'},
 'Location Zip': {2119284: '83624'},
 'County Name': {2119284: 'OWYHEE'},
 'Generator ID Included in NBR': {2119284: 'Y'},
 'Generator Waste Stream Included in NBR': {2119284: 'N'},
 'Waste Description': {2119284: '43435-0'},
 'Primary NAICS': {2119284: nan},
 'Source Code': {2119284: nan},
 'Form Code': {2119284: nan},
 'Management Method': {2119284: nan},
 'Federal Waste Flag': {2119284: nan},
 'Generation Tons': {2119284: nan},
 'Waste Code Group': {2119284: nan}}

I'm not certain this line count is accurate, because I was able to read more rows than that with pandas. I can go higher!

>>> df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
...     low_memory=False, on_bad_lines='skip', encoding='ISO-8859-1', sep=",",
...     nrows=2367000)
>>> df.tail(1).to_dict()
{'Handler ID': {2366999: 'IDD073114654'},
 'State': {2366999: 'ID'},
 'Handler Name': {2366999: 'US ECOLOGY IDAHO INC SITE B'},
 'Location Street Number': {2366999: '20400'},
 'Location Street 1': {2366999: 'LEMLEY RD'},
 'Location Street 2': {2366999: nan},
 'Location City': {2366999: 'GRAND VIEW'},
 'Location State': {2366999: 'ID'},
 'Location Zip': {2366999: '83624'},
 'County Name': {2366999: 'OWYHEE'},
 'Generator ID Included in NBR': {2366999: 'Y'},
 'Generator Waste Stream Included in NBR': {2366999: 'N'},
 'Waste Description': {2366999: '43435-0'},
 'Primary NAICS': {2366999: nan},
 'Source Code': {2366999: nan},
 'Form Code': {2366999: nan},
 'Management Method': {2366999: nan},
 'Federal Waste Flag': {2366999: nan},
 'Generation Tons': {2366999: nan},
 'Waste Code Group': {2366999: nan}}

Not sure where the upper limit is for nrows, or what happens when nrows exceeds the number of rows actually in the file.
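One way to narrow down where the C parser chokes (a sketch, not from this thread; the inline CSV below is a synthetic stand-in for br_reporting_2017.csv) is to read with chunksize and count how many rows parse cleanly before the ParserError is raised:

```python
import io

import pandas as pd
from pandas.errors import ParserError

# Synthetic stand-in for the problematic file: the last record opens a
# quote that is never closed, which the C engine reports as a ParserError.
bad_csv = 'a,b\n1,2\n3,4\n5,"unterminated\n7,8\n'

# Iterate in small chunks so the row count survives up to the failure point.
reader = pd.read_csv(io.StringIO(bad_csv), chunksize=1)
rows_read = 0
error_row = None
try:
    for chunk in reader:
        rows_read += len(chunk)
except ParserError:
    # rows_read now tells you roughly where in the file the parser failed
    error_row = rows_read

print(f'parsed {rows_read} rows cleanly before the parser failed')
```

On the real file you would replace the StringIO with the path to br_reporting_2017.csv (and a larger chunksize); the surviving row count then brackets the offending region for manual inspection.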

bl-young commented 6 months ago

@bl-young, any updates on this front? I'm still getting the ParserError.

No, I have not had a chance to look closely yet. These ParserErrors can be tricky to track down.

For consistency, and in the meantime, I would recommend using the already-processed versions, such as via getInventory(..., download_if_missing=True), if that works for your application.

dt-woods commented 6 months ago

Yep. That seems to work! Thanks again for supporting the daisy chain of kwargs down through stewicombo to getInventory.