Open bl-young opened 7 months ago
@bl-young, any updates on this front? I'm still getting the ParseError.
So I took a look at the CSV file that is generated. If you provide pandas.read_csv with nrows, it successfully reads the data up to a point. I tried counting the number of lines in the CSV using a basic approach:
>>> import pandas as pd
>>> from stewi.RCRAInfo import DIR_RCRA_BY_YEAR
>>> report_year = 2017
>>> filepath = DIR_RCRA_BY_YEAR.joinpath(f'br_reporting_{report_year}.csv')
>>> with open(filepath, 'r') as f:
...     count = sum(1 for _ in f)
...
>>> print(count)
2119285
I can open this in pandas.
>>> from stewi.RCRAInfo import RCRA_DATA_PATH
>>> fieldstokeep = pd.read_csv(RCRA_DATA_PATH.joinpath('RCRA_required_fields.txt'),
...                            header=None)
>>> df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
...                  low_memory=False, on_bad_lines='skip', encoding='ISO-8859-1', sep=",",
...                  nrows=2119285)
>>> df.head()
     Handler ID  State  ...  Generation Tons  Waste Code Group
0  AK0000384040     AK  ...            12.25              K171
1  AK0000384040     AK  ...             0.20              K171
2  AK0000384040     AK  ...             0.40              K050
3  AK0000384040     AK  ...             1.50              K050
4  AK0000384040     AK  ...             0.05              K050
>>> df.tail(1).to_dict()
{'Handler ID': {2119284: 'IDD073114654'},
'State': {2119284: 'ID'},
'Handler Name': {2119284: 'US ECOLOGY IDAHO INC SITE B'},
'Location Street Number': {2119284: '20400'},
'Location Street 1': {2119284: 'LEMLEY RD'},
'Location Street 2': {2119284: nan},
'Location City': {2119284: 'GRAND VIEW'},
'Location State': {2119284: 'ID'},
'Location Zip': {2119284: '83624'},
'County Name': {2119284: 'OWYHEE'},
'Generator ID Included in NBR': {2119284: 'Y'},
'Generator Waste Stream Included in NBR': {2119284: 'N'},
'Waste Description': {2119284: '43435-0'},
'Primary NAICS': {2119284: nan},
'Source Code': {2119284: nan},
'Form Code': {2119284: nan},
'Management Method': {2119284: nan},
'Federal Waste Flag': {2119284: nan},
'Generation Tons': {2119284: nan},
'Waste Code Group': {2119284: nan}}
I'm not certain this count is accurate because I was able to read more than that with pandas. I can go higher!
>>> df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
...                  low_memory=False, on_bad_lines='skip', encoding='ISO-8859-1', sep=",",
...                  nrows=2367000)
>>> df.tail(1).to_dict()
{'Handler ID': {2366999: 'IDD073114654'},
'State': {2366999: 'ID'},
'Handler Name': {2366999: 'US ECOLOGY IDAHO INC SITE B'},
'Location Street Number': {2366999: '20400'},
'Location Street 1': {2366999: 'LEMLEY RD'},
'Location Street 2': {2366999: nan},
'Location City': {2366999: 'GRAND VIEW'},
'Location State': {2366999: 'ID'},
'Location Zip': {2366999: '83624'},
'County Name': {2366999: 'OWYHEE'},
'Generator ID Included in NBR': {2366999: 'Y'},
'Generator Waste Stream Included in NBR': {2366999: 'N'},
'Waste Description': {2366999: '43435-0'},
'Primary NAICS': {2366999: nan},
'Source Code': {2366999: nan},
'Form Code': {2366999: nan},
'Management Method': {2366999: nan},
'Federal Waste Flag': {2366999: nan},
'Generation Tons': {2366999: nan},
'Waste Code Group': {2366999: nan}}
Not sure where the upper limit is for nrows, and not sure what happens when you overload nrows.
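On that last point: as far as I know, nrows is only an upper bound, so a value larger than the number of data rows just reads to the end of the file without raising. Below is a minimal sketch on a made-up in-memory CSV (an assumption for illustration, not the RCRAInfo file). It also counts the rows pandas actually parses by iterating in chunks, which is one way to find the effective ceiling for nrows.

```python
import io

import pandas as pd

# Toy CSV standing in for the real file: a header plus three data rows.
csv_text = (
    "Handler ID,State\n"
    "AK0000384040,AK\n"
    "AK0000384040,AK\n"
    "IDD073114654,ID\n"
)

# nrows far beyond the end of the data: pandas stops at end of file.
df = pd.read_csv(io.StringIO(csv_text), header=0, nrows=1000)
rows_read = len(df)

# Count parsed rows without holding the whole file in memory at once.
rows_counted = sum(
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(csv_text), header=0, chunksize=2)
)

print(rows_read, rows_counted)
```

For the real file, the same chunksize loop with the usecols, encoding, and on_bad_lines arguments used above should report how many rows pandas can parse. I have not run it against br_reporting_2017.csv.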
> @bl-young, any updates on this front? I'm still getting the ParseError.
No, I have not had a chance to look closely yet. These ParseErrors can be tricky to track down.
For consistency, and in the meantime, I would recommend using the already processed versions, such as via
getInventory(..., download_if_missing=True)
if that works for your application.
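For anyone who wants to dig into the raw file in the meantime, one possible starting point is to scan it with the standard csv module and flag records whose field count differs from the header. This is only a sketch: find_ragged_rows is a hypothetical helper (not part of stewi or pandas), and the sample data is made up.

```python
import csv
import io


def find_ragged_rows(handle, delimiter=","):
    """Return (line_number, field_count) for records whose width differs from the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    ragged = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the physical line on which this record ended.
            ragged.append((reader.line_num, len(row)))
    return ragged


# Made-up sample: line 3 has an extra field, line 4 is missing one.
sample = io.StringIO(
    "Handler ID,State,Generation Tons\n"
    "AK0000384040,AK,12.25\n"
    "AK0000384040,AK,0.20,unexpected\n"
    "IDD073114654,ID\n"
)
print(find_ragged_rows(sample))  # prints [(3, 4), (4, 2)]
```

To try it on the real file, the handle could be opened with open(filepath, newline='', encoding='ISO-8859-1'). Note that pandas pads records with too few fields with NaN rather than raising, so short records would only show up in a scan like this one.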
Yep. That seems to work! Thanks again for supporting the daisy chain of kwargs down through stewicombo to getInventory.
Originally posted by @dt-woods in https://github.com/USEPA/standardizedinventories/issues/146#issuecomment-1819942869