Earth-Information-System / fireatlas


NOAA20 Errors on Reading CSV in v2 and v3 #78

Closed ranchodeluxe closed 1 month ago

ranchodeluxe commented 1 month ago

Problem

Looks like pandas is having trouble reading the NOAA20 raw files:

Exception: "ParserError('Error tokenizing data. C error: Expected 1 fields in line 8, saw 2\\n')"
Traceback: '  File "/app/fireatlas/fireatlas/utils.py", line 14, in wrap\n    result = f(*args, **kwargs)\n  File "/app/fireatlas/fireatlas/preprocess.py", line 240, in preprocess_input_file\n    df = FireIO.read_VJ114IMGTDL(filepath)\n  File "/app/fireatlas/fireatlas/FireIO.py", line 494, in read_VJ114IMGTDL\n    df = pd.read_csv(filepath)\n  File "/opt/conda/envs/vanilla/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv\n    return _read(filepath_or_buffer, kwds)\n  File "/opt/conda/envs/vanilla/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 617, in _read\n    return parser.read(nrows)\n  File "/opt/conda/envs/vanilla/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1748, in read\n    ) = self._engine.read(  # type: ignore[attr-defined]\n  File "/opt/conda/envs/vanilla/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read\n    chunks = self._reader.read_low_memory(nrows)\n  File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory\n  File "parsers.pyx", line 904, in pandas._libs.parsers.TextReader._read_rows\n  File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows\n  File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status\n  File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error\n'

Traceback (most recent call last):
  File "/app/fireatlas/maap_runtime/../fireatlas/FireRunDaskCoordinator.py", line 260, in <module>
    Run([args.regnm, args.bbox], args.tst, args.ted)
  File "/app/fireatlas/fireatlas/utils.py", line 14, in wrap
    result = f(*args, **kwargs)
  File "/app/fireatlas/maap_runtime/../fireatlas/FireRunDaskCoordinator.py", line 197, in Run
    client.gather(data_update_futures)
  File "/opt/conda/envs/vanilla/lib/python3.10/site-packages/distributed/client.py", line 2449, in gather
    return self.sync(
  File "/app/fireatlas/fireatlas/utils.py", line 14, in wrap
    result = f(*args, **kwargs)
  File "/app/fireatlas/fireatlas/preprocess.py", line 240, in preprocess_input_file
    df = FireIO.read_VJ114IMGTDL(filepath)
  File "/app/fireatlas/fireatlas/FireIO.py", line 494, in read_VJ114IMGTDL
    df = pd.read_csv(filepath)
  File "/opt/conda/envs/vanilla/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/envs/vanilla/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 617, in _read
    return parser.read(nrows)
  File "/opt/conda/envs/vanilla/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1748, in read

jsignell commented 1 month ago

Is this the same version of pandas as always or is it possible there was an update?

ranchodeluxe commented 1 month ago

> Is this the same version of pandas as always or is it possible there was an update?

Yeah, we have geopandas pinned but not pandas. However, the last pandas release was April 10th, so we would've seen this before now.

ranchodeluxe commented 1 month ago

When I get back I'll inspect the files to see if maybe the data is bad, like an extra delimiter.
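
One way to do that inspection is to print the lines around the one the parser complained about (line 8 in the exception above). This is only a minimal sketch: `peek_lines` is a hypothetical helper, not something that exists in fireatlas, and the path you pass in is whichever raw file failed.

```python
from itertools import islice


def peek_lines(filepath, center, context=2):
    """Return (line_number, text) pairs around a 1-indexed line number.

    Hypothetical debugging helper, not part of fireatlas.
    """
    start = max(center - context, 1)
    # errors="replace" so unexpected bytes do not abort the inspection
    with open(filepath, "r", errors="replace") as f:
        window = islice(f, start - 1, center + context)
        return [(n, line.rstrip("\n")) for n, line in enumerate(window, start=start)]
```

Calling `peek_lines(filepath, 8)` on a failing file should make it obvious whether line 8 has a stray delimiter or is not CSV at all.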

zebbecker commented 1 month ago

Looks like there is something funky with authentication in the data ingestion workflow. Starting on day 194 (Jul 12), we have collected HTML Earthdata login pages instead of the actual CSV data.

ranchodeluxe commented 1 month ago

> Looks like there is something funky with authentication in the data ingestion workflow. Starting on day 194 (Jul 12), we have collected HTML Earthdata login pages instead of the actual CSV data.

Yep, I see it now too @zebbecker. We've been here before and need to come up with a token refresh flow.

jsignell commented 1 month ago

Maybe we should add ourselves an error message that reminds us that the token might have expired.
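
A rough sketch of what such a check could look like, assuming the bad downloads start with an HTML doctype or `<html>` tag. The names `PossibleAuthError` and `ensure_not_html` are made up for illustration and do not exist in fireatlas; the check would run just before the `pd.read_csv(filepath)` call in the reader.

```python
class PossibleAuthError(RuntimeError):
    """Raised when a downloaded 'CSV' looks like an HTML page instead.

    Hypothetical exception type, not part of fireatlas.
    """


def looks_like_html(filepath, nbytes=512):
    """Return True if the start of the file looks like HTML rather than CSV."""
    with open(filepath, "rb") as f:
        head = f.read(nbytes).lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


def ensure_not_html(filepath):
    """Fail early with a hint about the token instead of a pandas ParserError."""
    if looks_like_html(filepath):
        raise PossibleAuthError(
            f"{filepath} looks like an HTML page, not CSV. "
            "The Earthdata token may have expired; try refreshing it "
            "and re-downloading the file."
        )
```

With this in place, a stale token would surface as a `PossibleAuthError` naming the file, rather than as "Error tokenizing data" deep inside pandas.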

ranchodeluxe commented 1 month ago

> Maybe we should add ourselves an error message that reminds us that the token might have expired.

The bigger issue is: how do we alarm on that message, and where would we see it? We could build something with SQS, but honestly that seems like something the async job system should do for us or let us tap into.