OpenWaterFoundation / cdss-app-snodas-tools

Colorado's Decision Support Systems (CDSS) Snow Data Assimilation System (SNODAS) Tools

Check and, if necessary, improve robustness of handling the main CSV data files #36

Open smalers opened 3 years ago

smalers commented 3 years ago

The ByBasin files contain a full history of data. These files are appended to each day. Given that we need to rerun the full period on GCP anyway, perhaps now is the time to confirm that some use cases are handled. Otherwise, if something goes wrong with the files, we may be forced to rerun the full period. Below are technical considerations:

  1. It would be nice if the files were guaranteed to be sorted from the earliest date to the latest. Otherwise, an out-of-sequence record would be difficult to spot. TSTool does not care because it parses the date/time and uses that to set data. For example, what happens if one period is processed, then another period after a gap of several days, and then the gap days are processed? Do the late records get inserted in the proper order or do they end up at the end of the file? (See the sorting sketch after this list.)
  2. Does rerunning a day update that day's record in the file? This can be tested by editing the file beforehand and then rerunning to see if the record gets updated. Hopefully the second run does not leave a redundant record in the file. (See the upsert sketch after this list.)
  3. Sorting, whether in Python or with the Linux sort command, should probably just use string sorts rather than having to parse the columns. Sorting should retain the column headings in row 1: the header row could be removed, the data sorted, and the header re-added. I think we avoided using # comments because we wanted the CSV files to read directly into Excel.
  4. There is a risk that something causes the ByBasin files to be corrupted. If error handling is OK, then the risk is low. Below are some options to decrease risk:
    1. Work with the State to back up the files so that they can be restored. This may be a pain. Versioning can probably be turned on for the bucket, but that might actually add a lot of storage given that a 500K file for each basin would be saved each day.
    2. Add logic to make a copy of the previous version(s) before processing the current day. This may not work well because, if the previous version was already screwed up, the copy would not help.
    3. Periodically (January 1?) zip the files and save them to a folder so that, worst case, only one year would need to be rerun. This would require that rerunning the same day overrides the previous update (which may have missing or bad values). The zip file would need to be in a separate location from the working files because we don't want the daily process re-uploading it to GCP. I think the zip files could just live on the GCP VM because they are really a backup. Therefore, there would be as many backups as there are January 1 dates in the period. (See the backup sketch below.)
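
For items 1 and 3 above, a minimal sketch of sorting a ByBasin CSV by date while keeping the column headings in row 1. It assumes the date is in the first column and is formatted so that a plain string sort gives chronological order (e.g. YYYY-MM-DD); the function name and path handling are placeholders, not existing code in the tools.

```python
import csv
from pathlib import Path


def sort_bybasin_csv(csv_path: Path) -> None:
    """Sort a ByBasin CSV by its date column, keeping the header in row 1.

    Assumes the first column holds the date in a string-sortable format
    such as YYYY-MM-DD; adjust the sort key if the actual layout differs.
    """
    with csv_path.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)          # keep the column headings
        rows = [row for row in reader if row]

    rows.sort(key=lambda row: row[0])  # plain string sort on the date column

    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```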
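
For item 2, a sketch of making a rerun replace the existing record for the day rather than appending a duplicate, under the same assumption that the date is in the first column:

```python
import csv
from pathlib import Path


def upsert_daily_record(csv_path: Path, new_row: list[str]) -> None:
    """Insert or replace the record whose date matches new_row[0].

    Rerunning a day then overwrites the existing line instead of
    leaving a redundant record, and the file stays sorted by date.
    """
    with csv_path.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Drop any existing record for this date.
        rows = [row for row in reader if row and row[0] != new_row[0]]

    rows.append(new_row)
    rows.sort(key=lambda row: row[0])

    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```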
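
For backup option 3, a sketch of the annual zip step. The directory layout and archive naming are assumptions; the backup folder would need to live outside the tree that the daily process uploads to GCP (e.g. elsewhere on the GCP VM).

```python
import zipfile
from datetime import date
from pathlib import Path


def backup_bybasin_files(bybasin_dir: Path, backup_dir: Path) -> None:
    """Zip all ByBasin CSVs into a dated archive outside the working folder.

    Intended to run once a year (e.g. January 1) so that, worst case,
    only the current year would need to be reprocessed.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    archive = backup_dir / f"ByBasin-backup-{date.today().isoformat()}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for csv_file in sorted(bybasin_dir.glob("*.csv")):
            zf.write(csv_file, arcname=csv_file.name)
```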