use more read_csv features to get even cleaner BB3 data

7yl4r commented 3 years ago

Current usage of read_csv can be improved to help deal with malformed rows:

import pandas

bb3_df = pandas.read_csv(
    FILEPATH,
    sep='\t',
    on_bad_lines='warn',  # default is 'error'; 'skip' drops bad lines without warning
    names=["date", "time", "470nm", "470nm_data", "532nm", "532nm_data", "650nm", "650nm_data", "mystery_column"],
    skiprows=1,
    # TODO: use more features to better filter
    # mangle_dupe_cols=False,  # not sure if this will be useful
    # dtype={"470nm": np.int64, ...}  # this should fix a lot
    # parse_dates=[["date", "time"]]  # this should parse the two columns together as one datetime
)

Using parse_dates should be somewhat straightforward.

My hope is that the dtype parameter will help a lot. If used properly it should stop any non-integer values from getting into integer-only columns. After doing that and dropping all rows with any NaNs we might be down to something reasonable.

Sadly, there will probably still be bad values in there. Next up for dealing with that is probably something like a Kalman filter.

sebastiandig commented 3 years ago

@7yl4r Based on this link, it seems that we should not force the column types using dtype, because read_csv will error out if even a single value cannot be converted.
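To make the dtype concern concrete, here is a minimal sketch of what happens when one value in a forced-integer column is corrupted. The two-column sample text and the helper name `read_strict` are made up for illustration; only the column names come from the snippet above.

```python
import io

import pandas as pd

# Hypothetical sample: the second data row has a corrupted count ("12x4").
BAD_SAMPLE = "470nm\t470nm_data\n470\t1234\n470\t12x4\n"
GOOD_SAMPLE = "470nm\t470nm_data\n470\t1234\n470\t1240\n"


def read_strict(text):
    """Try to read with a forced integer dtype and report what happened."""
    try:
        pd.read_csv(io.StringIO(text), sep="\t", dtype={"470nm_data": "int64"})
    except ValueError as err:
        # A single unconvertible value aborts the whole read.
        return f"failed: {err}"
    return "ok"


bad_result = read_strict(BAD_SAMPLE)
good_result = read_strict(GOOD_SAMPLE)
```

With the corrupted row present the whole read fails rather than dropping just that row, which is why dtype alone is not enough for this file.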

However, I did find something else that gives us the behavior we wanted from dtype: the converters argument, which applies a function to each value in a column. The function returns the value as a float if it can be converted and NaN otherwise. That's located here. I chose float because the values will turn into floats anyway when we apply the calibration coefficient.
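A rough sketch of the converters approach, assuming a helper along these lines. The function name `to_float_or_nan` and the sample text are hypothetical and not taken from the linked code.

```python
import io

import numpy as np
import pandas as pd


def to_float_or_nan(value):
    """Return value as a float, or NaN if it cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan


# Hypothetical sample: one good count, one corrupted count, one empty field.
SAMPLE = "470nm\t470nm_data\n470\t1234\n470\t12x4\n470\t\n"

df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep="\t",
    # read_csv passes each raw field of the column to the function as a string.
    converters={"470nm_data": to_float_or_nan},
)
```

Unlike dtype, the bad rows survive the read with NaN in place of the bad value, so they can be dropped afterwards with dropna.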

If you could check over what I have so far, that would be great. I still need to implement the check for extreme values, then the actual separation part. I'm thinking that we keep all the data in one file and just add a column that represents the site, so we only have to work with one file instead of many.

7yl4r commented 3 years ago

I think what you have looks good but it is a bit hard to tell what changed. I am thinking that this week it may be good to work some more on histograms and outlier detection.

Based on a discussion we just had in a 3d wetlands meeting, @luislizcano might have use for creating histograms and other exploratory data visualizations too.

sebastiandig commented 3 years ago

This is mostly taking what you had done, but including it in the master sheet. The main thing I tried doing was adding more read_csv arguments:

using converters - converts the values in the columns to floats, or NaN when the conversion fails
using parse_dates - combines date and time together, but won't change them to datetime because of the erroneous lines
using 'to_datetime' - converts the combined column to datetime64[ns], with NaT for values that can't be parsed
using idx - removes rows where the wavelength column doesn't match the expected wavelength value
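The last two steps in the list above could look roughly like this. The sample rows and the date format string are assumptions made for illustration; the real BB3 file may use a different date format.

```python
import pandas as pd

# Hypothetical rows: row 1 has a bad wavelength label, row 2 a bad date.
df = pd.DataFrame(
    {
        "date": ["05/12/21", "05/12/21", "garbage"],
        "time": ["10:00:01", "10:00:02", "10:00:03"],
        "470nm": [470.0, 47.0, 470.0],
        "470nm_data": [1200.0, 1210.0, 1190.0],
    }
)

# Join date and time, then coerce anything unparseable to NaT
# instead of raising. The format string is an assumption.
df["datetime"] = pd.to_datetime(
    df["date"] + " " + df["time"],
    format="%m/%d/%y %H:%M:%S",
    errors="coerce",
)

# Boolean index: keep rows whose wavelength label is the expected value
# and whose timestamp parsed successfully.
idx = (df["470nm"] == 470) & df["datetime"].notna()
clean = df[idx]
```

Here only the first row survives both checks. The same wavelength check would be repeated for the 532nm and 650nm columns.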

I still have to add the filter for values. According to the manual, the digital counts can only range from 0 to 4120 +/- 5. After removing impossible values, histograms might be useful for correcting for times when the instrument was in air vs in water. I'm pretty sure the values out of water are closer to 4120.
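One possible shape for the value filter, as a sketch only. It reads "0 to 4120 +/- 5" as an upper bound of 4125 and a lower bound of 0, which is an interpretation and should be checked against the manual. The helper name `mask_impossible` and the sample counts are hypothetical.

```python
import pandas as pd

MAX_COUNT = 4120  # maximum digital count quoted from the manual
TOLERANCE = 5     # the "+/- 5" from the manual, applied here to the upper bound only


def mask_impossible(counts, low=0, high=MAX_COUNT + TOLERANCE):
    """Replace counts outside [low, high] with NaN, keeping the row positions."""
    return counts.where(counts.between(low, high))


# Hypothetical counts: one negative, one far too large, one already missing.
counts = pd.Series([1200.0, -3.0, 4123.0, 9999.0, float("nan")])
filtered = mask_impossible(counts)
```

Masking to NaN rather than dropping keeps the rows aligned with the other wavelength columns, and the result can go straight into something like `filtered.plot.hist()` (which needs matplotlib) to look at the in-air vs in-water values.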

USF-IMARS / python-tech-workgroup

use more read_csv features to get even cleaner BB3 data #22