@7yl4r Based on this link, it seems that we should not force the columns using `dtype`, because it will error out if even a single value cannot be converted. However, I did find something else that gives the exact functionality we want from `dtype`: it converts each value using the `converters` argument along with a function that checks whether the value is float-capable and returns NaN otherwise. That's located here. I chose float because when we apply the calibration coefficient, the values will turn into floats anyway.
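A minimal sketch of what that converter idea could look like, assuming hypothetical file and column names (`radiometer_data.csv`, `ed`, `lu`, `lsky`):

```python
import numpy as np
import pandas as pd

def to_float_or_nan(value):
    """Return the value as a float, or NaN if it cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan

# hypothetical file/column names, just to illustrate the converters argument
df = pd.read_csv(
    "radiometer_data.csv",
    converters={col: to_float_or_nan for col in ["ed", "lu", "lsky"]},
)
```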
If you could check over what I have so far, that would be great. I still need to implement the check for extreme values and then the actual separation step. I'm thinking we keep all the data in one file and just add a column that represents the site, so we only have to work with one file instead of many.
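For the one-file idea, a rough sketch with hypothetical per-site file names and a `site` column added before concatenating:

```python
import pandas as pd

# hypothetical per-site files; the real paths and names will differ
site_files = {
    "site_A": "site_A.csv",
    "site_B": "site_B.csv",
}

frames = []
for site, path in site_files.items():
    site_df = pd.read_csv(path)
    site_df["site"] = site  # tag every row with the site it came from
    frames.append(site_df)

# one combined table with a site column instead of many separate files
all_data = pd.concat(frames, ignore_index=True)
all_data.to_csv("all_sites.csv", index=False)
```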
I think what you have looks good but it is a bit hard to tell what changed. I am thinking that this week it may be good to work some more on histograms and outlier detection.
Based on a discussion we just had in a 3d wetlands meeting, @luislizcano might have a use for creating histograms and other exploratory data visualizations too.
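For the histogram side, a quick pandas/matplotlib sketch (file and column names are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("all_sites.csv")  # hypothetical combined file

# histogram of a (hypothetical) digital-count column to eyeball outliers
df["counts"].plot.hist(bins=100)
plt.xlabel("digital counts")
plt.ylabel("frequency")
plt.show()
```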
This is mostly taking what you had done, but including it in the master sheet. The main thing I tried doing was adding more `read_csv` arguments:

- `converters` - converts the values in the columns to floats or NaN
- `parse_dates` - combines the date and time columns, but the column won't change to `datetime[ns]` because of the erroneous lines, which end up as NaT
- `idx` - removes rows that don't match the wavelength value

I still have to add the filter for values. According to the manual, the digital counts can only be from 0 to 4120 +/- 5. Histograms might be useful when looking at the data after removing impossible values, to correct for times when the instrument is in air vs. in water; I'm pretty sure the values out of water are closer to 4120.
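A possible sketch of that value filter, treating 0 to 4120 with a +/- 5 tolerance as the valid window (file and column names are placeholders):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("all_sites.csv")  # hypothetical combined file

# valid digital-count range per the manual: 0 to 4120, with +/- 5 tolerance
LOW, HIGH = 0 - 5, 4120 + 5

# replace impossible counts with NaN rather than dropping whole rows,
# so histograms of the remaining values can still show the in-air vs in-water split
valid = df["counts"].between(LOW, HIGH)
df.loc[~valid, "counts"] = np.nan
```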
Current usage of `read_csv` can be improved to help deal with malformed rows:

- Using `parse_dates` should be somewhat straightforward.
- My hope is that the `dtype` parameter will help a lot. If used properly, it should stop any non-integer values from getting into integer-only columns. After doing that and dropping all rows with any NaNs, we might be down to something reasonable.

Sadly, there will probably still be bad values in there. The next step for dealing with those is probably something like a Kalman filter.
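A hedged sketch of what that `read_csv` call might look like (file and column names are placeholders; as noted above, a value that cannot be parsed as an integer will raise an error rather than become NaN):

```python
import pandas as pd

df = pd.read_csv(
    "raw_log.csv",                               # hypothetical file name
    dtype={"counts": "Int64"},                   # nullable integer column; a value that cannot be
                                                 # parsed as an integer raises, which is why the
                                                 # converters approach may be needed instead
    parse_dates={"datetime": ["date", "time"]},  # combine the date and time columns
    on_bad_lines="skip",                         # skip rows with the wrong number of fields (pandas >= 1.3)
)

# drop any row that still has a missing value after parsing
df = df.dropna()
```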