JuliaStats / TimeSeries.jl

Time series toolkit for Julia

readtimearray on duplicate timestamps behaviour #451

Open klangner opened 4 years ago

klangner commented 4 years ago

Currently, when trying to read data from a CSV file with duplicate timestamps, the function will crash.

Maybe it would be better to add a parameter to this function so it reads as many rows as possible and returns a partial result instead of crashing? Or maybe just skip duplicate or out-of-order items?

BTW, is there some kind of optional type in Julia, like Haskell's Maybe? Then the function could at least return that type instead of crashing the program.

iblislin commented 4 years ago

Hi @klangner

Maybe it would be better to add a parameter to this function so it reads as many rows as possible and returns a partial result instead of crashing? Or maybe just skip duplicate or out-of-order items?

Well, in this case, I think you can load the CSV into a DataFrame first, remove the duplicated rows, and then call TimeArray(df, timestamp = :MyTimeColumn).
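A minimal sketch of that workaround, assuming CSV.jl and DataFrames.jl are installed; the file name "data.csv" and the column name :MyTimeColumn are placeholders, not real names from this issue:

```julia
using CSV, DataFrames, TimeSeries

# Hypothetical input file and time-column name.
df = CSV.read("data.csv", DataFrame)

# Keep only the first row for each timestamp, then make sure the index is sorted.
df = unique(df, :MyTimeColumn)
sort!(df, :MyTimeColumn)

ta = TimeArray(df, timestamp = :MyTimeColumn)
```

`unique(df, :MyTimeColumn)` keeps the first occurrence of each timestamp; which occurrence to keep is exactly the policy question raised later in this thread.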

BTW, is there some kind of optional type in Julia, like Haskell's Maybe? Then the function could at least return that type instead of crashing the program.

I guess it's Missing?

imbrem commented 4 years ago

I currently work around this with a very dirty hack, namely passing in open(`uniq FILE_NAME`), but I would appreciate a flag to just ignore out-of-order entries.
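For reference, that hack can look something like the sketch below, assuming readtimearray accepts an IO source as well as a filename (and with "data.csv" as a placeholder file name). Note that the external `uniq` command only drops adjacent, byte-identical lines, so it removes a row only when both the timestamp and the values repeat exactly:

```julia
using TimeSeries

# Stream the file through the external `uniq` command before parsing.
io = open(`uniq data.csv`)
ta = readtimearray(io)
close(io)
```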

iblislin commented 4 years ago

Hi @imbrem Could you show an example case that contains duplicated timestamps? I'm also wondering:

  1. If there is a time index in ascending order, 2011/1/1, 2011/1/2, 2011/1/2, 2011/1/2, 2011/1/3, with three duplicated timestamps, which ones do you expect to be skipped?
  2. If there is a time index in descending order, which one do you expect to be skipped?

About out-of-order cases: I'm also curious whether there is an algorithm that can determine which entries are out of order.

klangner commented 4 years ago

Hi @iblis17, I would say that you can find duplicate timestamps when dealing with daylight saving time. Quite often in the data you will see one hour missing and, half a year later, one hour of duplicated data. It can also happen when the data is not added in increasing time order, e.g. you get the data from multiple sensors, but in batch mode, so you end up with batches that can have overlapping timestamps. IMHO, when you work with real data, anything can happen :-)

iblislin commented 4 years ago

I would say that you can find duplicate timestamps when dealing with daylight saving time. Quite often in the data you will see one hour missing and, half a year later, one hour of duplicated data.

Oh, so in this case the data is still in the proper order; only the time index is not ideal. I think applying lag, lead, or other time series methods on it is still reasonable. I will consider relaxing the constraint on the time index, maybe allowing duplicates.

It can also happen when the data is not added in increasing time order, e.g. you get the data from multiple sensors, but in batch mode, so you end up with batches that can have overlapping timestamps.

But for this case, I do not think the methods provided by TimeSeries.jl can be applied to these data. It makes no sense for users to lag, lead, or compute moving statistics on it. So what functionality can we improve/provide to help with this kind of data?

iblislin commented 4 years ago

Ah, I just recalled that we have an unchecked option, so you can get an out-of-order or duplicated time index to work:

```julia
TimeArray(ts, vector; unchecked = true)
```

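For instance, a sketch with a duplicated hour like the DST case above (the concrete dates and values are made up for illustration):

```julia
using Dates, TimeSeries

# A repeated hour in the index; unchecked = true skips the validation
# that the time index is sorted and unique.
ts   = [DateTime(2019, 11, 3, 1), DateTime(2019, 11, 3, 1), DateTime(2019, 11, 3, 2)]
vals = [1.0, 2.0, 3.0]
ta = TimeArray(ts, vals; unchecked = true)
```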
iblislin commented 4 years ago

Anyway, I made a PR that accepts a duplicated but sorted time index.

https://github.com/JuliaStats/TimeSeries.jl/pull/455

imbrem commented 4 years ago

That works fine, but could it also be possible to add an option to actually remove out-of-order or duplicate timestamps, and/or to go back and update their values in the result array? If desired, I can write the PR for this.
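One possible shape for the "go back and update" behaviour, sketched here as a standalone helper rather than anything in TimeSeries.jl's actual API (the function name is hypothetical):

```julia
using Dates

# Hypothetical helper: keep the *last* value seen for each timestamp
# and return the index in ascending order.
function dedup_keep_last(ts::AbstractVector, vals::AbstractVector)
    d = Dict{eltype(ts),eltype(vals)}()
    for (t, v) in zip(ts, vals)
        d[t] = v            # later entries overwrite earlier ones
    end
    ks = sort!(collect(keys(d)))
    return ks, [d[k] for k in ks]
end
```

Keeping the last value corresponds to "updating" duplicates; keeping the first instead would answer the "which one do you expect to be skipped" question the other way.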

iblislin commented 4 years ago

could it also be possible to add an option to actually remove out-of-order or duplicate timestamps

@imbrem yeah, PRs are welcomed.

and/or actually go back and update their values in the result array?

Updating values still needs more discussion, and I need some time to think about it.