Non time-synced data - GitHub Issues

IanMayo commented 7 years ago

@dominikschauer identified that in the two .csv datasets there are some times when position isn't available for both vehicles.

This is a common problem with experimental data. Sometimes we will wish to compare two time series datasets that are either recorded at different frequencies, or where the data recording is sporadic.

We should investigate how Knime can use time to interpolate other recorded measurements. For example, to find the difference in speed between the two vehicles when their data is recorded at different timestamps.

It is acceptable to produce modified versions of the sample datasets to introduce an enhanced time-syncing problem. So, usa.csv could be trimmed to just twenty lines, randomly taken through the time period, and nzl.csv trimmed to only ten lines, randomly taken through the time period.
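
As a rough illustration of the trimming step, here is a Python sketch that works outside Knime. It assumes the data rows have already been read into a list in time order, with the header row excluded; the function name `sample_rows` and the fixed seed are hypothetical choices, not part of the sample datasets.

```python
import random


def sample_rows(rows, n, seed=0):
    """Pick n rows at random while keeping their original (time) order.

    `rows` is assumed to be a list of data rows without the header line.
    """
    rng = random.Random(seed)  # fixed seed so the trimmed file is reproducible
    keep = sorted(rng.sample(range(len(rows)), n))  # sorted indices preserve time order
    return [rows[i] for i in keep]
```

Calling `sample_rows(rows, 20)` on the usa.csv rows and `sample_rows(rows, 10)` on the nzl.csv rows would give two small datasets whose timestamps mostly do not line up.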

We would then use Knime to produce a graph of differences in speed, using the values in usa.csv, plus an interpolated value from nzl.csv - approximating the speed at the usa.csv timestamp.
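
To make the intended calculation concrete, here is a minimal Python sketch (not a Knime workflow). It assumes each dataset has been reduced to a time-sorted list of `(time_in_seconds, speed)` pairs; the actual column layout of usa.csv and nzl.csv is not shown in this issue, and the function names are hypothetical. Linear interpolation is used as one possible way of approximating the speed.

```python
from bisect import bisect_left


def interpolate_at(t, times, values):
    """Linearly interpolate `values` at time `t`.

    `times` must be sorted ascending. Returns None when `t` lies outside
    the recorded range, so nothing is extrapolated.
    """
    if not times or t < times[0] or t > times[-1]:
        return None
    i = bisect_left(times, t)
    if times[i] == t:  # exact timestamp match, no interpolation needed
        return values[i]
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def speed_differences(master, other):
    """Difference in speed (master minus other) at each master timestamp."""
    times = [t for t, _ in other]
    speeds = [s for _, s in other]
    result = []
    for t, speed in master:
        estimate = interpolate_at(t, times, speeds)
        if estimate is not None:
            result.append((t, speed - estimate))
    return result
```

Here `master` would hold the usa.csv values and `other` the nzl.csv values. Master timestamps outside the range covered by the other dataset are skipped in this sketch.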

IanMayo commented 7 years ago

@dominikschauer - I've had a go at updating the description. Hopefully it makes more sense now.

dmschauer commented 7 years ago

@IanMayo If I understand the issue correctly it can be solved by doing a simple random sampling out of either usa.csv or nzl.csv and taking the matching time stamps out of the other data set. The next step would be computing the average speed of the two vehicles in the intervals between the randomly taken time stamps. The final step would be creating a graph showing two line graphs (time series) of the speeds at different times. I will implement this now hoping that it is what is asked for. I also have the suggestion to use fixed intervals of the same length instead of random ones.

dmschauer commented 7 years ago

Here is a KNIME workflow that in my mind does what is requested in this issue.

KNIME_UPWORK_sampling_and_line_graphs.zip

IanMayo commented 7 years ago

@dominikschauer - I haven't encountered random sampling to solve this kind of issue before. My guess would have been to choose the most frequently sampled dataset, and use that as the "time-master", producing interpolated values for the other dataset.

In this instance I believe they're of equal frequency, so we would use either as the master.

dmschauer commented 7 years ago

@IanMayo Yes, I think the idea of random sampling was just due to poor understanding of your requirements on my part. I was thinking this is what you were describing. Otherwise I would not have thought about doing this.

So, do you want to "fill in" the speed for time stamps where one of the two data sets has no record? For example, when nzl.csv has a speed for time 08:00:00 but usa.csv does not, I would take the nearest recorded measurements for usa.csv and compute their average. Like: speed_usa(08:00:00) = (speed_usa(07:59:59) + speed_usa(08:00:01) ) / 2. Then I would use this average as the interpolated value for the missing measurement and compare it to speed_nzl(08:00:00) in a line graph.

I think it would be a bad practice of me to implement this before you gave your okay. So is this what you have in mind?
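
For reference, the averaging rule above can be sketched in Python next to a time-weighted variant. The two agree when the neighbouring measurements are equally far from the missing timestamp, as in the 07:59:59 / 08:00:01 example, and differ otherwise. Both function names are hypothetical, and times are assumed to be plain numbers (for example seconds).

```python
def neighbour_average(v_before, v_after):
    """Plain mean of the two nearest recorded values."""
    return (v_before + v_after) / 2


def time_weighted(t, t_before, v_before, t_after, v_after):
    """Linear interpolation: the nearer neighbour in time gets more weight."""
    frac = (t - t_before) / (t_after - t_before)
    return v_before + frac * (v_after - v_before)
```

With neighbours one second either side, both return the same value. If the next measurement were several seconds away, only the time-weighted version would account for the uneven spacing.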

IanMayo commented 7 years ago

@dinkoivanov - aah, now I see the random confusion. I was referring to the preparation of some smaller, temporary datasets, for use in this issue. I suggested you produce custom versions of the data-files that contain random lines from the originals. It was these data-files that would be used to test the time-synced comparison.

Yes your interpolation strategy is fine.

Oops, I've missed an answer. No, let's not produce a value at a time for which there isn't a measurement in both datasets. Let's just interpolate values for times that are present in one (the master) but not the other. Hope this is clear. :-)

dmschauer commented 7 years ago

In this workflow USA.csv serves as the "time-master" and all missing values for NZL.csv are interpolated as described above. I found this interpolation strategy doesn't work when the first or last value is missing, because there is no neighbour on one side. For these leading and trailing missing entries in NZL.csv the first non-missing and last non-missing values are used. For example "NA NA 2 3 4 NA NA NA" becomes "2 2 2 3 4 4 4 4".

KNIME_UPWORK_non_time_synced_data.zip
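
The edge handling described above can be sketched in Python as follows. This is an illustration of the rule only, not an export of the Knime workflow; `None` stands in for "NA" and the function name `fill_edges` is hypothetical.

```python
def fill_edges(values):
    """Fill leading/trailing missing entries with the nearest recorded value.

    Interior gaps are left untouched; they are assumed to be handled by the
    interpolation step.
    """
    present = [i for i, v in enumerate(values) if v is not None]
    if not present:  # nothing recorded at all, nothing to copy from
        return list(values)
    first, last = present[0], present[-1]
    filled = list(values)
    for i in range(first):  # leading gap: copy first recorded value
        filled[i] = values[first]
    for i in range(last + 1, len(values)):  # trailing gap: copy last recorded value
        filled[i] = values[last]
    return filled
```

Applied to the example, `fill_edges([None, None, 2, 3, 4, None, None, None])` gives `[2, 2, 2, 3, 4, 4, 4, 4]`.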

debrief / KnimeInvestigation

Non time-synced data #2