devoxi / lttb-py

Largest-Triangle-Three-Buckets (LTTB) downsampling algorithm in Python
MIT License

Handling None values in data #3

Open kb0304 opened 5 years ago

kb0304 commented 5 years ago

The data can contain None values, and I thought of the following approach to handle them. Looking forward to hearing your comments on it.

We ignore all the Nones at the start and the end of the data. The bucketing method stays the same as before, and the method for sampling a point from a bucket can be modified as follows:

// Pseudocode
let a_avg be the average of all the triangle areas computed so far

if (bucket is all Nones){
    return None
}

if (left bucket is all Nones && right bucket is all Nones){
    // Maybe there could be a criterion for choosing among the available non-None points?
    return first non-None element
}

if (left bucket is all Nones && right bucket is not all Nones){
    // let r_avg[x], r_avg[y] be the averages of the non-None points in the right bucket
    return the point (p[x], p[y]) that maximizes the triangle area 0.5 * |r_avg[y] - p[y]| * |r_avg[x] - p[x]|
}

if (left bucket is not all Nones && right bucket is all Nones){
    // let l_avg[x], l_avg[y] be the averages of the non-None points in the left bucket
    return the point (p[x], p[y]) that maximizes the triangle area 0.5 * |l_avg[y] - p[y]| * |l_avg[x] - p[x]|
}

Calculate the bucket averages using only the non-None values.
Compute the triangle area for each non-None point in the bucket; let p_max be the point with the maximum area and a_max be that area

// Idea: None is the most significant sample if the bucket contains enough Nones
// and the largest triangle area among the remaining points is not significant enough
if (number of Nones in the bucket > bucket_size/2 && a_max < a_avg){
    return None
}

return p_max
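
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration under a few assumptions, not code from lttb-py: points are taken to be `(x, y)` tuples in which only `y` may be None, the names `trim_nones`, `_avg` and `select_point` are hypothetical, and the caller is assumed to keep track of `a_avg`. In the general case the sketch uses the left-bucket average as the left vertex of the triangle, mirroring the one-sided cases above, whereas standard LTTB uses the previously selected point there.

```python
def trim_nones(ys):
    """Return (start, end) so that ys[start:end] has no leading or trailing Nones."""
    start, end = 0, len(ys)
    while start < end and ys[start] is None:
        start += 1
    while end > start and ys[end - 1] is None:
        end -= 1
    return start, end


def _avg(points):
    """Average (x, y) over the points whose y is not None, or None if there are none."""
    valid = [(x, y) for x, y in points if y is not None]
    if not valid:
        return None
    n = len(valid)
    return (sum(x for x, _ in valid) / n, sum(y for _, y in valid) / n)


def select_point(bucket, left, right, a_avg):
    """Pick one sample from `bucket`; returns an (x, y) tuple, or None for a gap.

    `left` and `right` are the neighbouring buckets, and `a_avg` is the
    average of the triangle areas computed so far (maintained by the caller).
    """
    valid = [(x, y) for x, y in bucket if y is not None]
    if not valid:
        return None  # bucket is all Nones

    l_avg, r_avg = _avg(left), _avg(right)
    if l_avg is None and r_avg is None:
        return valid[0]  # no neighbours to compare against

    if l_avg is None or r_avg is None:
        # One-sided case: right triangle spanned by the point and the one average.
        ax, ay = r_avg if l_avg is None else l_avg
        return max(valid, key=lambda p: 0.5 * abs(ay - p[1]) * abs(ax - p[0]))

    def area(p):
        # Area of the triangle (l_avg, p, r_avg).
        return 0.5 * abs(
            (l_avg[0] - r_avg[0]) * (p[1] - l_avg[1])
            - (l_avg[0] - p[0]) * (r_avg[1] - l_avg[1])
        )

    p_max = max(valid, key=area)
    a_max = area(p_max)
    n_none = len(bucket) - len(valid)
    if n_none > len(bucket) / 2 and a_max < a_avg:
        return None  # the gap is the most significant feature of this bucket
    return p_max
```

With this shape, a bucket such as `[(1, None), (2, None), (3, 0.1)]` between neighbours at `y = 0` yields None when `a_avg` is large compared with the small triangle at `(3, 0.1)`, and yields `(3, 0.1)` when `a_avg` is smaller than that triangle's area.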