brycefrank / pyfor

Tools for analyzing aerial point clouds of forest data.
MIT License

Filtering points #1

Closed bw4sz closed 6 years ago

bw4sz commented 6 years ago

I'm going to move our convo here so others are able to see it. There will be a few of these as I get up and running.

Use case: a tile has some outlier points I want to remove (z < 0 or z > 100). What's the best way to filter based on the cloud.las arrays? In lidR this would be analogous to lasfilter:

tile <- tile %>% lasfilter(Z < 100)
tile <- tile %>% lasfilter(Z > 0)

I can implement this if you give me a bit of a push.

We would need to subset the numpy array based on the condition. Then recalculate the cloud object to have the header update. Anything else?

brycefrank commented 6 years ago

Sounds good.

Let's assume you created a cloud object

cloud1 = cloud.Cloud("samples/sample1.las")

You can then access the raw numpy array of the points (this is read off of the laspy object into memory for now):

point_array = cloud1.las.points

This is an N x 9 numpy array (for now) where N is the number of points in cloud1.las. What we want is to extract only the rows where, say, z > 300 (or some threshold). Note that z is the 3rd column (index 2). I think there is a way to name dimensions for numpy arrays, so we can index them via 'z' or something more natural. Feel free to look into that (it would make the code cleaner and the package easier to use as well).

First we create a boolean mask, a length-N array that is True where the condition is met and False otherwise:

mask = point_array[:, 2] > 300

Now we can use that mask to subset the original point_array

filtered_array = point_array[mask]
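
To cover the original use case (dropping z < 0 and z > 100), the two conditions can be combined into a single mask with &, and the header values can then be recomputed from the filtered array. A minimal sketch, assuming the same N x 9 layout with z in column 2:

mask = (point_array[:, 2] > 0) & (point_array[:, 2] < 100)
filtered_array = point_array[mask]

# Values needed to refresh the header after filtering (laspy write path omitted)
new_min_z, new_max_z = filtered_array[:, 2].min(), filtered_array[:, 2].max()
new_point_count = filtered_array.shape[0]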

I have something similar cooking in the "boolean_mask" method in rasterizer.py; not sure if I will keep that method (more info in the comments).

To be honest, the current LAS I/O in cloud.py isn't terribly thought out and is a bit stitched together. But I think a good first step is returning a new points array with only the rows we want, and we can handle the I/O stuff later.

bw4sz commented 6 years ago

I'm pretty much in agreement. I was trying to be careful since I'm just playing around here. I jumped on the collab, but I'm staying on my fork for now. To me, the whole thing should be a pandas dataframe, as you do in the raster class. You could go the route of xarray (http://xarray.pydata.org/en/stable/), but that's probably overkill. Lidar data is multidimensional, but in terms of how it's used, thinking of it as a big table seems preferable. The question is whether choosing between pandas and numpy hurts some downstream process. I just did the filter method on my fork.
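
As a minimal sketch of the idea (hypothetical method name and signature; the actual implementation on the fork may differ):

def filter(self, dim_index, low, high):
    # Keep only points where low < value < high along the given column,
    # operating on the raw numpy array for now; header update handled elsewhere.
    points = self.las.points
    mask = (points[:, dim_index] > low) & (points[:, dim_index] < high)
    self.las.points = points[mask]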


brycefrank commented 6 years ago

I gave you collab mainly so you can contribute to the projects board (I'm not sure if collab was the only way to do it, but I figured why not?). Separate forks sound good to me as well.


I have been struggling with that decision myself. I don't see any major performance hits from pandas, and using a combination of both feels a bit awkward at times.

This was the method I mentioned earlier for naming dimensions (structured numpy arrays): https://docs.scipy.org/doc/numpy-1.14.0/user/basics.rec.html
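
For illustration, a tiny example of that approach with made-up values, where fields are addressed by name instead of column index:

import numpy as np

points = np.array([(0.0, 1.0, 2.5), (1.0, 2.0, 350.0)],
                  dtype=[('x', 'f8'), ('y', 'f8'), ('z', 'f8')])
high_points = points[points['z'] > 300]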

One reason I can think of for maintaining a numpy approach is the easy integration with numba; on the other hand, it is trivial to call .values on a pandas dataframe to convert back to numpy, and I really don't think there is a performance hit there.
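
For illustration, the round trip looks like this (a sketch with made-up column names):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 3), columns=['x', 'y', 'z'])
arr = df.values                                # plain numpy array for numba et al.
df2 = pd.DataFrame(arr, columns=df.columns)    # back to pandas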

It feels like I can only waffle on this decision for now :)

brycefrank commented 6 years ago

I played around a bit with the structured numpy array and a couple of other things. I think I am fine with going with a pure pandas approach and dipping down into numpy only when necessary.

A call to df.values seems to run in constant time, but importantly, a call like pd.DataFrame(point_array) is not constant time (I'm not sure of the exact complexity).

I think I will leave this issue open for now. I would like to filter on any given dimension (intensity, etc.). This will be more natural after we implement a pandas version of Cloud.points.
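
Once that is done, dimension-generic filtering could look something like this (a hypothetical sketch, assuming Cloud.points becomes a pandas dataframe with named columns):

def filter(self, dim, low, high):
    # Keep only points where low < points[dim] < high; dim can be any
    # column name, e.g. 'z' or 'intensity'.
    keep = (self.points[dim] > low) & (self.points[dim] < high)
    self.points = self.points[keep]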