PDAL / python

PDAL's Python Support

How to properly pass a numpy array containing XYZ points to pdal.Pipeline? #28

Closed ToddJacobus closed 5 years ago

ToddJacobus commented 5 years ago

The documentation is very clear on how to read data from files on the disk in a variety of formats, however, it's very unclear how to read data from a numpy array in memory. It looks like the readers.numpy reader is what I'm looking for, but the docs seem to imply that this reader is still accessing a file on the disk, from a .npy file. Ideally, I would like to do something like this:

arr = np.array([[x1,y1,z1], [x2,y2,z2], [x3,y3,z3]]) # really any numpy array shape/dimensions will do.

p = pdal.Pipeline(json, arr)

From the source, it looks like it is possible to include a numpy array as an argument to the Pipeline class, but I can't figure out exactly what it expects (i.e., shape, dimensions, etc.), or whether this is the intended usage.

Ultimately, I'm trying to add PDAL to an existing data pipeline, written in Python, that pulls data from a variety of sources and it would be prohibitively expensive to keep saving data to the disk, just to read it back into PDAL.

I'm still new to PDAL so please forgive any ignorance. Thanks very much for your continued development!

abellgithub commented 5 years ago

There are lots of options - probably too many. I would recommend doing one of two things:

1) Create a structured Numpy array that is a row-based description of your data. The Numpy descriptor should contain fields named X, Y and Z, as well as any other fields you want to process. PDAL will interpret each record as a point, with dimensions matching the Numpy descriptor. This allows complete control of the X, Y and Z values.

2) Create a Numpy array with a shape that has 3 Numpy dimensions -- [depth, rows, columns]. If the entries in the array are simple data, they will be mapped to the Intensity dimension in PDAL, though filters.ferry can be used to move the data to some other dimension. If the entries are elements of a structured array, the elements of each entry will be mapped to similarly named PDAL dimensions. In this case X, Y and Z will be integers, described by the depth/row/column position of the entries, creating a gridded point cloud (voxels).

These arrays can be passed to the Pipeline constructor following the JSON pipeline description.

There are other options that are essentially some hybrid of the two described above. I'd be happy to answer specific questions you might have.
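The two options above can be sketched with plain numpy. This is only an illustration of the array shapes, not PDAL's API itself, so the Pipeline call is left commented out and the pipeline JSON name is a placeholder:

```python
import numpy as np

# Option 1: a row-based structured array. Field names X, Y, Z (plus any
# extra fields) become PDAL dimensions; each record is one point.
points = np.array(
    [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
    dtype=[("X", np.float64), ("Y", np.float64), ("Z", np.float64)],
)

# Option 2: a 3-D array of simple values. PDAL maps the values to the
# Intensity dimension and derives integer X/Y/Z from the depth/row/column
# indices, producing a gridded (voxel) cloud.
grid = np.zeros((2, 4, 4), dtype=np.float64)

# Passing either array after the JSON pipeline description (commented out
# so this sketch runs without PDAL installed; pipeline_json is hypothetical):
# import pdal
# p = pdal.Pipeline(pipeline_json, arrays=[points])

assert points.dtype.names == ("X", "Y", "Z")
assert grid.ndim == 3
```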

ToddJacobus commented 5 years ago

Thanks for your quick reply!

Using your first suggestion, I'm now trying to implement the following in an isolated test environment (jupyter notebook running python 3.6):

import pdal
import numpy as np
from pathlib import Path

def dataToStructuredArray(inputDir):
    # The input file here (just for testing, in production these data will come from a database)
    # is a whitespace-delimited text file where x y z coordinates are on each line; no header.
    with Path(inputDir).open() as file:
        return np.array(
            [tuple(line.strip().split()) for line in file],
            # np.float was a deprecated alias for the builtin float and was
            # removed in NumPy 1.24; np.float64 is the explicit equivalent.
            dtype=[('x', np.float64), ('y', np.float64), ('z', np.float64)]
        )

INPUT_FILE = "path/to/xyz/text/file.xyz"

test_filter = """
{
  "pipeline": [
    {
        "type": "filters.stats",
        "dimensions": "x,y,z"
    }
  ]
}
"""

data = dataToStructuredArray(INPUT_FILE)
p = pdal.Pipeline(json = test_filter, arrays = data)
p.validate()
result = p.execute()

When I pass the structured numpy array to the Pipeline constructor, it snags on the truth evaluation of if arrays in its initialization:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

It looks like the arrays parameter expects a list, however, so I tried passing it a list of rows as tuples instead of my structured np.array:

p = pdal.Pipeline(json = test_filter, arrays = list(data))

Of course, Pipeline complains with the following:

RuntimeError: pdal::python::Array constructor object is not a numpy array

I also tried passing in my numpy array in a list:

p = pdal.Pipeline(json = test_filter, arrays = [data,])

This doesn't return any errors, and p.validate() returns True. However, p.execute() returns the integer 959, which happens to be the length of my numpy array.

It looks like the structure of my numpy array is correct, since it matches the example in the readers.numpy docs:

data[0:1]

array([(1.70087325, 4.98036483, -3490.)],
      dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8')])

Is my json pipeline definition incorrect? More specifically, do I need to explicitly add a { "type": "readers.numpy" } item?

Or am I still not passing my Numpy array to the Pipeline constructor correctly?

Thanks very much, again.

abellgithub commented 5 years ago

Receiving the number of points as your result is correct. If you want to access the metadata generated by the stats filter, there is a metadata property of the Pipeline.
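Parsing the stats out of the metadata is plain JSON handling. The string below is a hypothetical stand-in shaped like the output of filters.stats (the exact layout varies between PDAL versions, and in older Python bindings pipeline.metadata is a JSON string rather than a dict), so the same parsing would apply to p.metadata after p.execute():

```python
import json

# Hypothetical metadata, trimmed to one dimension for brevity; the real
# p.metadata from a stats pipeline has the same nesting for X, Y and Z.
metadata_json = """
{
  "metadata": {
    "filters.stats": {
      "statistic": [
        {"name": "X", "average": 1.7, "minimum": 0.2, "maximum": 3.1}
      ]
    }
  }
}
"""

stats = json.loads(metadata_json)["metadata"]["filters.stats"]["statistic"]
x_stats = next(s for s in stats if s["name"] == "X")
print(x_stats["average"])  # 1.7
```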

ToddJacobus commented 5 years ago

Excellent! Looks like I understand what's happening now. Thanks again for your help.

abellgithub commented 5 years ago

Sorry for the hassle. I'll try to add better documentation when I can.


pepijntje02 commented 3 years ago

I am not sure if more people are struggling with the laspy and pdal library combo, but I faced the same problem. I read a LAS file with laspy, then I wanted to use PDAL. For me, the solution was to put the points in a list (as described above with [data,]), but also to index ['point'] on the laspy cloud:

import laspy
import pdal

cloud_name = 'filename.las'
cloud = laspy.file.File(cloud_name)
filter = '''
Your pipeline for pdal
'''

pipeline = pdal.Pipeline(json=filter, arrays=[cloud.points['point'],])

My expectation was that it would be possible to input the whole cloud.points array into pdal, but apparently this was not the case. Nevertheless, thank you for your tips on how to make this work; PDAL is a great library.

abellgithub commented 3 years ago

Since Python doesn't have typed variables, I really have no idea what cloud is. The arrays argument expects a numpy array containing the point data, so whatever you have to do to get from a laspy cloud to a numpy array of point data is what you want.

abellgithub commented 3 years ago

You're welcome to make a PR that deals with laspy directly.

pepijntje02 commented 3 years ago

Hi,

Sorry, but the purpose of my comment was to help others who are facing the same issue as I am. It was not to suggest that the pdal library is missing something, but rather to show how to use it in combination with the laspy library. It is not even a "workaround"; in my opinion it is simply one way to use the pdal library as it is.

In my case the cloud variable has type laspy.file.File. Therefore I also provided some dummy code so it was clear how the variable is defined.

But what I did was read a *.las file using laspy.file.File and store it as cloud. Then type(cloud.points) is a numpy.ndarray. This is what it looks like:

cloud.points[0:1]
array([((180774, -68321, 430, 37180, 73, 2, -89, 0, 0, 385597.69748272, 0, 0, 0),)],
              dtype=[('point', [('X', '<i4'), ('Y', '<i4'), ('Z', '<i4'), ('intensity', '<u2'), ('flag_byte', 'u1'), ('raw_classification', 'u1'), ('scan_angle_rank', 'i1'), ('user_data', 'u1'), ('pt_src_id', '<u2'), ('gps_time', '<f8'), ('red', '<u2'), ('green', '<u2'), ('blue', '<u2')])])

First the filter variable is defined for later use, for example:

filter='''
[
  {
    "type":"filters.randomize"
  }
]'''

I wanted to use this numpy array for pdal.Pipeline:

pipeline = pdal.Pipeline(json=filter, arrays=cloud.points)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

So then I arrived at this page and saw that the arrays variable expected to be a list, so the next try was:

pipeline = pdal.Pipeline(json=filter, arrays=[cloud.points])
RuntimeError: Incompatible type for field 'point'.

So while the input was a numpy array, it still didn't work. The next thing I tried was to select the 'point' field, since it is in the numpy array (as can be seen in the dtype above):

pipeline = pdal.Pipeline(json=filter, arrays=[cloud.points['point']])
pipeline.validate() -> True

This was my "solution" about how to use a laspy in combination with pdal.
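The unwrapping step above can be reproduced with numpy alone. The toy array below mimics the nested dtype that laspy 1.x exposes as cloud.points (trimmed to three sub-fields for brevity); indexing by the 'point' field name removes the outer nesting and leaves the flat structured array that pdal.Pipeline(arrays=[...]) expects:

```python
import numpy as np

# Toy stand-in for laspy 1.x's cloud.points: a structured array whose
# single field "point" is itself a compound type.
point_dtype = np.dtype(
    [("point", [("X", "<i4"), ("Y", "<i4"), ("Z", "<i4")])]
)
cloud_points = np.array(
    [((180774, -68321, 430),), ((180775, -68322, 431),)],
    dtype=point_dtype,
)

# Selecting the field unwraps one level of nesting.
flat = cloud_points["point"]
assert flat.dtype.names == ("X", "Y", "Z")
assert int(flat["X"][0]) == 180774
```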

I am not saying that there is something wrong with either laspy or pdal; I tried to help others who run into the same issue. I think that reading data using laspy and then using pdal is a "logical" choice, so I expected this explanation might help others find their solution. If it is not the intention of this place (an issue on GitHub) to comment with parts of code to help others, please say so and I will remove these comments. I am new to commenting on GitHub (this was my first comment) and thought that it had the same purpose as a forum like Stack Overflow.

abellgithub commented 3 years ago

You should make a documentation PR.

digital-idiot commented 3 years ago

@abellgithub Can this be done iteratively, so that in each iteration a subset of points gets flushed to disk?