
Help needed reading EPT from Azure blob container with SAS token #4451

Open trns1997 opened 1 month ago

trns1997 commented 1 month ago

I was wondering if you had an example of how to read EPT data from an Azure blob container that is accessible only via a SAS token.

I have a pipeline that looks like the following (based on https://gist.github.com/hobu/ee22084e24ed7e3c0d10600798a94c31):

{
    "pipeline": [
        {
            "bounds": "([-10425171.940, -10423171.940], [5164494.710, 5166494.710])",
            "filename": "https://<AZURE_STORAGE_ACCOUNT>.blob.core.windows.net/<PATH_TO_EPT>/ept.json",
            "type": "readers.ept",
            "tag": "readdata"
        },
        {
            "filename": "test.laz",
            "inputs": [ "readdata" ],
            "tag": "writerslas",
            "type": "writers.las"
        }
    ]
}

Result:

pdal pipeline test.json --debug
(PDAL Debug) Debugging...
(pdal pipeline readers.ept Debug) PDAL: readers.ept: Could not read from <AZURE_STORAGE_ACCOUNT>.blob.core.windows.net/<PATH_TO_EPT>/ept.json

This is completely expected, as I have not provided the reader with the SAS token to access the file.

So I tried a very naive approach where I appended the SAS token to the filename directly, as shown below:

{
    "pipeline": [
        {
            "bounds": "([-10425171.940, -10423171.940], [5164494.710, 5166494.710])",
            "filename": "https://<AZURE_STORAGE_ACCOUNT>.blob.core.windows.net/<PATH_TO_EPT>/ept.json?<SAS_TOKEN>",
            "type": "readers.ept",
            "tag": "readdata"
        },
        {
            "filename": "test.laz",
            "inputs": [ "readdata" ],
            "tag": "writerslas",
            "type": "writers.las"
        }
    ]
}

Result:

pdal pipeline test.json --debug
(PDAL Debug) Debugging...
(pdal pipeline readers.ept Debug) Query bounds: ([-10425171.94, -10423171.94], [5164494.71, 5166494.71], [-1.797693134862316e+308, 1.797693134862316e+308])
Threads: 15
(pdal pipeline Debug) Executing pipeline in stream mode.
PDAL: readers.ept: Could not read from <AZURE_STORAGE_ACCOUNT>.blob.core.windows.net/<PATH_TO_EPT>/ept-hierarchy/0-0-0-0.json

I am guessing this is completely normal as well: the reader can now read ept.json but fails on the hierarchy files, so I think I need to attach the SAS token to every request the pipeline makes via the header keyword. Honestly, I do not really understand how to format the pipeline for it to work. I found a release note mentioning #3496, which says Azure SAS token support was added thanks to arbiter. If someone could help out with a comprehensive example, that would be great. And as a bonus: how would one go about using the C++ API to read EPT files from an Azure blob container with PDAL?

trns1997 commented 1 month ago

After looking into vendor/arbiter/arbiter.cpp, I found that we pass the header as a string, but I still cannot get the formatting right :'). Here is what the JSON looks like:

{
    "pipeline": [
        {
            "bounds": "([-10425171.940, -10423171.940], [5164494.710, 5166494.710])",
            "filename": "az://<AZURE_STORAGE_ACCOUNT>.blob.core.windows.net/<PATH_TO_EPT>/ept.json",
            "header":"{\"az\": \"{\"account\": \"<AZURE_STORAGE_ACCOUNT>\"}\"}",
            "type": "readers.ept",
            "tag": "readdata"
        },
        {
            "filename": "test.laz",
            "inputs": [ "readdata" ],
            "tag": "writerslas",
            "type": "writers.las"
        }
    ]
}

It does not seem to like the header that I am passing it. Any clues as to why?

On the other hand, I looked into the code and noticed that I can set the AZURE_STORAGE_ACCOUNT and AZURE_SAS_TOKEN environment variables to bypass the JSON parsing problem. After setting these I removed the header and no longer have a problem with it, but I get the following error, which has left me slightly perplexed:

pdal pipeline test.json --debug
(PDAL Debug) Debugging...
(pdal pipeline readers.ept Debug) 400: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidResourceName</Code><Message>The specifed resource name contains invalid characters.
RequestId:<ID>
Time:2024-07-19T16:37:46.6214773Z</Message></Error>
PDAL: readers.ept: Could not read from <AZURE_STORAGE_ACCOUNT>.blob.core.windows.net/<PATH_TO_EPT>/ept.json

trns1997 commented 1 month ago

Alright, I found why I was facing a problem with the header. Here is what the header should look like in the JSON so that arbiter can parse the string successfully:

{
    "pipeline": [
        {
            "bounds": "([-10425171.940, -10423171.940], [5164494.710, 5166494.710])",
            "filename": "az://<AZURE_STORAGE_ACCOUNT>.blob.core.windows.net/<PATH_TO_EPT>/ept.json",
            "header": "{\"az\": \"{\\\"account\\\": \\\"<AZURE_STORAGE_ACCOUNT>\\\", \\\"sas\\\": \\\"<SAS_TOKEN>\\\"}\"}",
            "type": "readers.ept",
            "tag": "readdata"
        },
        {
            "filename": "test.laz",
            "inputs": [ "readdata" ],
            "tag": "writerslas",
            "type": "writers.las"
        }
    ]
}
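
The escaping is needed because the header option's value is itself a JSON document, and its az entry is a second JSON document encoded as a string. Decoded once by the pipeline parser, the value reads:

{"az": "{\"account\": \"<AZURE_STORAGE_ACCOUNT>\", \"sas\": \"<SAS_TOKEN>\"}"}

and the az string decodes again to the object that arbiter actually consumes:

{"account": "<AZURE_STORAGE_ACCOUNT>", "sas": "<SAS_TOKEN>"}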

Now I am simply facing:

(pdal pipeline readers.ept Debug) 403: <?xml version="1.0" encoding="utf-8"?><Error><Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.

gui2dev commented 1 month ago

Hi @trns1997,

Don't directly put queries in the filename. Use the query option for readers.ept.

For arbiter documentation, you can either crawl the code of the different drivers or make a proposal for documenting this.

@hobu, @connormanning I know there is some arbiter documentation in entwine.io. What would be the best destination for such documentation?

trns1997 commented 1 month ago

@gui2dev I am not sure that I follow. Isn't the query option to be used when I pass an https URL as filename? In my case I chose to go via the az driver, see below:

{
    "pipeline": [
        {
            "bounds": "([-10425171.940, -10423171.940], [5164494.710, 5166494.710])",
            "filename": "az://<AZURE_STORAGE_ACCOUNT>.blob.core.windows.net/<PATH_TO_EPT>/ept.json",
            "header": "{\"az\": \"{\\\"account\\\": \\\"<AZURE_STORAGE_ACCOUNT>\\\", \\\"sas\\\": \\\"<SAS_TOKEN>\\\"}\"}",
            "type": "readers.ept",
            "tag": "readdata"
        },
        {
            "filename": "test.laz",
            "inputs": [ "readdata" ],
            "tag": "writerslas",
            "type": "writers.las"
        }
    ]
}

As far as documentation is concerned, I have suggested in https://github.com/connormanning/arbiter/issues/53 to maybe add it to arbiter directly. But I think having an example directly in PDAL is probably worth it, preventing users from searching all over for a simple read using Azure / AWS storage :).

gui2dev commented 1 month ago

When using the az scheme, you just need to specify the relative path of the ept.json file within the storage account. In your example, that would be az://<PATH_TO_EPT>/ept.json. The az configuration should not be in the header option, as it is reserved for headers that you would forward to your HTTP request. You can configure the az driver using environment variables: AZURE_STORAGE_ACCOUNT and AZURE_SAS_TOKEN if you're using SAS, or AZURE_STORAGE_ACCESS_KEY if you're using a storage account key. Another option is to use a configuration file.
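
For reference, the az driver options shown to work in the header experiment above were account and sas; a configuration file would presumably carry the same keys, along the lines of the sketch below. The nested-object shape and the file discovery mechanism are assumptions on my part, so check vendor/arbiter/arbiter.cpp for the specifics:

{
    "az": {
        "account": "<AZURE_STORAGE_ACCOUNT>",
        "sas": "<SAS_TOKEN>"
    }
}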

trns1997 commented 1 month ago

@gui2dev omg, this is exactly it. Error on my side: I thought I had to specify blob.core.windows.net, but it is not necessary. This is perfect. @hobu @connormanning following up regarding the documentation, we're open to writing a small explanation for future users on how to use the az driver.

trns1997 commented 1 month ago

Connecting using the arbiter Azure driver with a SAS token:

  1. Export the following 2 environment variables: AZURE_STORAGE_ACCOUNT and AZURE_SAS_TOKEN
  2. Copy the following content to a test_az_driver.json file, modifying the necessary content:
    {
    "pipeline": [
        {
            "bounds": "([xmin, xmax], [ymin, ymax])",
            "filename": "az://<PATH_TO_EPT>/ept.json",
            "type": "readers.ept",
            "tag": "readdata"
        },
        {
            "filename": "test.laz",
            "inputs": [ "readdata" ],
            "tag": "writerslas",
            "type": "writers.las"
        }
    ]
    }

    Note: PATH_TO_EPT is simply the relative path within the storage account; no need to specify blob.core.windows.net or the Azure storage account.

  3. pdal pipeline test_az_driver.json --debug

Connecting using an https link, the query option, and a SAS token:

  1. To understand how to decode your SAS token, I found the following link very helpful: https://medium.com/@anandchandrasekaran1996/sas-tokens-decoded-how-to-ensure-data-security-in-azure-blob-storage-efbcfef32f3f
  2. Example SAS token: sp=r&st=2024-03-03T17:30:06Z&se=2024-03-04T01:30:06Z&sip=86.21.251.79&sv=2022-11-02&sr=b&sig=dQkX7R%2BXHrQLP9qiNdS0zMhYNpmQwLW0D86UUrEgGao%3D. Each element of the query is separated by &; everything that precedes = is the key and whatever follows = is the associated value. Therefore, in our case the query will be the following:
    "query":{
                "sp": "r",
                "st": "2024-03-03T17:30:06Z",
                        .
                        .
                        .
                "sig": "dQkX7R%2BXHrQLP9qiNdS0zMhYNpmQwLW0D86UUrEgGao%3D"
            },
  3. Copy the following content to a test_query.json file, modifying the necessary content:
    {
    "pipeline": [
        {
            "bounds": "([xmin, xmax], [ymin, ymax])",
            "filename": "https://<AZURE_STORAGE_ACCOUNT>.blob.core.windows.net/<PATH_TO_EPT>/ept.json",
            "query":{
                "sp": "val1",
                "st": "val2",
                        .
                        .
                        .
                "sig": "valx"
            },
            "type": "readers.ept",
            "tag": "readdata"
        },
        {
            "filename": "test.laz",
            "inputs": [ "readdata" ],
            "tag": "writerslas",
            "type": "writers.las"
        }
    ]
    }
  4. pdal pipeline test_query.json --debug
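
Bonus, the original C++ API question: below is a minimal sketch using PDAL's PipelineManager, assuming the same AZURE_STORAGE_ACCOUNT / AZURE_SAS_TOKEN environment configuration and the test_az_driver.json pipeline from above (file names are illustrative, not from this thread):

#include <pdal/PipelineManager.hpp>

#include <iostream>

int main()
{
    // arbiter's az driver reads AZURE_STORAGE_ACCOUNT and AZURE_SAS_TOKEN
    // from the environment, exactly as in step 1 above.
    pdal::PipelineManager mgr;

    // Load the same JSON pipeline used with `pdal pipeline`.
    mgr.readPipeline("test_az_driver.json");

    // Run readers.ept -> writers.las; returns the number of points processed.
    pdal::point_count_t count = mgr.execute();
    std::cout << "Wrote " << count << " points" << std::endl;
    return 0;
}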

Credits @gui2dev

trns1997 commented 3 weeks ago

@gui2dev I have a question which is more or less along the lines of this issue. readers.ept has a query key which allows us to pass the SAS token with our requests to read EPT data. Looking at https://pdal.io/en/latest/stages/readers.las.html, I noticed that readers.las does not have this functionality, which means that I cannot provide "filename": "https://<AZURE_STORAGE_ACCOUNT>.blob.core.windows.net/<PATH_TO_LAS>.las" to read LAS data from a remote server. Is there a particular reason why? Is it just functionality that needs to be developed? Or is the policy to fetch the LAS file from the remote server before reading it?

gui2dev commented 3 weeks ago

I don't know your use case precisely, but using readers.las with a remote file will just download its content to a temporary file and then read it. Only one connection. I guess you could try to put your query in the filename just to check whether it's supported. But the best approach is to use az:// as stated before. readers.ept manages a pool of connections that fetch only the needed content.
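
For example, a readers.las pipeline going through the az driver could look like the following sketch (same environment variable configuration as above; <PATH_TO_LAS> and the output name are placeholders, not from this thread):

{
    "pipeline": [
        {
            "filename": "az://<PATH_TO_LAS>.las",
            "type": "readers.las",
            "tag": "readdata"
        },
        {
            "filename": "local_copy.laz",
            "inputs": [ "readdata" ],
            "type": "writers.las"
        }
    ]
}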