Azure / spark-cdm-connector

MIT License

Overriding from configPath - sample problem #62

Closed (kcris closed 3 years ago)

kcris commented 3 years ago

Hi, I tried to execute sample [5] (Overriding from configPath) available here: https://github.com/Azure/spark-cdm-connector/blob/master/samples/SparkCDMsamplePython.ipynb

These lines have some issues:

  .option("entityDefinitionModelRoot", "Models")   # fetches config.json from this location and finds definition of "core" alias, if configPath option is not present
  .option("configPath" "/config")  # Add your config.json to override the above definition
  .option("entityDefinitionStorage", "<storage1>.dfs.core.windows.net") // entityDefinitionModelRoot contains in this storage account

  1. Based on previous examples, I think .option("entityDefinitionModelRoot", "Models") should really be .option("entityDefinitionModelRoot", container+"Models")

  2. There is a missing comma here: .option("configPath" "/config")

  3. A Python comment should start with #, not //, here: // entityDefinitionModelRoot contains in this storage account
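
For anyone following along, the three fixes applied together might look like the sketch below. It only builds the option dictionary, so it runs without Spark. The `build_cdm_options` helper and the container name `mycontainer` are hypothetical, the `/` separator before `Models` is an assumption, and `<storage1>` is left as the placeholder used in the notebook. The `configPath` value is kept exactly as in the sample.

```python
def build_cdm_options(container, storage="<storage1>.dfs.core.windows.net"):
    """Return the three sample options with the reported typos fixed.

    `container` is a hypothetical container name; the option keys are the
    ones used in the sample notebook.
    """
    return {
        # Fix 1: prefix the model root with the container (separator assumed).
        "entityDefinitionModelRoot": container + "/Models",
        # Fix 2: key and value are now separated by a comma.
        "configPath": "/config",
        # Fix 3: this comment uses '#', the Python comment marker, not '//'.
        "entityDefinitionStorage": storage,
    }


opts = build_cdm_options("mycontainer")
for key, value in opts.items():
    print(f'.option("{key}", "{value}")')
```

With PySpark, a dictionary like this can be passed to the writer in one call via `DataFrameWriter.options(**opts)` instead of chaining `.option(...)` calls.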

All these are minor issues. Once fixed, the main problem was that:

  1. Config.json was not found inside /config. It is not clear whether /config is relative to entityDefinitionModelRoot; if it is, Config.json was present there when I tested and yet it was not found, so I was unable to run this sample. I tried both Config.json (as named in the error messages) and config.json (as named in the provided sample).

Please take a look

srichetar commented 3 years ago

Hi, thanks for letting us know about the minor typos in the Python sample notebook. There is a config.json in https://github.com/Azure/spark-cdm-connector/tree/master/samples/Contacts. Please read https://github.com/Azure/spark-cdm-connector/blob/master/documentation/overview.md#explicit-write-options. configPath should be an absolute path, e.g. /<containername>/<foldername_where_config.json_resides>
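
As a minimal sketch of the absolute-path form described above, the helper below joins a container name and a folder into `/<containername>/<folder>` with exactly one leading slash. `absolute_config_path` and the example names are hypothetical and not part of the connector.

```python
import posixpath


def absolute_config_path(container, folder):
    """Build '/<container>/<folder>' with one leading slash and no trailing slash."""
    # Drop empty pieces and stray slashes so the join cannot double them up.
    parts = [p.strip("/") for p in (container, folder) if p.strip("/")]
    return "/" + posixpath.join(*parts) if parts else "/"


print(absolute_config_path("mycontainer", "Models/Contacts"))
# /mycontainer/Models/Contacts
```

The result can then be used as the value of the `configPath` option.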

kcris commented 3 years ago

Thanks! I think I have one more problem though: there is a missing dependency: CustomerCategory.cdm.json schema is not part of the samples/Contacts folder (it is referenced by _salesimports.cdm.json)

bissont commented 3 years ago

Can you try again? I'm not sure how we missed that.

kcris commented 3 years ago

Thanks!

My next issue is that TrackedEntity is not found. Note that I copied the whole contents of samples/Contacts to my datalake container, which includes a TrackedEntity.cdm.json.

the error I get:

: java.util.concurrent.ExecutionException: java.lang.Exception: PersistenceLayer | Could not read '/TrackedEntity.cdm.json' from the 'core' namespace. Reason 'com.microsoft.commondatamodel.objectmodel.storage.StorageAdapterException: Could not read ADLS content at path: /TrackedEntity.cdm.json' | loadDocumentFromPathAsync

I see this inside _salesimports.cdm.json:

    "corpusPath": "core:/TrackedEntity.cdm.json"

and config.json is the original one in this repo

{
  "defaultNamespace" : "adls",
  "adapters" : [
    {
      "type" : "adls",
      "config" : {
        "hostname" : "srichetastorage.dfs.core.windows.net",
        "root" : "/outputsubmanifest/example-public-standards",
        "tenant" : "72f988bf-86f1-41af-91ab-2d7cd011db47",
        "clientId" : "6c3f525f-bdcb-4677-bed6-24f0b43add13",
        "timeout" : 5000,
        "maximumTimeout" : 20000,
        "numberOfRetries" : 2
      },
      "namespace" : "core"
    }
  ]
}

So I am not sure what the problem is.

My datalake container's contents:

[image: datalake container contents]

the python code I used:

storageAccountName = "<mystorage>.dfs.core.windows.net"
container = "wwi-02"
outputContainer = "wwi-02"

(customerdf.write.format("com.microsoft.cdm")
  .option("storage", storageAccountName)
  .option("manifestPath", outputContainer + "/test/cdm/customer/default.manifest.cdm.json")
  .option("entity", "TestEntity")
  .option("entityDefinitionModelRoot", container + "/test/cdm/Models")   # fetches Config.json from this location and finds definition of "core" alias, if configPath option is not present
  .option("entityDefinitionPath", "/Contacts/Customer.cdm.json/Customer")  # Customer.cdm.json has an alias - "core"
  .option("configPath", container + "/test/cdm/Models/Contacts")  # Add your Config.json to override the above definition
  .option("entityDefinitionStorage", storageAccountName) # entityDefinitionModelRoot contains in this storage account
  .option("format", "parquet")
  .save())

The core alias inside config.json points to srichetastorage.dfs.core.windows.net/outputsubmanifest/example-public-standards, so I guess that, when the config is overridden, that is where TrackedEntity.cdm.json is being looked up.

Is this the source of the problem?
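
To make that guess concrete, here is a small sketch that mimics how a `namespace:/path` corpus path could map onto an adapter's `hostname` and `root` from config.json. It is an illustration only, not the CDM object model's actual resolution code; `resolve_corpus_path` is a hypothetical helper, and the config is trimmed to the fields the lookup needs.

```python
import json

CONFIG_JSON = """
{
  "defaultNamespace": "adls",
  "adapters": [
    {
      "type": "adls",
      "config": {
        "hostname": "srichetastorage.dfs.core.windows.net",
        "root": "/outputsubmanifest/example-public-standards"
      },
      "namespace": "core"
    }
  ]
}
"""


def resolve_corpus_path(corpus_path, config):
    """Map 'namespace:/path' to 'hostname' + 'root' + 'path' via the adapter list.

    Hypothetical helper: it only mirrors the lookup described in this thread.
    """
    namespace, sep, path = corpus_path.partition(":")
    if not sep:
        # No 'namespace:' prefix, so fall back to the default namespace.
        namespace, path = config["defaultNamespace"], corpus_path
    for adapter in config["adapters"]:
        if adapter["namespace"] == namespace:
            cfg = adapter["config"]
            return cfg["hostname"] + cfg["root"].rstrip("/") + path
    raise KeyError(f"no adapter registered for namespace '{namespace}'")


config = json.loads(CONFIG_JSON)
print(resolve_corpus_path("core:/TrackedEntity.cdm.json", config))
```

Under these assumptions the path resolves to a location under the `core` adapter's storage account and root, not to the local copy in the caller's own container.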

As a note: there is a local copy of TrackedEntity.cdm.json too.

Thank you!

srichetar commented 3 years ago

You need to change the location to suit your setup, i.e. point it at the location where TrackedEntity is placed:

"config" : {
        "hostname" : "srichetastorage.dfs.core.windows.net",
        "root" : "/outputsubmanifest/example-public-standards",
        "tenant" : "72f988bf-86f1-41af-91ab-2d7cd011db47",
        "clientId" : "6c3f525f-bdcb-4677-bed6-24f0b43add13",
        "timeout" : 5000,
        "maximumTimeout" : 20000,
        "numberOfRetries" : 2
      },

The TrackedEntity.cdm.json in turn has

    "imports": [
        {
            "corpusPath": "cdm:/foundations.cdm.json"
        }
    ],

You need the CDM foundation files to get this working. The sample file just tells you how to use the options.
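
One way to spot this kind of dependency before running a write is to list the namespaces a definition file imports and compare them with the adapters in config.json. The sketch below does that on in-memory dictionaries; `missing_namespaces` is a hypothetical helper. It does not know about namespaces that may be resolved by other means, so treat the result as a list of things to check, not as a definitive error report.

```python
def missing_namespaces(document, config):
    """Return namespaces used in `imports` that no adapter in `config` defines."""
    defined = {adapter["namespace"] for adapter in config["adapters"]}
    used = set()
    for entry in document.get("imports", []):
        namespace, sep, _ = entry["corpusPath"].partition(":")
        if sep:  # paths without a 'namespace:' prefix are skipped
            used.add(namespace)
    return sorted(used - defined)


# The imports section quoted above, and a config that only defines 'core'.
tracked_entity = {"imports": [{"corpusPath": "cdm:/foundations.cdm.json"}]}
config = {"adapters": [{"namespace": "core", "config": {}}]}

print(missing_namespaces(tracked_entity, config))
```

Here the only imported namespace is `cdm`, which the example config does not define, so it is the one reported.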

kcris commented 3 years ago

got it, thanks a lot