Closed maroony closed 1 week ago
Hi @maroony, this sounds like a dependency version mismatch (e.g. you see `NoClassDefFoundError`, `NoSuchMethodError`, or similar). Please check out the Troubleshoot dependency version conflict article first. If it doesn't provide a solution for the problem, please provide:
- the verbose dependency tree (`mvn dependency:tree -Dverbose`)
Thanks!
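For reference, the dependency plugin can also narrow the tree to the artifacts involved in a conflict; a possible invocation (the `com.nimbusds` filter is chosen here as an example for this particular clash) might be:

```shell
# Print the full tree, including omitted duplicate/conflicting versions
mvn dependency:tree -Dverbose

# Optionally restrict the output to the Nimbus artifacts involved in this conflict
mvn dependency:tree -Dverbose -Dincludes=com.nimbusds
```

In verbose mode, lines marked "omitted for conflict" show which transitive version Maven discarded and which dependency pulled it in.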
Hi @maroony. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
@joshfree I updated my initial post with the dependency tree. I already read the troubleshooting guide and checked some things afterwards.
I think the root cause of the problem is the usage of

```xml
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>adal4j</artifactId>
    <version>1.6.5</version>
</dependency>
```

in the SDK

```xml
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-batch</artifactId>
    <version>11.0.0</version>
</dependency>
```

The `adal4j` artifact is no longer maintained and should be replaced by

```xml
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>msal4j</artifactId>
    <version>1.13.8</version>
</dependency>
```
@maroony
It appears you're trying to use the track 1 azure-batch together with the track 2 azure-resourcemanager and azure-identity. They bring different sets of dependencies for the OAuth2 SDK into the application, resulting in a conflict: azure-identity uses msal4j, while azure-batch uses adal4j.
Potential mitigation routes can be:
Looking at your use case, option 2 might be the best possible solution. If you need assistance with following route 2, let me know; I can try to help.
@g2vinay Thanks for your response! I cannot drop the libraries azure-resourcemanager and azure-identity, because I need them to create different objects before using Azure Batch.
I could try proposals 2 and 3, but this is not what I expected as a user of the SDKs. Even if shading works right now, there is no assurance that it will keep working in the future. We need something reliable, because this program will run important recurring workloads in a production environment.
The Azure Batch client SDK is using the deprecated library `adal4j`. Will this library be replaced in the future? Is there a roadmap or something?
@maroony
Thank you for the feedback. The ideal solution to this would be to have a T2 azure-batch sdk that works with azure-identity SDK (msal4j underneath). I am looping in Azure Batch SDK owners @jingjlii @ljiaqi1998 @JJJessieWang @dpwatrous @NickKouds to comment on the roadmap for that.
On my end, I can provide full assistance to help you get unblocked with options 2 or 3. You can first try out option 3, if that doesn't work out, then I can help you create a shaded version of azure-batch sdk that is reliable for your production environment.
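For reference, a shaded azure-batch (the shading route mentioned above) would typically relocate the conflicting Nimbus packages. A minimal sketch of such a shade configuration, assuming a separate wrapper module is built around azure-batch (the relocation pattern and plugin version here are illustrative, not an official setup), could look like:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- Move the old Nimbus classes that adal4j needs into a private
               namespace, so they cannot clash with the newer version that
               msal4j (via azure-identity) brings in -->
          <relocation>
            <pattern>com.nimbusds</pattern>
            <shadedPattern>shaded.com.nimbusds</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The application would then depend on the shaded wrapper artifact instead of azure-batch directly, so both Nimbus versions can coexist on the classpath.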
@g2vinay
At the moment, I'm trying option 3, pinning `com.nimbusds:oauth2-oidc-sdk` to version 6.5. It seems to work up to a certain point, but I'm not able to start up a node with mounted blob storage data:
```java
storageAccount = azureResourceManager.storageAccounts()
    .define(...)
    .withRegion(...)
    .withExistingResourceGroup(...)
    .withBlockBlobStorageAccountKind()
    .withSku(StorageAccountSkuType.PREMIUM_LRS)
    .withAccessFromNetworkSubnet(...)
    .withAccessFromIpAddressRange(...)
    .create();

blobContainer = azureResourceManager.storageBlobContainers()
    .defineContainer(...)
    .withExistingStorageAccount(storageAccount)
    .withPublicAccess(PublicAccess.NONE)
    .create();

NetworkConfiguration networkConfiguration = new NetworkConfiguration()
    .withSubnetId(...);

AzureBlobFileSystemConfiguration azureBlobFileSystemConfiguration = new AzureBlobFileSystemConfiguration()
    .withAccountName(...)
    .withAccountKey(...)
    .withContainerName(...)
    .withRelativeMountPath("foo");

MountConfiguration mountConfiguration = new MountConfiguration()
    .withAzureBlobFileSystemConfiguration(azureBlobFileSystemConfiguration);

PoolAddParameter poolAddParameter = new PoolAddParameter()
    .withId(...)
    .withVmSize(...)
    .withNetworkConfiguration(networkConfiguration)
    .withTaskSlotsPerNode(...)
    .withVirtualMachineConfiguration(virtualMachineConfiguration)
    .withMountConfiguration(List.of(mountConfiguration))
    .withEnableAutoScale(true)
    .withAutoScaleFormula(autoScaleFormula)
    .withAutoScaleEvaluationInterval(new Period(0, 5, 0, 0));

batchClient.poolOperations().createPool(poolAddParameter);
```
With this approach, the node is stuck in the "starting" state forever, with no error log or anything. When I remove the `networkConfiguration`, the node starts, but my data is not mounted because of an access error, as expected. So what's going on here? When I check the pool's mount configuration in the Azure portal, the column "Account key" is empty. I don't know if this could be the problem. As you can see, I configured the account key.
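For reference, the version pin described as option 3 above would presumably be done via Maven's dependencyManagement; a minimal sketch, assuming the pin lives in the application's own pom.xml, could be:

```xml
<dependencyManagement>
  <dependencies>
    <!-- Force the older oauth2-oidc-sdk that adal4j (via azure-batch) expects,
         overriding the newer transitive version pulled in by msal4j -->
    <dependency>
      <groupId>com.nimbusds</groupId>
      <artifactId>oauth2-oidc-sdk</artifactId>
      <version>6.5</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Note that this pins a single version for the whole application, so the libraries that expect the newer version may still break at runtime, which is exactly the fragility discussed above.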
For mgmt, you may want to enable logging, and then check the log to see whether the JSON request/response is as expected.
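For the track 2 management client, enabling logging means configuring the HTTP log detail level when building the manager. A minimal sketch (the `credential` and `profile` variables are placeholders for the existing authentication setup) could be:

```java
// Sketch: enable wire-level HTTP logging on the track 2 resource manager,
// so that the request/response JSON shows up in the client log.
AzureResourceManager azureResourceManager = AzureResourceManager
    .configure()
    .withLogLevel(HttpLogDetailLevel.BODY_AND_HEADERS)
    .authenticate(credential, profile) // credential/profile as in your existing setup
    .withDefaultSubscription();
```

With `BODY_AND_HEADERS`, the serialized request body sent to the service can be compared against the payload a known-good application produces.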
This is not helpful for my problem, because the pool is stuck in the starting state, while my other objects (storage account, container registry) were created as expected.
Sorry, I see. So the code hangs on the line `batchClient.poolOperations().createPool(poolAddParameter)`, which is the track 1 batch data-plane code.
To be precise, my Java program is stuck in the state "waiting for tasks to be finished". So the line `batchClient.poolOperations().createPool(poolAddParameter)` is executed. After that, I'm adding a job with tasks. And yes, all of this uses the batch client SDK (track 1 code).
However, my program will never finish, because it never gets a single usable node to run the tasks.
What I am thinking is: the `batchClient` basically sends and receives REST API calls. If your code is executed and control is returned to your app, it means the REST API call completed successfully, hence the backend should have received and acknowledged your "create pool" request.
Logging from the client should help you find out whether there is a problem with the JSON of the request (it can be compared with your working, fully track 1 app). If there is a problem, it may be caused by Jackson (the library that serializes the JSON; track 1 and track 2 depend on different versions of it) when preparing your REST call. If there is no problem in the JSON, the client is likely fine.
However, I currently couldn't find where to enable logging in the track 1 batch client (sorry, I'm not familiar with that client).
One alternative is to call `client.poolOperations().getPool(...)` (once you are done with the configuration, but before waiting for the job/tasks to start) to see whether you've created the pool, and whether the data in the response is as expected.
Or you can use the Portal or the Azure CLI to inspect the pool (if applicable).
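With the Azure CLI, inspecting the pool and its nodes might look like the following (this assumes a CLI logged into the Batch account; the pool id `mypool` is a placeholder):

```shell
# Show the pool definition the service actually stored
az batch pool show --pool-id mypool

# List the compute nodes and their states (e.g. stuck in "starting")
az batch node list --pool-id mypool --query "[].{id:id,state:state}"
```

Comparing the stored pool definition against what the code intended to send can reveal whether a field (such as the mount configuration's account key) was dropped on the way.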
Looping in @NickKouds, who may be more familiar with the batch client: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/batch/microsoft-azure-batch
Actually, this is exactly what I do after creating the pool:

```java
pool = batchClient.poolOperations().getPool(...);
```

Because of this, and because I also checked the pool in the portal, the pool creation itself is working fine so far. I checked the pool configuration in the portal and can see that autoscaling is working and so on. In the portal I can also see that a created node never leaves the starting state.
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @mksuni @bgklein @mscurrell @cRui861 @paterasMSFT @gingi @dpwatrous.
That is strange. It seems the client has done everything correctly, if the portal and the client API both give good results, yet the node does not start.
I am looping in the backend as well.
Clarification for backend engineers: the current problem is about the batch data plane.
A short summary: In the meantime, I found the root cause of this problem. In order to mount the blob storage on the compute node, software for this operation is necessary. Because this software is not part of the VM image ubuntu-server-container, the batch service tries to download it onto the node. However, my nodes aren't allowed to connect to the internet, so the download failed. @ Karl Tietze (MS Support): Thanks for looking into this!
It takes a few hours for the batch service to recognise this and throw an error! I think this should be improved in the future. Because I need the storage mount, I'll build a custom image for the VMs that includes the software.
Closing this, as it sounds like you have found a workaround, and the root cause of the dependency issue (`adal4j` and the mixture of microsoft- and azure- packages) is understood.
Describe the bug
I want to create a pool with a network configuration. Because of this, I have to authenticate through AAD. When I try to create the batch pool I get a
`NoClassDefFoundError: com/nimbusds/oauth2/sdk/http/CommonContentTypes`

Exception or Stack Trace

Code Snippet

Setup: