Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License
2.3k stars 1.96k forks source link

[BUG] Multiple createPathFile internal operations causing uploadFIle api to convert finite size file to zero bytes file in ADLS gen2 #40235

Open fivetran-arunsuri opened 3 months ago

fivetran-arunsuri commented 3 months ago

Describe the bug Following code is used to upload the files to Azure Data lake storage container(retrier retries the upload in case of failure). It internally involves, CreatePathFile, Append File, FlushFile rest API operations. We have enabled the diagnostic logs on the container and in some cases we saw following operations are performed in following order

image

There are multiple createPathFile operations being performed here for the same file. It should not happen ideally because if just createPathFile will be triggered once the file has data, it will turn it into zero bytes file.

Also, we noticed the LastModified time in logs for append file operation always shows Monday, 01-Jan-01 00:00:00 GMT, which looks incorrect to me. Can you please help in this case?

Exception or Stack Trace

To Reproduce

Code Snippet DataLakeDirectoryClient directoryClient = fileSystemClient.getDirectoryClient(adlsFolderPath); RETRIER.get( () -> { DataLakeFileClient fileClient = directoryClient.createFile(fileToUpload.getName(), true); fileClient.uploadFromFile(fileToUpload.getPath(), true); return fileClient.getFileName(); }, MAX_ATTEMPTS);

Expected behavior It should just upload the file with finite size

Setup (please complete the following information): "com.azure:azure-storage-file-datalake:12.18.1", "com.azure:azure-storage-common:12.24.1",

github-actions[bot] commented 3 months ago

@ibrahimrabab @ibrandes @seanmcc-msft

github-actions[bot] commented 3 months ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

alzimmermsft commented 3 months ago

Hi @fivetran-arunsuri , thanks for reporting this issue. Taking a quick look at the sample provided:

DataLakeDirectoryClient directoryClient = fileSystemClient.getDirectoryClient(adlsFolderPath);
RETRIER.get(() -> {
    DataLakeFileClient fileClient = directoryClient.createFile(fileToUpload.getName(), true);
    fileClient.uploadFromFile(fileToUpload.getPath(), true);
    return fileClient.getFileName();
}, MAX_ATTEMPTS);

Calling both

directoryClient.createFile(fileToUpload.getName(), true); and fileClient.uploadFromFile(fileToUpload.getPath(), true);

will result in a CreateFilePath REST operation as DataLakeDirectoryClient.createFile and DataLakeFileClient.uploadFromFile will create a DataLake file. Changing to DataLakeDirectoryClient.getFileClient will remove one of the first two CreateFilePath REST operations you're seeing.

Still looking into why there is a CreateFilePath REST operation after uploading and why the LastModifed value is Monday, 01-Jan-01 00:00:00 GMT. Is there any code after this which is making additional DataLake calls?

github-actions[bot] commented 3 months ago

Hi @fivetran-arunsuri. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

fivetran-arunsuri commented 3 months ago

hey alzimmermsft I understood the fix suggested by you to remove one of the 2 createPathFile operation but I am interested to know answers of why there is a CreateFilePath REST operation after uploading and why the LastModifed value is Monday, 01-Jan-01 00:00:00 GMT. No We are not making upload related calls after this

fivetran-arunsuri commented 3 months ago

alzimmermsft any update on this?

alzimmermsft commented 3 months ago

hey alzimmermsft I understood the fix suggested by you to remove one of the 2 createPathFile operation but I am interested to know answers of why there is a CreateFilePath REST operation after uploading and why the LastModifed value is Monday, 01-Jan-01 00:00:00 GMT. No We are not making upload related calls after this

Still haven't found a root cause on this as the uploadFromFile code flow calls create before running all necessary append operations to upload the file and concludes with the flush to finalize the uploaded data.

And does this happen every time or randomly during the application run?

fivetran-arunsuri commented 3 months ago

It does not happen everytime. It is an intermittent issue but it always occurs at times

alzimmermsft commented 3 months ago

If possible, could you produce HTTP logs when the scenario is hit?

https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/core/azure-core#http-request-and-response-logging

DataLakeServiceClient serviceClient = new DataLakeServiceClientBuilder()
  // add credential information here
  // endpoint here
  .httpLogOptions(DataLakeServiceClientBuilder.getDefaultHttpLogOptions().setLogLevel(HttpLogDetailLevel.HEADERS)) 
  .buildClient();

DataLakeDirectoryClient directoryClient = serviceClient.getFileSystemClient(fileSystemName).getDirectoryClient(adlsFolderPath);

// Retrier code here

This will help find if the empty file creation correlates with a retry, which would be a bug in this case.

Still can't reproduce the Last-Modified time issue as I've tried creating empty files, files that are created with a single append, and files created with multiple appends before flushing and still haven't hit this case yet. FYI @seanmcc-msft if you know anything about this edge case problem.

fivetran-arunsuri commented 3 months ago

alzimmermsft where will these logs will be produced in this case? I mean if we create the service client like this, should this be part of client logs

fivetran-arunsuri commented 2 months ago

alzimmermsft I am not able to reproduce the logs locally while running a dummy application. Can you tell me how we can repro this. Do we need a separate logging framework for same?

alzimmermsft commented 2 months ago

Hi @fivetran-arunsuri, to enable logging you don't explicitly need a separate logging framework as the SDK will use a fallback to System.out if there isn't an SLF4J implementation on the classpath, but for configurability to a log file I'd recommend using slf4j-simple and specifying a log file.

<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-simple</artifactId>
  <version>1.7.36</version>
</dependency>

Run with something like

java -Dorg.slf4j.simpleLogger.defaultLogLevel=DEBUG -Dorg.slf4j.simpleLogger.logFile=logs.txt -jar application.jar

(Replacing logs.txt with whatever you want as the output file)

To enable HTTP request and response logging in DataLake all you'll need to change is either passing -DAZURE_HTTP_LOG_DETAIL_LEVEL=headers when running the Java JAR or change code with what I shared in my last comment.

DataLakeServiceClient serviceClient = new DataLakeServiceClientBuilder()
  // add credential information here
  // endpoint here
  .httpLogOptions(DataLakeServiceClientBuilder.getDefaultHttpLogOptions().setLogLevel(HttpLogDetailLevel.HEADERS)) 
  .buildClient();
fivetran-arunsuri commented 2 months ago

alzimmermsft, We have enabled the logging on client side. We will update with the results, Seems like logging was already enabled and we were able to see file getting uploaded correctly I would like to followup on my previous questions on how could we see the createPathFile operation after the file is uploaded and why does appendFIle operation always show the Last Modified time as Monday, 01-Jan-01 00:00:00 GMT

alzimmermsft commented 1 month ago

Hi @fivetran-arunsuri, for scenarios when the storage diagnostics have the extra createPathFile could you share the client side logs (check to make sure everything is properly redacted before sharing) so we can validate whether this is an improper client-side or server-side behavior. From there we should be able to determine why this odd behavior is cropping up. This is needed as I did a review of the code paths called client-side and we shouldn't see this happening but there is some multi-threading going on which makes this a bit harder to pinpoint.

FYI @seanmcc-msft if we need to pull in service folks.

fivetran-arunsuri commented 1 month ago

Hi alzimmermsft Can you please tell me what additional work these client side logs will do. We already enabled the diagnostic logs from customer's account and all these logs are from same client ips. We also enabled some logs to check the uploaded file size, it showed finite response after uploads. I would like to know the benefit of these logs in our case. I shared the code snippet which we are using right now(attached in description), after fileClient.uploadFromFile , we are not doing anything. Also if retries could be the issue with this api, then all 3 operations could have been triggered as client are not controlling the API's like createPathFile