Open fivetran-arunsuri opened 6 months ago
@ibrahimrabab @ibrandes @seanmcc-msft
Thank you for your feedback. Tagging and routing to the team member best able to assist.
Hi @fivetran-arunsuri, thanks for reporting this issue. Taking a quick look at the sample provided:

```java
DataLakeDirectoryClient directoryClient = fileSystemClient.getDirectoryClient(adlsFolderPath);
RETRIER.get(() -> {
    DataLakeFileClient fileClient = directoryClient.createFile(fileToUpload.getName(), true);
    fileClient.uploadFromFile(fileToUpload.getPath(), true);
    return fileClient.getFileName();
}, MAX_ATTEMPTS);
```
Calling both `directoryClient.createFile(fileToUpload.getName(), true)` and `fileClient.uploadFromFile(fileToUpload.getPath(), true)` will each result in a `CreateFilePath` REST operation, as both `DataLakeDirectoryClient.createFile` and `DataLakeFileClient.uploadFromFile` will create a DataLake file. Changing to `DataLakeDirectoryClient.getFileClient` will remove one of the first two `CreateFilePath` REST operations you're seeing.
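The suggested change might look like the sketch below. This assumes the same surrounding `RETRIER`/`fileSystemClient` setup from the original snippet (not reproduced here); `getFileClient` only constructs a client reference without issuing a REST call, so the one remaining `CreateFilePath` comes from `uploadFromFile` itself.

```java
import com.azure.storage.file.datalake.DataLakeDirectoryClient;
import com.azure.storage.file.datalake.DataLakeFileClient;

import java.io.File;

// Sketch of the suggested fix: getFileClient issues no REST call, so the
// single CreateFilePath you should see comes from uploadFromFile's internal create.
final class UploadFix {
    static String upload(DataLakeDirectoryClient directoryClient, File fileToUpload) {
        DataLakeFileClient fileClient = directoryClient.getFileClient(fileToUpload.getName());
        fileClient.uploadFromFile(fileToUpload.getPath(), true); // overwrite = true
        return fileClient.getFileName();
    }
}
```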
Still looking into why there is a `CreateFilePath` REST operation after uploading and why the `LastModified` value is `Monday, 01-Jan-01 00:00:00 GMT`. Is there any code after this which is making additional DataLake calls?
Hi @fivetran-arunsuri. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
Hey @alzimmermsft, I understood the fix you suggested to remove one of the two `createPathFile` operations, but I am interested in knowing why there is a `CreateFilePath` REST operation after uploading and why the `LastModified` value is `Monday, 01-Jan-01 00:00:00 GMT`. No, we are not making any upload-related calls after this.
@alzimmermsft, any update on this?
Still haven't found a root cause on this, as the `uploadFromFile` code flow calls `create` before running all necessary `append` operations to upload the file and concludes with the `flush` to finalize the uploaded data.
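For reference, that flow is roughly equivalent to this explicit single-append sequence (a sketch only; the real implementation splits large files across multiple `append` calls, and the `ExplicitUpload` wrapper is illustrative, not part of the SDK):

```java
import com.azure.storage.file.datalake.DataLakeFileClient;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Rough equivalent of uploadFromFile for a file small enough for one append.
final class ExplicitUpload {
    static void upload(DataLakeFileClient fileClient, String localPath) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(localPath));
        fileClient.create(true);                                           // CreateFilePath (overwrite)
        fileClient.append(new ByteArrayInputStream(data), 0, data.length); // AppendFile: stage the data
        fileClient.flush(data.length, true);                               // FlushFile: commit staged data
    }
}
```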
And does this happen every time or randomly during the application run?
It does not happen every time. It is an intermittent issue, but it does keep recurring.
If possible, could you produce HTTP logs when the scenario is hit?

```java
DataLakeServiceClient serviceClient = new DataLakeServiceClientBuilder()
    // add credential information here
    // endpoint here
    .httpLogOptions(DataLakeServiceClientBuilder.getDefaultHttpLogOptions().setLogLevel(HttpLogDetailLevel.HEADERS))
    .buildClient();

DataLakeDirectoryClient directoryClient = serviceClient.getFileSystemClient(fileSystemName).getDirectoryClient(adlsFolderPath);

// Retrier code here
```

This will help find whether the empty file creation correlates with a retry, which would be a bug in this case.
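Since the `RETRIER` helper isn't shown in the issue, here is a hypothetical minimal retrier that logs each attempt number; correlating those attempt logs with the HTTP logs by timestamp would show whether the extra `CreateFilePath` lines up with a retry. All names here are illustrative, not from the SDK or the reporter's code.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the user's RETRIER: retries a supplier up to
// maxAttempts times, logging each attempt so retries can be matched
// against HTTP logs by timestamp.
final class LoggingRetrier {
    static <T> T get(Supplier<T> action, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            System.out.println("upload attempt " + attempt + " at " + System.currentTimeMillis());
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and retry
            }
        }
        if (last == null) {
            throw new IllegalStateException("maxAttempts must be positive");
        }
        throw last; // all attempts failed
    }
}
```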
Still can't reproduce the Last-Modified time issue. I've tried creating empty files, files created with a single append, and files created with multiple appends before flushing, and still haven't hit this case yet. FYI @seanmcc-msft if you know anything about this edge case problem.
@alzimmermsft, where will these logs be produced in this case? I mean, if we create the service client like this, should they be part of the client logs?
@alzimmermsft, I am not able to reproduce the logs locally while running a dummy application. Can you tell me how we can repro this? Do we need a separate logging framework for this?
Hi @fivetran-arunsuri, to enable logging you don't explicitly need a separate logging framework, as the SDK will fall back to `System.out` if there isn't an SLF4J implementation on the classpath, but for configurability to a log file I'd recommend using `slf4j-simple` and specifying a log file.
```xml
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.36</version>
</dependency>
```
Run with something like:

```shell
java -Dorg.slf4j.simpleLogger.defaultLogLevel=DEBUG -Dorg.slf4j.simpleLogger.logFile=logs.txt -jar application.jar
```

(replacing `logs.txt` with whatever you want as the output file).
To enable HTTP request and response logging in DataLake, all you'll need to change is either passing `-DAZURE_HTTP_LOG_DETAIL_LEVEL=headers` when running the Java JAR or updating the code as I shared in my last comment.
```java
DataLakeServiceClient serviceClient = new DataLakeServiceClientBuilder()
    // add credential information here
    // endpoint here
    .httpLogOptions(DataLakeServiceClientBuilder.getDefaultHttpLogOptions().setLogLevel(HttpLogDetailLevel.HEADERS))
    .buildClient();
```
@alzimmermsft, we have enabled the logging on the client side and will update with the results. It seems logging was already enabled, and we were able to see the file getting uploaded correctly. I would like to follow up on my previous questions: how could we see the `createPathFile` operation after the file is uploaded, and why does the append file operation always show the Last-Modified time as `Monday, 01-Jan-01 00:00:00 GMT`?
Hi @fivetran-arunsuri, for scenarios where the storage diagnostics show the extra `createPathFile`, could you share the client-side logs (check to make sure everything is properly redacted before sharing) so we can validate whether this is improper client-side or server-side behavior? From there we should be able to determine why this odd behavior is cropping up. This is needed because I did a review of the code paths called client-side and we shouldn't see this happening, but there is some multi-threading going on which makes it a bit harder to pinpoint.

FYI @seanmcc-msft if we need to pull in service folks.
Hi @alzimmermsft, can you please tell me what additional value these client-side logs will add? We already enabled the diagnostic logs on the customer's account, and all of these logs are from the same client IPs. We also enabled some logs to check the uploaded file size, and it showed a finite size after uploads. I would like to know the benefit of these logs in our case.

I shared the code snippet which we are using right now (attached in the description); after `fileClient.uploadFromFile`, we are not doing anything. Also, if retries could be the issue with this API, then all three operations could have been triggered, as clients do not control APIs like `createPathFile`.
**Describe the bug**
The following code is used to upload files to an Azure Data Lake Storage container (the retrier retries the upload in case of failure). Internally this involves the `CreatePathFile`, `AppendFile`, and `FlushFile` REST API operations. We have enabled the diagnostic logs on the container, and in some cases we saw the following operations performed in the following order:
There are multiple `createPathFile` operations being performed here for the same file. Ideally this should not happen, because if `createPathFile` is triggered again once the file has data, it will turn it into a zero-byte file.
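As a local-filesystem analogy (not the DataLake API) of why a second create-with-overwrite empties a file that already has data:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Writing data and then re-creating the file with truncation leaves it empty,
// mirroring how a stray create-with-overwrite after an upload would produce
// a zero-byte file. Returns {size before, size after}.
final class TruncateDemo {
    static long[] run() {
        try {
            Path p = Files.createTempFile("upload-demo", ".bin");
            Files.write(p, new byte[]{1, 2, 3});
            long before = Files.size(p); // 3 bytes after the "upload"
            // Re-create the file in place, as a duplicate create would.
            Files.newOutputStream(p, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING).close();
            long after = Files.size(p); // 0 bytes
            Files.deleteIfExists(p);
            return new long[]{before, after};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```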
Also, we noticed the LastModified time in the logs for the append file operation always shows `Monday, 01-Jan-01 00:00:00 GMT`, which looks incorrect to me. Can you please help in this case?

**Exception or Stack Trace**
**To Reproduce**
**Code Snippet**

```java
DataLakeDirectoryClient directoryClient = fileSystemClient.getDirectoryClient(adlsFolderPath);
RETRIER.get(() -> {
    DataLakeFileClient fileClient = directoryClient.createFile(fileToUpload.getName(), true);
    fileClient.uploadFromFile(fileToUpload.getPath(), true);
    return fileClient.getFileName();
}, MAX_ATTEMPTS);
```
**Expected behavior**
It should just upload the file with a finite size.
**Setup (please complete the following information):**
- `com.azure:azure-storage-file-datalake:12.18.1`
- `com.azure:azure-storage-common:12.24.1`