secfree opened this issue 1 year ago
could you please paste the master's DEBUG log? That would be more clear.
Let me share my trace results.

1. alluxio fs ls triggers HdfsUnderFileSystem.getStatus, which produces the following log:
2022-09-26 11:56:38,310 INFO RetryInvocationHandler - org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:108)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2076)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1422)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3001)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1228)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:894)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
...
, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over xxxx:8020 after 1 failover attempts. Trying to failover after sleeping for 880ms.
2. HdfsUnderFileSystem.getStatus throws the RemoteException after completing the failover attempts. The call stack of HdfsUnderFileSystem.getStatus is:
at alluxio.underfs.hdfs.HdfsUnderFileSystem.getStatus()
at alluxio.underfs.BaseUnderFileSystem.getFingerprint(BaseUnderFileSystem.java:104)
at alluxio.underfs.UnderFileSystemWithLogging$25.call(UnderFileSystemWithLogging.java:568)
at alluxio.underfs.UnderFileSystemWithLogging$25.call(UnderFileSystemWithLogging.java:565)
at alluxio.underfs.UnderFileSystemWithLogging.call(UnderFileSystemWithLogging.java:1212)
at alluxio.underfs.UnderFileSystemWithLogging.getFingerprint(UnderFileSystemWithLogging.java:565)
at alluxio.master.file.InodeSyncStream.lambda$syncExistingInodeMetadata$2(InodeSyncStream.java:599)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
3. BaseUnderFileSystem.getFingerprint catches the exception and returns Constants.INVALID_UFS_FINGERPRINT (https://github.com/Alluxio/alluxio/blob/master/core/common/src/main/java/alluxio/underfs/BaseUnderFileSystem.java#L103).
4. InodeSyncStream.syncExistingInodeMetadata deletes the inode of the file because of Constants.INVALID_UFS_FINGERPRINT (https://github.com/Alluxio/alluxio/blob/master/core/server/master/src/main/java/alluxio/master/file/InodeSyncStream.java#L776).
5. DefaultFileSystemMaster.listStatus calls checkLoadMetadataOptions, which throws FileDoesNotExistException (https://github.com/Alluxio/alluxio/blob/master/core/server/master/src/main/java/alluxio/master/file/DefaultFileSystemMaster.java#L1334).
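The catch-and-delete behavior traced above (getFingerprint swallowing the exception, then the metadata sync deleting the inode) can be sketched as follows. This is a simplified, hypothetical reconstruction for illustration only: INVALID_UFS_FINGERPRINT mirrors Alluxio's Constants.INVALID_UFS_FINGERPRINT, but UfsStatusFetcher, FingerprintSketch, and the helper methods are invented names, not Alluxio's actual API.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.ConnectException;

// Simplified sketch of the traced behavior; names other than
// INVALID_UFS_FINGERPRINT are hypothetical stand-ins.
public class FingerprintSketch {
  public static final String INVALID_UFS_FINGERPRINT = "INVALID";

  public interface UfsStatusFetcher {
    Object getStatus(String path) throws IOException;
  }

  // getFingerprint catches *any* IOException -- including a ConnectException
  // or StandbyException from an unreachable HDFS -- and returns the same
  // invalid fingerprint it would return for a genuinely missing file.
  public static String getFingerprint(UfsStatusFetcher ufs, String path) {
    try {
      ufs.getStatus(path);
      return "fingerprint-of-" + path; // the real code hashes the UFS status
    } catch (IOException e) {
      return INVALID_UFS_FINGERPRINT;
    }
  }

  // The sync path cannot tell "file deleted in UFS" apart from
  // "UFS unreachable", so it deletes the Alluxio inode in both cases.
  public static boolean shouldDeleteInode(String ufsFingerprint) {
    return INVALID_UFS_FINGERPRINT.equals(ufsFingerprint);
  }

  public static void main(String[] args) {
    UfsStatusFetcher unreachable = p -> { throw new ConnectException("Connection refused"); };
    UfsStatusFetcher missing = p -> { throw new FileNotFoundException(p); };

    // Both failure modes collapse to the same invalid fingerprint,
    // so both lead to inode deletion -- this is the reported bug.
    System.out.println(shouldDeleteInode(getFingerprint(unreachable, "/f"))); // true
    System.out.println(shouldDeleteInode(getFingerprint(missing, "/f")));     // true
  }
}
```

The key point of the sketch is that the connectivity failure and the missing-file case become indistinguishable once the exception is swallowed.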
Tested it again, and here is the debug log:
2022-09-27 14:27:47,567 DEBUG UnderFileSystemWithLogging - Enter: GetFingerprint(path=hdfs://dev2/user/zhaoqun.deng/220927/file.03)
2022-09-27 14:27:47,572 INFO RetryInvocationHandler - org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:108)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2088)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1428)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3051)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1232)
...
, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over xxxx:8020. Trying to failover immediately.
2022-09-27 14:27:48,784 DEBUG BaseUnderFileSystem - Failed fingerprint. path: hdfs://dev2/user/zhaoqun.deng/220927/file.03 error: java.net.ConnectException: Call From xxxx to xxxx:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2022-09-27 14:27:48,784 DEBUG UnderFileSystemWithLogging - Exit (OK): GetFingerprint(path=hdfs://dev2/user/zhaoqun.deng/220927/file.03) in 1217 ms
2022-09-27 14:27:48,792 WARN FileSystemMasterClientServiceHandler - Exit (Error): ListStatus: request=path: "/dev2_mnt/zhaoqun.deng/220927/file.03"
options {
loadMetadataType: ONCE
commonOptions {
syncIntervalMs: 0
ttl: -1
ttlAction: DELETE
}
recursive: false
loadMetadataOnly: false
}
, Error=alluxio.exception.FileDoesNotExistException: Path "/dev2_mnt/zhaoqun.deng/220927/file.03" does not exist.
Thanks a lot for the very detailed information. I am wondering if there is a way to surface the situation (no permission) in step 4 (https://github.com/Alluxio/alluxio/issues/16236#issuecomment-1257891644)? In that case, this information could be passed back to the upper layer explicitly, and the upper layer could handle it more cleanly.
Yes, I am working on this approach and will raise a PR if it works well.
Tested and confirmed that, since InodeSyncStream.syncExistingInodeMetadata may delete the inode of the file/directory because of Constants.INVALID_UFS_FINGERPRINT, if a directory has not-yet-persisted files under it, those files may be lost.
I think this situation should be fixed. What is your fix logic?
The fix is #16245: it separates FileNotFoundException from general IOExceptions (for example, ConnectException and StandbyException), and does not delete the inode from Alluxio unless the exception is a FileNotFoundException.
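The distinction described above could look roughly like the following sketch. This is a hypothetical illustration of the fix direction, not the actual code in #16245; SyncAction and onUfsError are invented names.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.ConnectException;

// Hypothetical sketch: treat FileNotFoundException as "file really gone"
// and any other IOException as "UFS unavailable, keep the inode".
public class FixSketch {
  public enum SyncAction { DELETE_INODE, KEEP_INODE }

  public static SyncAction onUfsError(IOException e) {
    if (e instanceof FileNotFoundException) {
      // The UFS answered and confirmed the file is gone: safe to delete.
      return SyncAction.DELETE_INODE;
    }
    // ConnectException, StandbyException, etc.: the UFS state is unknown,
    // so deleting the inode could drop not-yet-persisted data.
    return SyncAction.KEEP_INODE;
  }

  public static void main(String[] args) {
    System.out.println(onUfsError(new FileNotFoundException("/f"))); // DELETE_INODE
    System.out.println(onUfsError(new ConnectException("refused"))); // KEEP_INODE
  }
}
```

The design trade-off is discussed below: keeping the inode risks serving stale metadata while the UFS is unreachable, but avoids silently dropping data that only exists in Alluxio.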
The solution may not be that simple; the following dimensions should be considered: metadata (up_to_date/stale), UFS (reachable/not_reachable), sync (no_sync/sync), persistence (persisted/to_be_persisted). Are all the combinations covered? A simple example is (stale, UFS not reachable, sync): what would be returned, and how should the data in the Alluxio layer be handled?
If the UFS is reachable, the logic is not affected by the PR. If the UFS is not reachable, that is where the underlying philosophies differ. One side effect of the PR is that the client may get stale data if the underlying data has already been updated or deleted while the UFS is unreachable. I think that is better than deleting the data in Alluxio directly.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
Alluxio Version: 2.7.1
Describe the bug
The file exists in Alluxio, but the client gets FileDoesNotExistException from the Alluxio master when the Alluxio master cannot access the UFS.
To Reproduce
Access the path in Alluxio again and get the FileDoesNotExistException.
Expected behavior
Alluxio returns the file's existing status instead of throwing FileDoesNotExistException.
Urgency
Urgent, as this may lead the user to think there is data loss (the file appears to be missing).
Are you planning to fix it
Yes