Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0
6.86k stars 2.94k forks source link

File LOST,not found in ceph and allxuio #14542

Open lilyzhoupeijie opened 2 years ago

lilyzhoupeijie commented 2 years ago

Alluxio Version: alluxio 2.6.1 async through

Describe the bug alluxio fs stat shows: some blockInfo with no locations BlockInfo{id=828677029907, length=16777216, locations=[]} alluxio fs du -s -h /de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz File Size In Alluxio Path 2020.58MB 1959.97MB (96%) /de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz

and only 900M files in ceph

log we can see: 2021-11-11 04:50:55,248 INFO AUDIT_LOG (AsyncUserAccessAuditLogWriter.java:run) - succeeded=true allowed=true ugi=root,root (AUTH=SIMPLE) ip=/172.29.166.123:55204 cmd=getFileInfo src=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz dst=null perm=lvshaohai_write:owner:rw-r--r-- executionTimeUs=3946

2021-11-16 10:04:11,800 WARN DefaultFileSystemMaster - Exception encountered when trying to retrieve the status of a persist job (id=1636131958121) for file /de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz (id=828693807103): alluxio.exception.status.NotFoundException: The job of id 1,636,131,958,121 does not exist.

2021-11-16 10:04:11,800 WARN DefaultFileSystemMaster - Exception encountered when trying to retrieve the status of a persist job (id=1636131958121) for file /de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz (id=828693807103): alluxio.exception.status.NotFoundException: The job of id 1,636,131,958,121 does not exist.

2021-11-16 10:06:27,770 INFO AUDIT_LOG (AsyncUserAccessAuditLogWriter.java:run) - succeeded=true allowed=true ugi=root,root (AUTH=SIMPLE) ip=/172.29.255.240:58226 cmd=getFileInfo src=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz dst=null perm=lvshaohai_write:owner:rw-r--r-- executionTimeUs=3084

2021-11-16 10:06:27,766 INFO PlanCoordinator - Starting job Id=1637056731602 Config=PersistConfig{filePath=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, mountId=2189049575140975422, overwrite=false, ufsPath=s3://de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz}

2021-11-16 10:40:18,253 INFO PlanCoordinator - Starting job Id=1637058387510 Config=PersistConfig{filePath=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, mountId=2189049575140975422, overwrite=false, ufsPath=s3://de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz}

2021-11-16 10:40:26,652 INFO PlanCoordinator - Starting job Id=1637058402695 Config=ReplicateConfig{blockId=828677029962, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 10:40:26,653 INFO PlanCoordinator - Starting job Id=1637058402696 Config=ReplicateConfig{blockId=828677029963, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 10:40:26,653 INFO PlanCoordinator - Starting job Id=1637058402697 Config=ReplicateConfig{blockId=828677029961, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 10:40:26,654 INFO PlanCoordinator - Starting job Id=1637058402698 Config=ReplicateConfig{blockId=828677029964, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 10:40:26,654 INFO PlanCoordinator - Starting job Id=1637058402699 Config=ReplicateConfig{blockId=828677029907, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 10:40:26,654 INFO PlanCoordinator - Starting job Id=1637058402700 Config=ReplicateConfig{blockId=828677029910, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 10:40:26,655 INFO PlanCoordinator - Starting job Id=1637058402701 Config=ReplicateConfig{blockId=828677029908, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 10:40:26,656 INFO PlanCoordinator - Starting job Id=1637058402702 Config=ReplicateConfig{blockId=828677029909, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:46:40,164 INFO PlanCoordinator - Job completed Id=1637058402699 Config=ReplicateConfig{blockId=828677029907, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:46:43,832 INFO PlanCoordinator - Job completed Id=1637058402695 Config=ReplicateConfig{blockId=828677029962, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:47:00,023 INFO PlanCoordinator - Job completed Id=1637058402697 Config=ReplicateConfig{blockId=828677029961, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:47:04,639 INFO PlanCoordinator - Job completed Id=1637058402698 Config=ReplicateConfig{blockId=828677029964, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:47:06,023 INFO PlanCoordinator - Job completed Id=1637058402701 Config=ReplicateConfig{blockId=828677029908, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:47:16,138 INFO PlanCoordinator - Job completed Id=1637058402696 Config=ReplicateConfig{blockId=828677029963, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:47:41,968 INFO PlanCoordinator - Job completed Id=1637058402700 Config=ReplicateConfig{blockId=828677029910, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:48:19,439 INFO PlanCoordinator - Job completed Id=1637058402702 Config=ReplicateConfig{blockId=828677029909, path=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:50:40,139 INFO PlanCoordinator - Job failed Id=1637058387510 Config=PersistConfig{filePath=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, mountId=2189049575140975422, overwrite=false, ufsPath=s3://de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz} Error=io exception (Zero Copy GrpcDataReader)

2021-11-16 11:50:43,773 INFO PlanCoordinator - Starting job Id=1637058451462 Config=PersistConfig{filePath=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, mountId=2189049575140975422, overwrite=false, ufsPath=s3://de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz}

2021-11-16 11:56:58,935 INFO PlanCoordinator - Job failed Id=1637058451462 Config=PersistConfig{filePath=/de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz, mountId=2189049575140975422, overwrite=false, ufsPath=s3://de2-tf-sample-geek/de2/geek/2021-11-06/geek_48.tfrecord.gz} Error=Timeout waiting for response after 300000ms. clientClosed: false clientCancelled: false serverClosed: false (Zero Copy GrpcDataReader)

lilyzhoupeijie commented 2 years ago

two problems found: 1)I have pin the file,why there is EvictJob? 2)the EvictJob happens too frequently,however the removeBlock function have no lock,so i think the block was deleted abnormally

+-----------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ k8s_pod message +-----------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ de2-tf-sample-geek-master-0 2021-11-16 11:47:08,039 INFO PlanCoordinator - Starting job Id=1637058451169 Config=EvictConfig{blockId=828677029907, replicas=1}
de2-tf-sample-geek-master-0 2021-11-16 11:48:10,029 INFO PlanCoordinator - Starting job Id=1637058451280 Config=EvictConfig{blockId=828677029907, replicas=1}
de2-tf-sample-geek-master-0 2021-11-16 11:50:09,074 INFO PlanCoordinator - Starting job Id=1637058451444 Config=EvictConfig{blockId=828677029907, replicas=1}

+-----------------------------+-------------------------------------------------------------------------------------------------------------------------------------+

lilyzhoupeijie commented 2 years ago

found that one job worker was oomKilled at 2021-11-16T11:50:33Z

lilyzhoupeijie commented 2 years ago
mysql> select k8s_pod,message from k8s_tcloud.log_arsenal where k8s_pod like 'de2-tf-sample-geek-worker%' and message like '%1637058451280%' ; +--------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+ k8s_pod message +--------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+ de2-tf-sample-geek-worker-6b6d69f96d-2qljs 2021-11-16 11:51:44,083 INFO TaskExecutorManager - Task 0 for job 1637058451280 failed: NOT_FOUND: BlockMeta not found for blockId 828677029907

+--------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (1.65 sec) Read 110449121 rows, 22.05 GiB in 1.643392543 sec., 67207996 rows/sec., 13.42 GiB/sec.

mysql> select k8s_pod,message from k8s_tcloud.log_arsenal where k8s_pod like 'de2-tf-sample-geek-worker%' and message like '%1637058451444%' ; +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ k8s_pod message +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ de2-tf-sample-geek-worker-6b6d69f96d-nn25t 2021-11-16 11:50:09,462 INFO CommandHandlingExecutor - Received run task 0 for job 1637058451444 on worker 1637058375703

+--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ 1 row in set (2.25 sec) Read 110474645 rows, 22.05 GiB in 2.252720428 sec., 49040548 rows/sec., 9.79 GiB/sec.

mysql> select k8s_pod,message from k8s_tcloud.log_arsenal where k8s_pod like 'de2-tf-sample-geek-worker%' and message like '%1637058451169%' ; Empty set (1.67 sec) Read 110492935 rows, 22.05 GiB in 1.668606572 sec., 66218686 rows/sec., 13.21 GiB/sec.

lilyzhoupeijie commented 2 years ago
图片 图片
lilyzhoupeijie commented 2 years ago

I guess that: 1)firstly,file write takes a long time,and the ReplicationChecker found that uncompleted file,it stat a replica job.(I have correct the uncompleted file not to start replica job,but it still on test,not online) 2)the replica job arrived the job worker,and right that moment ,the file truned to be completed,the files finish writting,and the block has two replicas. 3)then the ReplicationChecker found the block has two replicas(it should be one),so it stat a evict job to delete the redundant block. 4)when delete the redundant block,the job worker process was killed because of oom,the evict job was started for serval times,and finallly causing the two replicas all deleted,the data lost.

yuzhu commented 2 years ago

The root cause here is evict operation and potentially other job service operations are not idempotent. We should change the semantics from evict(3 copies) to evictDownTo(2 copies), similarly for replicate jobs.

@jja725 do you mind taking a look at this? we will also need to change the call site of evict and replicate jobs too.

lilyzhoupeijie commented 2 years ago

file completed at 2021-11-09 17:38:23,402 INFO AUDIT_LOG (AsyncUserAccessAuditLogWriter.java:run) - succeeded=true allowed=true ugi=lvshaohai_write,owner (AUTH=SIMPLE) ip= cmd=completeFile src=//de2/geek/2021-11-06/geek_48.tfrecord.gz dst=null perm=lvshaohai_write:owner:rw-r--r-- executionTimeUs=1378 but it start replica at 11-16 2021-11-16 10:40:26,654 INFO PlanCoordinator - Starting job Id=1637058402699 Config=ReplicateConfig{blockId=828677029907, path=//de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:46:40,164 INFO PlanCoordinator - Job completed Id=1637058402699 Config=ReplicateConfig{blockId=828677029907, path=/***/de2/geek/2021-11-06/geek_48.tfrecord.gz, replicas=1}

2021-11-16 11:47:08,039 INFO PlanCoordinator - Starting job Id=1637058451169 Config=EvictConfig{blockId=828677029907, replicas=1}

2021-11-16 11:48:10,029 INFO PlanCoordinator - Starting job Id=1637058451280 Config=EvictConfig{blockId=828677029907, replicas=1}

2021-11-16 11:49:08,518 INFO PlanCoordinator - Starting job Id=1637058451355 Config=EvictConfig{blockId=828677029907, replicas=1}

2021-11-16 11:50:09,074 INFO PlanCoordinator - Starting job Id=1637058451444 Config=EvictConfig{blockId=828677029907, replicas=1}

2021-11-16 11:50:10,462 INFO PlanCoordinator - Job completed Id=1637058451444 Config=EvictConfig{blockId=828677029907, replicas=1} not know what else will touch a replica job ?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.