Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License

blobxfer broken Azure File download #255

Closed veonua closed 5 years ago

veonua commented 5 years ago

Problem Description

I have 140 files to OCR. 57 of them appear not to be fully downloaded,
and OCR fails on them with:

Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
Error in readHeaderMemJp2k: image parameters not found
Error in pixReadStreamJp2k: failed to read the header
Error in pixReadStream: jp2: no pix returned
Error in pixRead: pix not read
Error during processing.
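A cheap way to spot such files before handing them to Tesseract is to check the JPEG markers. The following is a minimal sketch, assuming the inputs are ordinary JPEG files that begin with the start-of-image marker (FF D8) and end with the end-of-image marker (FF D9). The helper name looks_like_complete_jpeg is hypothetical and is not part of Batch Shipyard or blobxfer.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_like_complete_jpeg(path):
    """Return True if the file starts with SOI and ends with EOI.

    Hypothetical helper: a quick sanity check, not a full decoder.
    A valid JPEG with trailing bytes after EOI would be reported as
    incomplete, so treat a False result as "inspect this file".
    """
    data = Path(path).read_bytes()
    return (
        len(data) >= 4
        and data.startswith(JPEG_SOI)
        and data.endswith(JPEG_EOI)
    )
```

Running this over the task working directory before OCR would separate files that were cut short from files that Tesseract rejects for other reasons.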

Batch Shipyard Version

3.6.1

Expected Results

All files are downloaded in full.

Actual Results

Some files are only partially copied.

Redacted Configuration

jobs

job_specifications:
- id: ocr
  tasks:  
  - docker_image: tesseractshadow/tesseract4re

    task_factory:
        file:
          azure_storage:
            storage_account_settings: mystorageaccount
            remote_path: test
            is_file_share: true
            include:
            - '*_1.jpe'

          task_filepath: file_name

    command: /bin/bash -c "install -Dv /dev/null {file_path} | tesseract {file_name} {file_path}"
    output_data:
      azure_storage:
      - storage_account_settings: mystorageaccount
        remote_path: test
        local_path: $AZ_BATCH_TASK_WORKING_DIR/
        is_file_share: true
        include:
        - "*.txt"

  merge_task:
    docker_image: python:3.7-alpine3.7
    input_data:
      azure_storage:
      - storage_account_settings: mystorageaccount
        remote_path: test
        is_file_share: true
        blobxfer_extra_options: '--strip-components 2'
    command: /bin/sh -c "cat ./*/*/*/*.txt > results.txt"
    output_data:
      azure_storage:
      - storage_account_settings: mystorageaccount
        remote_path: output/results
        is_file_share: true
        local_path: $AZ_BATCH_TASK_WORKING_DIR/results.txt

config

batch_shipyard:
  storage_account_settings: mystorageaccount 
global_resources:
  docker_images:
  - tesseractshadow/tesseract4re
  - python:3.7-alpine3.7

pool

pool_specification:
  id: poolf3234
  virtual_network:
    arm_subnet_id: /subscriptions/82a0c17e-006b-470a-967d-f5f4096fe264/resourceGroups/rdtestenv-rg/providers/Microsoft.Network/virtualNetworks/rdtestenv-vnet3/subnets/labvm-subnet3

  vm_configuration:
    platform_image:
      offer: UbuntuServer
      publisher: Canonical
      sku: 18.04-LTS

  vm_count:
    dedicated: 1
    low_priority: 0
  vm_size: STANDARD_D1_V2
  ssh:
    username: shipyard

Additional Logs

stdout

2018-12-23 21:21:02.354 INFO - 
============================================
         Azure blobxfer parameters
============================================
         blobxfer version: 1.5.5
                 platform: Linux-4.15.0-1035-azure-x86_64-with
               components: CPython=3.6.6-64bit azstor.blob=1.4.0 azstor.file=1.4.0 crypt=2.4.1 req=2.20.1
       transfer direction: Azure -> local
                  workers: disk=4 xfer=3 md5=0 crypto=0
                 log file: None
                  dry run: False
              resume file: None
                  timeout: connect=10 read=200 max_retries=1000
                     mode: StorageModes.File
                  skip on: fs_match=False lmt_ge=False md5=False
        delete extraneous: False
                overwrite: True
                recursive: True
            rename single: True
         chunk size bytes: 0
         strip components: 0
         compute file md5: False
       restore properties: attr=False lmt=False
          rsa private key: None
        local destination: /mnt/batch/tasks/workitems/ocrjcdssadww/job-1/task-00000/wd/4690158_1.jpe
============================================
2018-12-23 21:21:02.357 INFO - blobxfer start time: 2018-12-23 21:21:02.357239+00:00
2018-12-23 21:21:02.388 DEBUG - dest is_dir=False for 1 specs
2018-12-23 21:21:02.389 INFO - downloading blobs/files to local path: /mnt/batch/tasks/workitems/ocrjcdssadww/job-1/task-00000/wd/4690158_1.jpe
2018-12-23 21:21:02.389 DEBUG - spawning 3 transfer threads
2018-12-23 21:21:02.415 DEBUG - spawning 4 disk threads
2018-12-23 21:21:02.628 INFO - MD5: SKIPPED, test/DN/invoices/998/4670117_1.tif None <L..R> None
2018-12-23 21:21:02.696 INFO - MD5: SKIPPED, test/DN/invoices/998/4670117_2.tif None <L..R> None
2018-12-23 21:21:02.779 DEBUG - 0 files 0.0000 MiB filesize and/or lmt_ge skipped
2018-12-23 21:21:02.780 DEBUG - 21 remote files processed, waiting for download completion of approx. 0.6656 MiB
2018-12-23 21:21:02.850 ERROR - exceptions encountered while downloading
2018-12-23 21:21:02.850 ERROR - PosixPath('/mnt/batch/tasks/workitems/ocrjcdssadww/job-1/task-00000/wd/4690158_1.jpe')
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/blobxfer-1.5.5-py3.6.egg/blobxfer/operations/download.py", line 873, in start
    self._run()
  File "/usr/lib/python3.6/site-packages/blobxfer-1.5.5-py3.6.egg/blobxfer/operations/download.py", line 833, in _run
    raise self._exceptions[0]
  File "/usr/lib/python3.6/site-packages/blobxfer-1.5.5-py3.6.egg/blobxfer/operations/download.py", line 494, in _worker_thread_transfer
    self._process_download_descriptor(dd)
  File "/usr/lib/python3.6/site-packages/blobxfer-1.5.5-py3.6.egg/blobxfer/operations/download.py", line 584, in _process_download_descriptor
    self._transfer_cc[dd.final_path] -= 1
KeyError: PosixPath('/mnt/batch/tasks/workitems/ocrjcdssadww/job-1/task-00000/wd/4690158_1.jpe')

Additional Comments

The original file is a 102.2 kB JPEG; the downloaded copy is only 37.2 kB and appears to contain the contents of some buffer.
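The size mismatch described here can be checked for all files at once. The following is a minimal sketch, assuming a mapping of file name to expected size in bytes is already available (for example from a listing of the share; how that listing is obtained is outside this sketch). The function name find_truncated and its parameters are hypothetical.

```python
from pathlib import Path


def find_truncated(expected_sizes, local_dir):
    """Return (name, expected, actual) for files whose local size differs.

    expected_sizes: dict mapping file name -> expected size in bytes
                    (assumed input, e.g. taken from a share listing).
    local_dir:      directory the files were downloaded into.
    A missing local file is reported with actual=None.
    """
    mismatches = []
    for name, expected in sorted(expected_sizes.items()):
        local = Path(local_dir) / name
        actual = local.stat().st_size if local.is_file() else None
        if actual != expected:
            mismatches.append((name, expected, actual))
    return mismatches
```

A file that is 37.2 kB locally but 102.2 kB on the share would appear in the returned list with both sizes, which makes it easy to count how many of the 140 files are affected.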

veonua commented 5 years ago

cp /mnt/batch/tasks/mounts/azfile-storage-test/{file_path} image produces a good file, whereas blobxfer copies garbage.

cp $AZ_BATCH_NODE_SHARED_DIR/test/{file_path} image fails with "file not found" when using this config:

batch_shipyard:
  storage_account_settings: mystorageaccount 
global_resources:
  docker_images:
  - tesseractshadow/tesseract4re
  - python:3.7-alpine3.7
  volumes:
    shared_data_volumes:
      azurefile_vol:
        volume_driver: azurefile
        storage_account_settings: mystorageaccount
        azure_file_share_name: test
        container_path: $AZ_BATCH_NODE_SHARED_DIR/test
        mount_options:
        - file_mode=0777
        - dir_mode=0777
        bind_options: rw

alfpark commented 5 years ago

This will be fixed when the blobxfer issue is resolved. As a workaround, mount the Azure File share as a shared_data_volume and copy the files directly.
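The direct-copy workaround could be sketched as follows, assuming the share is already mounted on the node and the task can read the mount path. The function name copy_from_mount and its parameters are hypothetical; the mount directory is whatever path the share is visible under in your setup (in the comment above, a plain cp from /mnt/batch/tasks/mounts/azfile-storage-test worked).

```python
import shutil
from pathlib import Path


def copy_from_mount(mount_dir, rel_path, dest_dir):
    """Copy one file from a mounted share and verify its size.

    mount_dir: directory where the Azure File share is mounted (assumed).
    rel_path:  path of the file relative to the share root.
    dest_dir:  local directory to copy into (created if missing).
    Returns the destination path; raises OSError on a size mismatch.
    """
    src = Path(mount_dir) / rel_path
    dest = Path(dest_dir) / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest)
    src_size = src.stat().st_size
    dest_size = dest.stat().st_size
    if dest_size != src_size:
        raise OSError(
            f"size mismatch for {src}: expected {src_size}, got {dest_size}"
        )
    return dest
```

The size check after the copy is there only to fail the task early if a partial file is written, instead of letting OCR fail later with a less obvious error.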