Azure / azureml-examples

Official community-driven Azure Machine Learning examples, tested with GitHub Actions.
https://docs.microsoft.com/azure/machine-learning
MIT License

MLTable - AzureML - Cache Environment variables #3143

Open FrsECM opened 5 months ago

FrsECM commented 5 months ago

Operating System

Linux

Version Information

mltable-1.6.1 azureml-dataprep-rslex~=2.22.2dev0

Steps to reproduce

  1. Run a job on a compute whose disk size is S
  2. Mount a datastore as a folder with mltable, where the datastore's total size > S
  3. Wait...
  4. Crash

For example, in Azure Machine Learning:

import mltable

# Mount the whole datastore (the $sub/$rg/$ws/$ds placeholders are kept as in the original)
storage_paths = [
    {'folder': 'azureml://subscriptions/$sub/resourcegroups/$rg/workspaces/$ws/datastores/$ds/paths/'}
]
tbl = mltable.from_paths(storage_paths)
mount_context = tbl._mount()
mount_context.start()
# Iterate over files

In order to fix my issue, I need to add extra mount settings: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-read-write-data-v2?view=azureml-api-2&tabs=python#available-mount-settings

I use a wrapper class to do this across multiple storages/containers:

import os
from dataclasses import dataclass, field
from typing import Any, List

import mltable

@dataclass
class MyStorage:
    mount_paths: List[dict] = field(init=False, default_factory=list)
    _is_mounted: bool = field(init=False, default=False)
    _mount_context: Any = field(init=False, default=None)

    def __post_init__(self):
        # Keep at least 40GB free on the cluster (a negative value means "free space to preserve").
        os.environ['DATASET_MOUNT_CACHE_SIZE'] = "-40GB"
        os.environ['DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED'] = "True"

    def mount(self):
        print('Start Mounting storage...')
        for path in self.mount_paths:
            print(f"- {path['folder']}")
        tbl = mltable.from_paths(self.mount_paths)
        self._mount_context = tbl._mount()
        self._mount_context.start()
        self._is_mounted = True
        print(f'Mount Done - {self._mount_context.mount_point}')

    def umount(self):
        if self._is_mounted:
            print(f'Start UnMounting - {self._mount_context.mount_point}')
            self._mount_context.stop()
            self._mount_context = None
            self._is_mounted = False
            print('UnMount Done...')

    def __del__(self):
        self.umount()

storage = MyStorage()
storage.mount_paths = storage_paths
storage.mount()
# Do stuff 
del storage

I also tried to add the environment variables in the YAML job:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

experiment_name: LARGE-JOB
display_name: Large Job

environment_variables:
  DATASET_MOUNT_CACHE_SIZE: "-40 GB"
  DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: "True"
  DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET: "0.0"

....
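For reference, the same variables can also be passed when building the command job from the Python SDK v2. This is only a minimal sketch; the environment name, compute name and script path below are placeholders, not values from the original job:

# Minimal sketch (SDK v2); "my-env", "my-cluster" and src/train.py are placeholder names.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

job = command(
    code="./src",
    command="python train.py",
    environment="my-env@latest",
    compute="my-cluster",
    experiment_name="LARGE-JOB",
    display_name="Large Job",
    environment_variables={
        "DATASET_MOUNT_CACHE_SIZE": "-40 GB",
        "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": "True",
        "DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET": "0.0",
    },
)
ml_client.create_or_update(job)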

But none of these solutions works.

Expected behavior

I expect the disk cache to be pruned when it reaches the -40GB limit (i.e., when less than 40GB of free space remains) on the compute machine.

Actual behavior

Currently, the cache continues to grow (screenshot),

until the job fails (screenshot),

even if I set the environment variables in the YAML (screenshot)

or in code (screenshot).

And I can confirm that the environment variables are set in the job (screenshot).
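(For anyone who wants to double-check this inside their own job, printing the relevant variables from the training process is enough; a minimal sketch:)

# Quick sanity check inside the job: print the mount-related environment variables.
import os
for key, value in sorted(os.environ.items()):
    if key.startswith('DATASET_MOUNT'):
        print(f'{key}={value}')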

But it seems mltable ignores them.

Additional information

No response

FrsECM commented 5 months ago

For people who may have this problem, I found a workaround:

import os
import re

# Import location assumed (MountOptions ships with azureml-dataprep); adjust if your version differs.
from azureml.dataprep.fuse.dprepfuse import MountOptions

def mount_options() -> MountOptions:
    """Build MountOptions from the DATASET_MOUNT_CACHE_SIZE environment variable."""
    max_size = None
    free_space_required = None
    cache_param = os.getenv('DATASET_MOUNT_CACHE_SIZE', None)
    if cache_param:
        # Examples: "-40GB" (keep 40GB free) or "100 MB" (cap the cache at 100MB).
        CACHE_SIZE_PATTERN = r'^(?P<sign>-?)(?P<val>\d+).*(?P<size>[A-Z]{2})$'
        match = re.match(CACHE_SIZE_PATTERN, cache_param)
        if match:
            size = match.group('size')
            if size == 'GB':
                coeff = 1024 ** 3
            elif size == 'MB':
                coeff = 1024 ** 2
            else:
                raise NotImplementedError(f'Not implemented for size {size}')
            value = int(match.group('val')) * coeff
            if match.group('sign') == '-':
                # Negative value => "free_space_required" mode
                free_space_required = value
                print(f'MountOption : {value} Max Free Space')
            else:
                # Positive value => "max_size" mode
                max_size = value
                print(f'MountOption : {value} Max Size')
    return MountOptions(max_size=max_size, free_space_required=free_space_required)
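In other words, a leading minus sign is treated as "free space to preserve" (free_space_required), while a plain positive value is treated as a hard cache cap (max_size), matching how DATASET_MOUNT_CACHE_SIZE is documented; "-40GB" therefore parses to free_space_required = 40 * 1024**3 bytes.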

###### You can now consume your mltable
storage_paths = [
    {'folder': 'azureml://subscriptions/$sub/resourcegroups/$rg/workspaces/$ws/datastores/$ds/paths/'}
]
tbl = mltable.from_paths(storage_paths)
mount_context = tbl._mount(mount_options=mount_options())
mount_context.start()
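(When the job is done with the data, mount_context.stop() unmounts, as in the wrapper class above.)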

Done this way it works, but it ignores the prune target (screenshot).

Anyway, it's a bug for me; the behaviour should be consistent with the documentation.

IvanHahan commented 1 month ago

I have the same bug. Data caching eats up all the space on a 64GB disk, so I can't store training checkpoints. I tried setting DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true, but an error arises because a boolean type can't be set. When I set DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: "true", nothing happens and data keeps getting cached.

FrsECM commented 1 month ago

> I have the same bug. Data caching eats up all the space on a 64GB disk, so I can't store training checkpoints. I tried setting DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true, but an error arises because a boolean type can't be set. When I set DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: "true", nothing happens and data keeps getting cached.

Normally you can use the workaround I posted above: just set the DATASET_MOUNT_CACHE_SIZE environment variable to a size and it should work.
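Putting it together (a minimal sketch, assuming mltable, storage_paths and the mount_options() helper from my earlier comment are already in scope):

# Set the cache limit (here: keep ~40GB free), then pass the parsed options to the mount.
import os
import mltable

os.environ.setdefault('DATASET_MOUNT_CACHE_SIZE', '-40GB')

tbl = mltable.from_paths(storage_paths)
mount_context = tbl._mount(mount_options=mount_options())
mount_context.start()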

But anyway it should be fixed....

FrsECM commented 1 week ago

Another concern we have is that we cannot set other parameters, like these two (screenshot).

It would allow us to fetch less data than we currently do, because with a shuffled dataloader there is no point in caching more blocks than the average image size.

Would it be possible to open-source mltable?