HigherOrderCO / Bend

A massively parallel, high-level programming language
https://higherorderco.com
Apache License 2.0

Reading large arrays seems to take forever and, if it finishes, results in a segmentation fault #640

Open ArthurMPassos opened 1 month ago

ArthurMPassos commented 1 month ago

Reproducing the behavior

Running this script with "bend run-c", I can parse a file with up to 2**15 elements, but with a 2**16-element file the beginning of the list prints and then I get a segmentation fault. I can't see any issue in the implementation itself that could cause this behavior.

Input file format: the script parses a file into a list of integers, with the first line giving the size of the dataset. Example:

4
41
32
17
23

The script:

# converts a digit character to an integer
def CharToInt(char):
  return char - 48

def strLen(l):
  match l:
    case String/Nil: 
      return 0
    case String/Cons: 
      return strLen(l.tail) + 1

# computes value raised to the given power by linear recursion
def pow(value, power):
  if power == 0:
    return 1
  else:
    return value * pow(value, power - 1)

def StringToInt(str):
  length = strLen(str)
  return _StringToInt(str, length)

def _StringToInt(str, acc):
  fold str with acc:
    case String/Cons:
      return CharToInt(str.head) * pow(10, acc - 1) + str.tail(acc - 1)
    case String/Nil:
      return 0
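# For example, with str = "423" the accumulator starts at strLen("423") = 3,
# so the fold unfolds to
#   CharToInt('4') * pow(10, 2) + CharToInt('2') * pow(10, 1) + CharToInt('3') * pow(10, 0)
#   = 400 + 20 + 3 = 423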

def parse_list(txt):
  match txt:
    case String/Cons:
      res = List/Cons { head: StringToInt(txt.head), tail: parse_list(txt.tail) }
    case String/Nil:
      res = List/Nil
  return res

def get_dataset():
  with IO:
    fd <- IO/FS/open("../datasets/sequence_random.txt", "r")

    bytes <- IO/FS/read_to_end(fd)

    match res = Bytes/split_once(bytes, '\n'):
      case Result/Ok:
        sizeAndList = res.val
      case Result/Err:
        sizeAndList = ([], [])

    (sizeBytes, rest) = sizeAndList
    # first line has size of the dataset
    full_size = StringToInt(Bytes/decode_utf8(sizeBytes))
    bend size = full_size, bytes = rest: 
      when size > 0:
        match res = Bytes/split_once(bytes, '\n'):
          case Result/Ok:
            (line, rest) = res.val
            value = StringToInt(Bytes/decode_utf8(line))
            list = List/Cons { head: value, tail: fork(size - 1, rest) }
          case Result/Err:
            value = 0
            list = List/Nil
      else:
        list = List/Nil
    return list

# appends element e to the end of list l
append = @l @e
  (concat l (List/Cons e List/Nil))

# concatenates two lists
concat = @l1 @l2
  match l1 {
    List/Cons: (List/Cons l1.head (concat l1.tail l2))
    List/Nil: l2
  }

# splits list l at index i into a pair (first i elements, remainder)
split = @l @i (split.aux [] l i)

split.aux = @acc @l @i
  match l {
    List/Cons: 
      switch i {
        0: (acc, l)
        _: (split.aux (append acc l.head) l.tail i-1)
      }
    List/Nil: *
  }
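# For example, (split [1, 2, 3, 4] 2) reduces to ([1, 2], [3, 4]):
# the accumulator collects the first i elements and the rest is returned as-is.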

# computes the length of a list
len = @l
  match l {
    List/Nil: 0
    List/Cons: (+ 1 (len l.tail))
  }

def main():
  with IO:
    fd <- IO/FS/open("../datasets/lines_sorted.txt", "r")

    bytes <- IO/FS/read_to_end(fd)

    match res = Bytes/split_once(bytes, '\n'):
      case Result/Ok:
        sizeAndList = res.val
      case Result/Err:
        sizeAndList = ([], [])

    (sizeBytes, rest) = sizeAndList
    # first line has size of the dataset
    full_size = StringToInt(Bytes/decode_utf8(sizeBytes))
    bend size = full_size, bytes = rest: 
      when size > 0:
        match res = Bytes/split_once(bytes, '\n'):
          case Result/Ok:
            (line, rest) = res.val
            value = StringToInt(Bytes/decode_utf8(line))
            list = List/Cons { head: value, tail: fork(size - 1, rest) }
          case Result/Err:
            value = 0
            list = List/Nil
      else:
        list = List/Nil

    return list


Additional context

It might be related to this issue: https://github.com/HigherOrderCO/Bend/issues/632

ArthurMPassos commented 1 month ago

Here is a script that generates input files to reproduce the issue.

import numpy as np

path = "../datasets/"

def generate_uniform_dataset(size, low, high):
    """
    Generate a uniformly distributed dataset.
    """
    return np.random.uniform(low, high, size).astype(int)

def generate_random_dataset(size, low, high):
    """
    Generate a randomly distributed dataset.
    """
    return np.random.randint(low, high, size)

def generate_skewed_dataset(size, low1, high1, low2, high2, skew_ratio=0.8):
    """
    Generate a skewed dataset with the specified ratio.
    """
    size_majority = int(size * skew_ratio)
    size_minority = size - size_majority

    majority_part = np.random.randint(low1, high1, size_majority)
    minority_part = np.random.randint(low2, high2, size_minority)

    return np.concatenate([majority_part, minority_part])

def save_dataset(filename, dataset):
    """
    Save the dataset to a file: the element count on the first line,
    then one value per line.
    """
    with open(filename, 'w') as f:
        f.write(str(len(dataset)) + '\n' + '\n'.join(map(str, dataset)))

if __name__ == "__main__":
    ARRAY_SIZE = 2**16
    # Large random dataset
    large_random_dataset = generate_random_dataset(ARRAY_SIZE, 0, 2**19)  # 524288
    save_dataset(path + "lines_random.txt", large_random_dataset)

    # Large skewed dataset
    large_skewed_dataset = generate_skewed_dataset(ARRAY_SIZE, 0, 1000, 1001, 1000000)
    save_dataset(path + "lines_skewed.txt", large_skewed_dataset)

    # Generate and save sorted datasets
    sorted_dataset = np.sort(large_random_dataset)
    save_dataset(path + "lines_sorted.txt", sorted_dataset)

    # Generate and save reverse sorted datasets
    reverse_sorted_dataset = sorted_dataset[::-1]
    save_dataset(path + "lines_reverse.txt", reverse_sorted_dataset)

    print("Datasets generated and saved to files.")

developedby commented 1 month ago

I'll have to test your program later, but some things stand out to me.

First off, IO is indeed very slow right now, but it should be linear with the number of operations (counting each read byte, each argument passed, etc).

The segfault sounds like some problem in the HVM interface with the OS.

I find it surprising that your program works as you expected, because it has some type errors regarding the use of with.

When you open a with IO block and sequence operations with var <- action, the result of that block should be of IO type. Both main and the unused get_dataset instead return a plain list.
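For illustration, here is a minimal sketch of a with-block whose result has IO type (parsing elided; this assumes with-blocks bind wrap to the monad's return, as described in the Bend guide):

# sketch only: 'wrap' is assumed to be bound to IO's monadic return
def main():
  with IO:
    fd <- IO/FS/open("../datasets/lines_sorted.txt", "r")
    bytes <- IO/FS/read_to_end(fd)
    # wrap the final value so the whole block has IO type
    return wrap(Bytes/decode_utf8(bytes))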

developedby commented 1 month ago

Crash possibly the same as https://github.com/HigherOrderCO/Bend/issues/632