ML-course / master

A machine learning course using Python, Jupyter Notebooks, and OpenML

Error downloading dataset #19

Closed Krulvis closed 2 years ago

Krulvis commented 5 years ago

Running this code, the process gets interrupted.

import openml as oml
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# Download Streetview data. Takes a while the first time.
SVHN = oml.datasets.get_dataset(41081)
X, y, cats, attrs = SVHN.get_data(dataset_format='array',
    target=SVHN.default_target_attribute)

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I have 12 GB of memory free and can't figure out why I am getting this interrupt.

joaquinvanschoren commented 5 years ago

Can you tell me on which line the error occurs?

If it happens on `get_dataset`: that call only downloads the data and stores it on disk, so the error could be that your connection hung up (too slow, network reset, ...), a disk space issue, or a permissions issue. OpenML tries to cache the dataset in your home directory (in a folder called `.openml`), and you need about 3 GB available to store both the original dataset and the Python version of the data. It could also be a permissions issue if the process is not allowed to write in your home directory.

If it happens on the second line (`get_data`), then it is more likely to be memory related - maybe your Python process does not get enough of your memory.
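If you want to rule out the disk space and permission causes quickly, something like this prints both (a small sketch using only the standard library; `~/.openml` is the default cache location and may differ if you reconfigured it):

# Sketch: check the default cache location for free space and write access.
import os
import shutil

cache_dir = os.path.expanduser("~/.openml")
os.makedirs(cache_dir, exist_ok=True)

free_gb = shutil.disk_usage(cache_dir).free / 1024 ** 3
print(f"Free space at {cache_dir}: {free_gb:.1f} GB")   # roughly 3 GB or more is needed
print("Writable:", os.access(cache_dir, os.W_OK))       # False hints at a permissions issue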

If you run your notebook from the command line, also check the command line output for any error messages from the OS or Jupyter.

Krulvis commented 5 years ago

It happens on the `get_dataset` line. When I run it separately it still crashes the Jupyter notebook kernel, or gets SIGKILLed when run as a Python file. However, I do see some files in the `.openml` folder for dataset 41081:

[screenshot: contents of the `.openml` cache folder for dataset 41081]

It's probably crashing before downloading everything, because when I run it again I get an `EOFError` from line 180 in `openml/datasets/dataset.py`:

            with open(data_pickle_file, "rb") as fh:
                data, categorical, attribute_names = pickle.load(fh) #<- this gives EOFError

The `pickle.load(fh)` call is looking for the `.pkl.py3` file, which exists but is 0 bytes.
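For reference, loading a 0-byte pickle file reproduces exactly this error (a minimal stand-alone sketch):

# Sketch: an empty (0-byte) pickle file raises the same EOFError on load.
import pickle
import tempfile

with tempfile.NamedTemporaryFile(suffix=".pkl.py3", delete=False) as fh:
    empty_path = fh.name        # the file is created but nothing is written to it

with open(empty_path, "rb") as fh:
    pickle.load(fh)             # raises EOFError: Ran out of input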

joaquinvanschoren commented 5 years ago

Hmm... Maybe the download was corrupted. Can you remove the '41081' folder from the cache (or remove the entire cache) and try to download again?

Krulvis commented 5 years ago

Yes, when I remove it the same thing happens: first it starts downloading and gets killed with

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

When run again, it gives the `EOFError` mentioned before, but the files in my `.openml` folder have been downloaded with the exact same sizes as before. My connection should be fast enough, so I do not understand why it gets killed. How big should the data files be? Can you confirm from the file sizes in the previous screenshot that I am missing data?

joaquinvanschoren commented 5 years ago

Is there any output on the command line that can explain the SIGKILL?
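If nothing shows up, you could turn on verbose logging before the download (a quick sketch; it assumes openml-python logs through the standard `logging` module under an "openml" logger name):

# Sketch: enable verbose logging to get more output from the download step.
import logging
import openml as oml

logging.basicConfig(level=logging.DEBUG)              # print DEBUG output to the console
logging.getLogger("openml").setLevel(logging.DEBUG)   # logger name is an assumption

SVHN = oml.datasets.get_dataset(41081)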

joaquinvanschoren commented 5 years ago

Let's take Jupyter out of the equation. Can you just open a Python terminal and execute this?

import openml as oml
SVHN = oml.datasets.get_dataset(41081)

Krulvis commented 5 years ago

> Is there any output on the command line that can explain the SIGKILL?

No, and I also don't know how I can add more debugging options.

> Let's take Jupyter out of the equation. Can you just open a Python terminal and execute this?
>
> import openml as oml
> SVHN = oml.datasets.get_dataset(41081)

I already tried this (and mentioned it in a previous comment); it gives the same result. Either it's run from the Jupyter notebook and the kernel dies, or it's run from a terminal and it gets killed with SIGKILL (I don't know why). Below is the output when run directly from a terminal:

[screenshot: terminal output of running the script]

Another thing to note: everything seems to download quite quickly, except for the `pkl.py3` file.

Krulvis commented 5 years ago

I just got to JADS and heard similar stories from multiple people. I am definitely not the only one experiencing this problem.

PGijsbers commented 5 years ago

Hey, one of the openml-python devs here. I haven't heard of this happening before and can't reproduce it. I'll do my best to explain what seems to be happening.

Internally, the package first downloads the arff file, then converts it to a pandas dataframe, which is pickled to disk. The `dataset.pkl.py3` file itself is never downloaded. It seems like the process is killed during the pickling step, as the file is created but corrupt. On subsequent loads, the package does not check for corrupt files and assumes that since there is a pickle file, it should use that instead of the arff file. This produces the `EOFError`.
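In pseudo-form, that caching logic looks roughly like this (a simplified sketch for illustration, not the actual openml-python code; the file names and helper are illustrative):

# Simplified sketch of the caching behaviour described above (not the real implementation).
import os
import pickle

import arff            # liac-arff, the arff parser openml-python relies on
import pandas as pd

def load_cached_dataset(cache_dir):
    pickle_path = os.path.join(cache_dir, "dataset.pkl.py3")
    arff_path = os.path.join(cache_dir, "dataset.arff")

    if os.path.exists(pickle_path):
        # An existing pickle is trusted blindly, so a 0-byte or truncated
        # file slips through and pickle.load raises EOFError here.
        with open(pickle_path, "rb") as fh:
            return pickle.load(fh)

    # Otherwise parse the (much slower) arff file and cache the result.
    with open(arff_path) as fh:
        raw = arff.load(fh)
    df = pd.DataFrame(raw["data"], columns=[name for name, _ in raw["attributes"]])
    with open(pickle_path, "wb") as fh:
        pickle.dump(df, fh)   # if the process is killed here, a corrupt pickle is left behind
    return df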

I'd like for you to try two things:

  1. Run the script with everything still in cache and confirm you get the EOF error. Then delete the dataset.pkl.py3 file (and only that file, not other cached data from the datasets) and run it again; a sketch for locating the file follows below the list. This should prompt openml-python to pickle the arff file, but we can isolate it from the other download steps that would have happened before. Please let me know if this still gives the SIGKILL or other errors.

  2. Download the dataset.pkl.py3 file from [here](https://transfernow.net/84kq25p1raig), and place it in the cache directory (instead of the corrupt 0-byte file). Then run the script. The script should load the OpenMLDataset fine (so long as my package versions are similar enough to yours, which is a point of failure).
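For step 1, something like this finds and removes only the cached pickle for this dataset (a sketch; `~/.openml` is the default cache root and the exact layout can differ between openml-python versions, so it simply searches for the file):

# Sketch: delete only the (possibly corrupt) dataset.pkl.py3 for dataset 41081.
import pathlib

cache_root = pathlib.Path.home() / ".openml"
for pkl in cache_root.rglob("dataset.pkl.py3"):
    if "41081" in pkl.parts:                      # only touch the SVHN dataset's cache
        print(f"removing {pkl} ({pkl.stat().st_size} bytes)")
        pkl.unlink()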

I'd also appreciate any information you're willing to share regarding your environment: OS, Python version, list of all installed python packages and versions and system specifications.

joaquinvanschoren commented 5 years ago

> I just got to JADS and heard similar stories from multiple people. I am definitely not the only one experiencing this problem.

I didn't suggest you are :). I'm just trying to understand what may go wrong since I can't reproduce it locally.

Krulvis commented 5 years ago

> I'd like for you to try two things:
>
> 1. Run the script with everything still in cache, confirm you get the `EOF` error. Then delete the `dataset.pkl.py3` file (and only that file, not other cached data from the datasets), run it again. This should prompt `openml-python` to pickle the `arff` file, but we can isolate it from the other download steps that would have happened before. Please let me know if this still gives the `sigkill` or other errors.

When I do this, I get the exact same output, so I believe the problem originates from pickling the dataset that is already present on my system. After downloading (which crashes), re-running and getting the `EOFError`, deleting `dataset.pkl.py3`, and re-running, it gets killed again after a long runtime.

> 2. Download the `dataset.pkl.py3` file from [here](https://transfernow.net/84kq25p1raig), and place it in the cache directory (instead of the corrupt 0 byte file). Then run the script. The script should load the OpenMLDataset fine (so long as my package versions are similar enough to yours, which is a point of failure).

This temporary solution works fine, which adds to the theory that the problem stems from pickling.

> I'd also appreciate any information you're willing to share regarding your environment: OS, Python version, list of all installed python packages and versions and system specifications.

OS: Arch Linux
Python: 3.7.4
pip list: https://pastebin.com/t4sKXCg0

Also tested on a Mac:
OS: Mojave 10.14.6
Python: 3.7.4
pip list: https://pastebin.com/70SwMEyF

PGijsbers commented 5 years ago

Cool, thanks for the additional information, and great to hear that replacing the file helps! That definitely isolates the issue to pickling to disk. Hopefully the details on your systems will help us reproduce and fix the issue.
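For reference, the kind of guard that would avoid the second failure looks roughly like this (a sketch only, not a patch to the actual openml-python code): treat an empty or unreadable pickle as a cache miss instead of an answer.

# Sketch: tolerate a corrupt cache entry instead of failing with EOFError.
import os
import pickle

def load_pickle_or_none(pickle_path):
    # Return the cached object, or None if the cache file is missing or corrupt.
    if not os.path.exists(pickle_path) or os.path.getsize(pickle_path) == 0:
        return None
    try:
        with open(pickle_path, "rb") as fh:
            return pickle.load(fh)
    except (EOFError, pickle.UnpicklingError):
        os.remove(pickle_path)      # drop the corrupt file so it gets rebuilt from the arff
        return None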

joaquinvanschoren commented 5 years ago

Thanks!

Here is a working environment (hope this helps to narrow it down):
OS: Mojave 10.14.5
Python: 3.7.3
pip freeze: https://pastebin.com/WXJdnn31

joaquinvanschoren commented 4 years ago

We're looking into using Arrow for caching instead of pickle. It is more reliable for large files and loads a lot faster.
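For comparison, caching the dataframe in Arrow's feather format instead of pickle would look roughly like this (a sketch; it assumes pandas with pyarrow installed, and the file name is illustrative):

# Sketch: cache a dataframe via the Arrow (feather) format instead of pickle.
import pandas as pd

df = pd.DataFrame({"x": range(5), "y": list("abcde")})

df.to_feather("dataset.feather")                # write; requires pyarrow
df_again = pd.read_feather("dataset.feather")   # read it back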