dotmesh-io / dotscience-python

Python library for Dotscience workloads
Apache License 2.0

using ds.input("path-to-dataset") doesn't recursively add it #16

Closed by lukemarsden 5 years ago

lukemarsden commented 5 years ago

I did `ds.input("data")`, where `data` is the mountpoint of an S3 dataset, expecting it to recursively add everything inside. Instead, I just got this:

[Screenshot 2019-07-26 at 14:40:40]

On prod, using the latest `ds` curled from get.dotscience.com just now.

lukemarsden commented 5 years ago

Possibly because the Python lib doesn't detect it as a directory if it's a mountpoint?
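For reference, here's a quick sanity check of that hypothesis. This is only a sketch of what a recursive `ds.input` might do internally (not the actual dotscience implementation), but it shows that `os.path.isdir` does treat a mounted directory like any other directory, so a plain `os.walk` would pick up its contents:

```python
import os
import tempfile

def collect_inputs(path):
    """Hypothetical sketch: if path is a directory (mountpoints included),
    recursively collect every file under it; otherwise return the single file."""
    if os.path.isdir(path):
        files = []
        for root, _dirs, names in os.walk(path):
            files.extend(os.path.join(root, name) for name in names)
        return sorted(files)
    return [path]

# Throwaway directory standing in for the dataset mount
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "roadsigns"))
    for name in ("roadsigns/roadsigns.p", "roadsigns/signnames.csv"):
        open(os.path.join(d, name), "w").close()
    found = collect_inputs(os.path.join(d, "roadsigns"))
    print([os.path.relpath(p, d) for p in found])
```

If mountpoint detection were the problem, `os.path.isdir` would have to return False here, which it doesn't for ordinary bind/FUSE mounts.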

alaric-dotmesh commented 5 years ago

According to the setup of that project, the dataset mount point is `roadsigns`, not `data`.

The run metadata claims that it read `data` from the workspace dot:

[
    {
        "run_id": "e0bdee53-4da4-45f2-86ae-6f29ef51c2f6",
        "authority": 0,
        "description": "pretended to do some data science with my data",
        "workload_file": "test.py",
        "success": true,
        "workspace_input_files": [
            {
                "filename": "data",
                "version": "02274bf7-fdaf-4cc8-ac5c-5e535a0f1070"
            }
        ],
        "workspace_output_files": [
            "fake-model.mdl"
        ],
        "exec_start": "2019-07-26T13:37:09.967989Z",
        "exec_end": "2019-07-26T13:37:09.96835Z"
    }
]

This is consistent: the data was at `roadsigns`, so `data` was nonexistent and was treated as a path in the workspace. Presumably the run actually failed on that basis, although it didn't return a nonzero exit code, since the system thinks it succeeded?

I updated the script to look for `roadsigns`, ran it in the JupyterLab terminal, and got this output:

import dotscience as ds; ds.script()
ds.input("roadsigns")

open(ds.output("fake-model.mdl"), "w").write("hehe")

ds.publish("pretended to do some data science with my data")
[[DOTSCIENCE-RUN:9a794bb8-5d16-49c5-969f-c9ddb1ee03b5]]{
    "description": "pretended to do some data science with my data",
    "end": "20190729T161248.363008",
    "input": [
        "roadsigns/roadsigns.p",
        "roadsigns/signnames.csv"
    ],
    "labels": {},
    "output": [
        "fake-model.mdl"
    ],
    "parameters": {},
    "start": "20190729T161248.362405",
    "summary": {},
    "version": "1",
    "workload-file": "test.py"
}[[/DOTSCIENCE-RUN:9a794bb8-5d16-49c5-969f-c9ddb1ee03b5]]

That's correct! So the dotscience python library is performing as expected.

So: is `ds run` not setting up the dataset mount correctly, I wonder?

...nope, that seems right; here's the same run through `ds run`:

51544-06-04 21:32:13.000 Z:  You have not called ds.start() yet, so I'm doing it for you!
51544-06-04 21:32:14.000 Z:  [[DOTSCIENCE-RUN:cdc6b60e-10d2-4b7a-9a5d-9909ab19bcb9]]{
51544-06-04 21:32:14.000 Z:      "description": "pretended to do some data science with my data",
51544-06-04 21:32:14.000 Z:      "end": "20190729T162907.932689",
51544-06-04 21:32:14.000 Z:      "input": [
51544-06-04 21:32:14.000 Z:          "roadsigns/roadsigns.p",
51544-06-04 21:32:14.000 Z:          "roadsigns/signnames.csv"
51544-06-04 21:32:14.000 Z:      ],
51544-06-04 21:32:14.000 Z:      "labels": {},
51544-06-04 21:32:14.000 Z:      "output": [
51544-06-04 21:32:14.000 Z:          "fake-model.mdl"
51544-06-04 21:32:14.000 Z:      ],
51544-06-04 21:32:14.000 Z:      "parameters": {},
51544-06-04 21:32:14.000 Z:      "start": "20190729T162907.931883",
51544-06-04 21:32:14.000 Z:      "summary": {},
51544-06-04 21:32:14.000 Z:      "version": "1",
51544-06-04 21:32:14.000 Z:      "workload-file": "test.py"
51544-06-04 21:32:14.000 Z:  }[[/DOTSCIENCE-RUN:cdc6b60e-10d2-4b7a-9a5d-9909ab19bcb9]]

AFAICT the problem was just that the script was looking for `data` when the dataset is mounted at `roadsigns`...
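For what it's worth, a guard along these lines (purely hypothetical, not part of the dotscience API) would make this failure mode loud instead of silently recording the missing path as a workspace input file:

```python
import os

def checked_input(path):
    """Hypothetical wrapper for ds.input(): fail fast if the path doesn't
    exist, instead of quietly treating it as a file in the workspace dot."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"ds.input({path!r}): no such file or directory; "
            "is the dataset actually mounted at this path?"
        )
    return path
```

With something like this, `checked_input("data")` in the original script would have raised immediately, rather than the run being marked as a success with a bogus `workspace_input_files` entry.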