byu-dml / d3m-experimenter

A distributed system for creating, running, and persisting many machine learning experiments.

Get dataset doc path and problem path functions return invalid paths #91

Closed e13h closed 3 years ago

e13h commented 3 years ago

Currently, get_dataset_doc_path() builds its result purely by string manipulation on the dataset id it is given. https://github.com/byu-dml/d3m-experimenter/blob/001a68aa988d58cecf2de078ce6bdc72c266641b/experimenter/utils.py#L13-L24

get_problem_path() does something similar. https://github.com/byu-dml/d3m-experimenter/blob/001a68aa988d58cecf2de078ce6bdc72c266641b/experimenter/utils.py#L40-L49

I think this was done on the assumption that datasetDoc.json and problemDoc.json always follow the same file path structure, but I found an example that breaks that assumption.

Using the dataset id: LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA_dataset_TEST
Corresponding problem id: LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA_problem

>>> get_dataset_doc_path('LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA_dataset_TEST',
...    '/users/data/d3m/datasets/seed_datasets_current/')

/users/data/d3m/datasets/seed_datasets_current/LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA_dataset_TEST/LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA_dataset_TEST_dataset/datasetDoc.json

>>> get_problem_path('LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA_problem',
...    '/users/data/d3m/datasets/seed_datasets_current/')

/users/data/d3m/datasets/seed_datasets_current/LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA_problem/LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA_problem_problem/problemDoc.json

The correct paths for these files are:

/users/data/d3m/datasets/seed_datasets_current/LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA/TEST/dataset_TEST/datasetDoc.json
/users/data/d3m/datasets/seed_datasets_current/LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA/TEST/problem_TEST/problemDoc.json
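
To make the mismatch concrete, here is a minimal sketch of the kind of string construction that would produce the dataset path printed above. It is reconstructed only from that output; the function name naive_dataset_doc_path is hypothetical and this is not the code in experimenter/utils.py.

```python
import posixpath


def naive_dataset_doc_path(dataset_id: str, datasets_dir: str) -> str:
    """Hypothetical reconstruction, inferred from the printed output only.

    Assumes the layout <datasets_dir>/<id>/<id>_dataset/datasetDoc.json,
    which does not hold for the MIN_METADATA example in this issue.
    """
    return posixpath.join(
        datasets_dir, dataset_id, dataset_id + "_dataset", "datasetDoc.json"
    )


dataset_id = "LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA_dataset_TEST"
datasets_dir = "/users/data/d3m/datasets/seed_datasets_current/"

guessed = naive_dataset_doc_path(dataset_id, datasets_dir)
# Path reported in this issue as the correct location on disk.
actual = (
    "/users/data/d3m/datasets/seed_datasets_current/"
    "LL1_terra_canopy_height_long_form_s4_90_MIN_METADATA/TEST/dataset_TEST/datasetDoc.json"
)
print(guessed == actual)
```

The directory names on disk (TEST/dataset_TEST) cannot be derived from the id by a fixed suffix rule, so any approach that only rewrites the id string will miss this case.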

I suggest an alternative implementation that uses D3M's get_datasets_and_problems() function, which traverses the datasets directory and returns dictionaries mapping dataset and problem ids to their corresponding file paths. PR coming soon...
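
For illustration, the traversal idea can be sketched with the standard library alone. This is a stand-in, not D3M's implementation: the function name index_datasets_and_problems is hypothetical, and it assumes the ids are stored under about.datasetID in datasetDoc.json and about.problemID in problemDoc.json.

```python
import json
import os
from typing import Dict, Tuple


def index_datasets_and_problems(datasets_dir: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Walk datasets_dir and map ids to the doc files that declare them.

    Hypothetical stand-in for the D3M helper. Assumes each datasetDoc.json
    has about.datasetID and each problemDoc.json has about.problemID.
    If an id appears in more than one file, the last one visited wins.
    """
    datasets: Dict[str, str] = {}
    problems: Dict[str, str] = {}
    for dirpath, _dirnames, filenames in os.walk(datasets_dir):
        if "datasetDoc.json" in filenames:
            path = os.path.join(dirpath, "datasetDoc.json")
            with open(path, encoding="utf-8") as f:
                datasets[json.load(f)["about"]["datasetID"]] = path
        if "problemDoc.json" in filenames:
            path = os.path.join(dirpath, "problemDoc.json")
            with open(path, encoding="utf-8") as f:
                problems[json.load(f)["about"]["problemID"]] = path
    return datasets, problems
```

With an index like this, get_dataset_doc_path() and get_problem_path() reduce to dictionary lookups, so the result no longer depends on how the directories happen to be named.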