datarobot / datarobot-user-models

User's custom models boilerplate
Other
96 stars 85 forks source link

Recursive search for custom.py can lead to errors #605

Open kindofluke opened 2 years ago

kindofluke commented 2 years ago

What: Within the model adapter, DRUM asserts that there can be only one custom.py file within the code folder.

Here is the snippet at line 139:

    def load_custom_hooks(self):
        custom_file_paths = list(Path(self._model_dir).rglob("{}.py".format(CUSTOM_FILE_NAME)))
        assert len(custom_file_paths) <= 1

the problem lies with the rglob function on a path. The full recursive search for custom.py means the the system will search. all subdirectories in the code folder and find any custom.py file. Several major python packages include a custom.py for other reasons and those files are found within the virtual environment folder.

When multiple custom.py files are found, the assertion on line 139 will fail throwing an error.

Example

The openpyxl has several files named custom.py and those all get captured if your virtual environment is part of the code folder.

Here is a look at custom_file_paths in a debugger. venv_custompy

The assertion will then fail.

I believe that I adopt a fairly common practice of having my virtual environment in my code folder and adding the environment to .gitignore

solution

I think changing rglob to glob will fix this issue. I have tested this locally but I am not able to install all the prerequisites for running the test suite so I can't submit a branch.

kindofluke commented 2 years ago

I was thinking about this more and there may one unintended consequence. I think DRUM was set-up so that users could have kind of a model mono-repo such that they could have a lot of different code and maybe have a subfolder that had a custom.py for datarobot. So consider this structure:

really_cool_predictor/
├─ webapp/
│  ├─ flask.py
├─ model_development.ipynb
├─ artifact.pkl
├─ datarobot-drum/
│  ├─ custom.py

In this case, DRUM won't find the custom.py without using rglob. A couple of ideas,

  1. look for a gitignore file and use the gitignore rules to find the file. This would be cool but would require some work
  2. use the first custom.py entry. The return from rglob is actually ordered. so a custom.py in the main code folder or its sub-folders would always appear before custom.py files buried deep in venv