NYCPlanning / db-data-library

📚 Data Library
https://nycplanning.github.io/db-data-library/library/index.html
MIT License

`get_execution_details` fails in Github Actions #402

Closed td928 closed 1 year ago

td928 commented 1 year ago

Hey Data Engineering!

I've been trying to set up my own version of data library on AWS S3 for a project I am working on. I see there are some new changes to data library, and one of them is causing the archive function to fail when called from the latest release in Actions.

For example, when something like the command below is run for the development database:

./devdb.sh library_archive_version dob_geocode_results 20230709

It throws the error below. The error persists even after I added a git installation to the Actions runner. I wonder whether the new module was tested with some of the existing data products, and what adjustments, if any, are needed to work with the new improvements.

❱ 73 │   │   git_user = try_func(subprocess.run(["git", "config", "user.name │
│   74 │   │   return {                                                        │
│   75 │   │   │   "type": "manual",                                           │
│   76 │   │   │   "user": git_user,                                           │
│                                                                              │
│ /usr/lib/python3.10/subprocess.py:501 in run                                 │
│                                                                              │
│    498 │   │   kwargs['stdout'] = PIPE                                       │
│    499 │   │   kwargs['stderr'] = PIPE                                       │
│    500 │                                                                     │
│ ❱  501 │   with Popen(*popenargs, **kwargs) as process:                      │
│    502 │   │   try:                                                          │
│    503 │   │   │   stdout, stderr = process.communicate(input, timeout=timeo │
│    504 │   │   except TimeoutExpired as exc:                                 │
│                                                                              │
│ /usr/lib/python3.10/subprocess.py:969 in __init__                            │
│                                                                              │
│    966 │   │   │   │   │   self.stderr = io.TextIOWrapper(self.stderr,       │
│    967 │   │   │   │   │   │   │   encoding=encoding, errors=errors)         │
│    968 │   │   │                                                             │
│ ❱  969 │   │   │   self._execute_child(args, executable, preexec_fn, close_f │
│    970 │   │   │   │   │   │   │   │   pass_fds, cwd, env,                   │
│    971 │   │   │   │   │   │   │   │   startupinfo, creationflags, shell,    │
│    972 │   │   │   │   │   │   │   │   p2cread, p2cwrite,                    │
│                                                                              │
│ /usr/lib/python3.10/subprocess.py:1845 in _execute_child                     │
│                                                                              │
│   1842 │   │   │   │   │   │   err_filename = orig_executable                │
│   1843 │   │   │   │   │   if errno_num != 0:                                │
│   1844 │   │   │   │   │   │   err_msg = os.strerror(errno_num)              │
│ ❱ 1845 │   │   │   │   │   raise child_exception_type(errno_num, err_msg, er │
│   1846 │   │   │   │   raise child_exception_type(err_msg)                   │
╰──────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: [Errno 2] No such file or directory: 'git'

I also found at least one instance of a scheduled run hitting this in the now-archived DevDB main branch, see here. If you go to the geocode step and then to the Archive to Data Library step, you will see the same error, and the geocoded data probably failed to push to S3 there.

Just want to flag this for your attention and will let you know if I somehow find the answer for this.

Thanks!

All the best.

Te

fvankrieken commented 1 year ago

Hi Te! Thanks for flagging.

A couple notes for fixing

fvankrieken commented 1 year ago

Are you also specifically having issues using the published action? At least in the data library repo, those seem to be working fine on our end. Either way, once we fix the try...except issue you'll be able to run. In the meantime, if you're using Docker from the command line, I believe you could pass the CI variable in with your call to docker to run it right now.
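For context on the try...except issue: the traceback in the original report shows `subprocess.run(...)` being passed as an argument to `try_func`. Python evaluates that call before `try_func` is entered, so a missing `git` executable raises `FileNotFoundError` outside any handler. Below is a minimal sketch of one way to guard the lookup. The helper name `git_config_value` and its `git_executable` parameter are hypothetical and are not taken from db-data-library; the actual fix in the library may look different.

```python
import subprocess
from typing import Optional


def git_config_value(key: str, git_executable: str = "git") -> Optional[str]:
    """Return the value of `git config <key>`, or None if it cannot be read.

    Hypothetical helper for illustration only; not part of db-data-library.
    """
    try:
        # The call happens inside the try block, so a missing executable
        # is caught here instead of propagating to the caller.
        result = subprocess.run(
            [git_executable, "config", key],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # FileNotFoundError: the executable is not on PATH.
        # CalledProcessError: git ran but exited non-zero (e.g. key unset).
        return None
    return result.stdout.strip() or None
```

With this shape, a runner or container without git gets `None` for the username instead of a crash.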

td928 commented 1 year ago

@fvankrieken Thanks for the explanation, that makes sense to me. On closer look, it is true that devdb is using it as a Docker image, not the CI. But in my case, I should be able to follow your suggestion to use it as an action and sidestep this for now. I will keep an eye on the PR, and once it makes it into the release (Docker and CI) I can test and let you know.

fvankrieken commented 1 year ago

Gonna close this. Docker calls do work now, even if execution details might be inaccurate when env variables aren't passed. If you invoke with -e CI=$CI, the execution details will be accurate, at least in that regard. Longer term, we should maybe move away from using the git CLI to get a username (or let users specify it somehow), but it made sense in our dev environment with dev containers (and a "vscode" user rather than the actual user's username).
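As a rough illustration of why the `CI` variable matters here: GitHub Actions sets `CI=true` on the runner, but a container started with `docker run` only sees it if it is forwarded explicitly. The sketch below shows how execution details could branch on that variable and let users specify a username without the git CLI. The function name, the `"ci"` type value, and the `LIBRARY_USER` variable are assumptions made for illustration; only the `"manual"` type and the `"user"` key appear in the traceback.

```python
import getpass
import os
from typing import Dict, Mapping, Optional


def execution_details(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Describe how a run was started, based on environment variables.

    Illustrative sketch only; not the library's implementation.
    """
    env = os.environ if env is None else env
    if env.get("CI"):
        # "ci" is an assumed label for runs where CI is set.
        return {"type": "ci"}
    # LIBRARY_USER is a hypothetical override so users can name themselves.
    user = env.get("LIBRARY_USER")
    if not user:
        try:
            user = getpass.getuser()
        except (OSError, KeyError, ImportError):
            # No usable login name in this environment.
            user = "unknown"
    return {"type": "manual", "user": user}
```

Without `-e CI=$CI`, a container run from a CI job would fall through to the manual branch, which is the inaccuracy described above.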