byu-dml / d3m-experimenter

A distributed system for creating, running, and persisting many machine learning experiments.

added_ci and testing #40

Closed orionw closed 5 years ago

epeters3 commented 5 years ago

I love the idea of adding the tests to CI @orionw. Do you know if the .travis.yml file will run without the Travis environment being logged in to registry.datadrivendiscovery.org, where it needs to pull down the docker image? If we need to, we could store the D3M credentials encrypted, make them available in the Travis environment, and add a sudo docker login step to the Travis script: https://docs.travis-ci.com/user/encryption-keys/
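For anyone following along, a minimal sketch of what that login step might look like in `.travis.yml` (the variable names `DOCKER_USER`/`DOCKER_PASS` are assumptions; they would be stored as encrypted environment variables in the Travis repo settings):

```yaml
# Hypothetical sketch, not the project's actual .travis.yml.
# DOCKER_USER / DOCKER_PASS are assumed to be encrypted env vars
# configured in the Travis CI repository settings.
services:
  - docker

before_install:
  # Authenticate against the private registry before pulling the image.
  - echo "$DOCKER_PASS" | docker login registry.datadrivendiscovery.org -u "$DOCKER_USER" --password-stdin
```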

orionw commented 5 years ago

Great point Evan! I totally forgot about the private registry - we'll definitely need to make sure the CI runs before we merge this in. It looks like we'll need to add the login script to the .travis.yml using the docker login and the encrypted keys.

When @bjschoenfeld gets in he can enable the repo for the travis CI and then we can figure the details out. Good catch!

epeters3 commented 5 years ago

Sounds great, nice work!

bjschoenfeld commented 5 years ago

Closed and reopened to trigger CI.

epeters3 commented 5 years ago

I'm curious to have your opinion @orionw: It looks like you've added all the D3M datasets repo data to this repo. Would it simplify things to instead link the D3M datasets repo as a submodule to this repo, so the data never gets out of sync? Maybe that would complicate things too much; I'd love to have your opinion.

orionw commented 5 years ago

I would love to have the d3m datasets as a submodule to this repo so we don't have to hardcode some in our repo. However, it took me 1.5 hours to get most of the seed datasets downloaded and then I had to cancel the clone command so that it didn't take longer to clone (plus it was a pretty large repo, even with only the small file download option). I think the max time for Travis-CI is 30 or 50 minutes and it does have a memory cap (though that wasn't the main problem).

If there was a way to use it as a submodule and only clone part of the repo that would be awesome! I just couldn't find any options other than the depth parameter for cloning and that didn't speed it up. But you're right that hardcoding a few in here won't help us if the datasets/dataset format changes under us. If you have any ideas I'd be very interested to hear them!

epeters3 commented 5 years ago

BTW you don't have to respond to my comments while you're not clocked in here if you don't want to. Thanks for explaining that. That is tricky. Did using git lfs over git speed things up at all?

orionw commented 5 years ago

No, it's totally fine. I prefer to respond while whoever asked is in, so that there isn't a delay in communication. If it's a big question I'll wait until I come in, though. Git LFS didn't speed things up - in fact, I got a message saying git lfs clone is being deprecated because the same functionality was ported to git clone.

epeters3 commented 5 years ago

Do you know how many datasets are needed in order to run the tests? If it's just one or two, this is a slightly kludgy but possibly effective option that allows one to selectively clone directories from a repo, and it actually skips fetching unneeded objects from the server, so it could be fast: https://stackoverflow.com/questions/600079/how-do-i-clone-a-subdirectory-only-of-a-git-repository/52269934#52269934.

orionw commented 5 years ago

Thanks for the link! That seems promising and a lot more dynamic than hardcoding! We probably only need the main structure of the d3m datasets (aka the outer folders saying seed, etc.) and 2-3 datasets (1-2 that we can run on, and one that is not tabular so that we can test skipping incorrect problem types, e.g. graphs or images). I think that's the right direction to go!
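A runnable sketch of the sparse-checkout trick from the linked answer, demonstrated against a throwaway local repo (the folder names like `seed_datasets_current` are illustrative, not the real d3m-datasets layout):

```shell
# Toy demonstration of selectively checking out one directory of a repo.
set -e
tmp=$(mktemp -d)

# Build a throwaway "datasets" repo with one folder we want and one we don't.
git init -q "$tmp/source-repo"
cd "$tmp/source-repo"
mkdir -p seed_datasets_current/185_baseball other_datasets/huge_one
echo "rows" > seed_datasets_current/185_baseball/tables.csv
echo "rows" > other_datasets/huge_one/blob.csv
git add -A
git -c user.email=ci@example.com -c user.name=ci commit -qm "add datasets"
branch=$(git symbolic-ref --short HEAD)

# Sparse clone: only the paths listed in .git/info/sparse-checkout are
# checked out into the work tree. Against a server that supports partial
# clone, adding --filter=blob:none to the fetch also skips downloading
# the unneeded blobs entirely.
git init -q "$tmp/sparse-clone"
cd "$tmp/sparse-clone"
git remote add origin "$tmp/source-repo"
git config core.sparseCheckout true
echo "seed_datasets_current/" >> .git/info/sparse-checkout
git pull -q origin "$branch"

ls   # only seed_datasets_current appears in the work tree
```

The same recipe would apply to the real datasets repo by listing just the two or three dataset folders the tests need in the sparse-checkout file.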

orionw commented 5 years ago

Looks like it's finally fixed. I only added three datasets to keep it small, but the three datasets still added a lot of files to the PR.

We should squash these commits when merging.

epeters3 commented 5 years ago

This looks great @orionw. Since as you say our local copy of the datasets repo changes very infrequently, I feel good about adding those three datasets to the repo.

They say a good goal to have when implementing a test environment is to make sure the environment is identical to your production environment. It looks like you've done as much as you can there. I'm curious to know why you had to add another docker-compose file specific to test. I couldn't figure it out right away. Also, are the tests interacting with Redis or the DB at all? If so, which Redis/DB instances are they interacting with?

orionw commented 5 years ago

@epeters3 I added the separate docker-compose for test because I wanted it to be able to connect to redis and mongo and interact with them. Otherwise a lot of the tests couldn't actually run (unless I did a whole ton of mocking those classes out and I didn't want to do that - it would be good engineering though). It was pretty easy to add those to the compose file and set default ports to simulate our regular environment.

We don't attach those services to our normal compose file because we only need one of them for the whole lab. If you have a different perspective on how to approach this I would be happy to hear it though!
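For context, a minimal sketch of what such a test-only compose file might look like (service names, image tags, and ports here are assumptions, not the project's actual file):

```yaml
# Hypothetical sketch of a test-only docker-compose file. The experimenter
# container gets throwaway Redis and Mongo instances on their default
# ports, so tests exercise real connections without touching the lab's
# shared services.
version: "3"
services:
  redis:
    image: redis:5
    ports:
      - "6379:6379"
  mongo:
    image: mongo:4
    ports:
      - "27017:27017"
  experimenter:
    image: registry.datadrivendiscovery.org/experimenter  # placeholder name
    depends_on:
      - redis
      - mongo
```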

epeters3 commented 5 years ago

Unless Brandon wants to review it first, too.