Psteinberg/split builds

PeterDSteinberg commented 8 years ago

Makes separate subparsers, one for the build of packages ('build') and one for the splitting of packages into distinct trees of a target size ('split').

Here is an example of the split action that we would run before submitting builds to anaconda build:

$ python build2.py ./ split -t 5 -s somejs.js && cat somejs.js
{"libnetcdf": ["curl", "cmake", "hdf5", "zlib"], "pysam": ["python", "cython", "cmake", "zlib"]}

$ python build2.py ./ split -t 10 -s somejs.js && cat somejs.js
{"pysam": ["curl", "cmake", "hdf5", "zlib", "libnetcdf", "python", "cython"]}

-t is the target number of packages per group, and -s is the name of the file in which the splits are saved as a JSON dict.

The other usage pattern is the build action, which actually does the build (as it would be called from .binstar.yml):

# to build all files in dir ./
python build2.py ./ build -buildall

# to build only the hdf5 package
python build2.py ./ build -build hdf5

# to build all the packages in a key of a json created by the split method mentioned above
# libnetcdf must be a key in somejs.js
python build2.py ./ build -json-file-key somejs.js libnetcdf
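
For readers unfamiliar with subparsers, here is a minimal argparse sketch that accepts the command lines shown above. It is not the code in build2.py: the function name make_parser, the dest names (targetnum, split_file, package, json_file_key) and the choice to make the three build options mutually exclusive are assumptions for illustration. Only flags that appear in the examples above are used.

```python
import argparse


def make_parser():
    """Hypothetical parser layout: <path> {split,build} [options]."""
    parser = argparse.ArgumentParser(prog="build2.py")
    parser.add_argument("path", help="directory containing the packages")
    subparsers = parser.add_subparsers(dest="action", required=True)

    # 'split': group packages into trees of a target size.
    split = subparsers.add_parser("split")
    split.add_argument("-t", dest="targetnum", type=int,
                       help="target number of packages per group")
    split.add_argument("-s", dest="split_file",
                       help="file in which to save the splits as JSON")

    # 'build': build everything, one package, or one key of a split file.
    build = subparsers.add_parser("build")
    which = build.add_mutually_exclusive_group(required=True)  # assumption
    which.add_argument("-buildall", action="store_true")
    which.add_argument("-build", dest="package")
    which.add_argument("-json-file-key", dest="json_file_key", nargs=2,
                       metavar=("FILE", "KEY"))
    return parser


if __name__ == "__main__":
    print(make_parser().parse_args())
```

With this layout the target path has to come before the action name, which matches the examples above.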

msarahan commented 8 years ago

Thanks for working on this. A couple of comments:

The syntax of the command coming after the target path throws me off.
I'm not clear on the purpose of splitting. What is its purpose? In my mind, the "natural" way of doing things is a topological sort. The "right" number of packages at each level is dependent completely on the tree, and trying to force that information feels counterintuitive to me. How does this fit with a topological sort? Do we have one? I thought I understood from Ryan that we do.

PeterDSteinberg commented 8 years ago

The split aims to produce jobs that run for at most about 30 to 60 minutes, which gives the greatest stability on the build workers. The idea is to sort by the high-level nodes (those that require the most dependency builds) and to find their successors recursively. The split command produces a JSON file, and the list at each key gives the dependencies in topologically correct install order for that tree, e.g.:

To build / test libnetcdf, first install the list of dependencies from beginning to end, then install libnetcdf:

"libnetcdf": ["curl", "cmake", "hdf5", "zlib"]

Finally, smaller tree branches are added together into one job (see coalesce). This is controlled by setting the -targetnum per split.
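
As a rough illustration of the idea (not the implementation in this PR), the sketch below orders each top-level package's dependencies depth-first, then greedily merges trees into one job while the merged package count stays within the target. The dependency graph REQUIRES, the function names and the greedy merge rule are all assumptions made for the example; the real coalesce step may behave differently.

```python
# Hypothetical dependency graph: package -> packages it requires.
REQUIRES = {
    "libnetcdf": ["curl", "hdf5"],
    "hdf5": ["zlib"],
    "pysam": ["python", "cython", "zlib"],
    "cython": ["python"],
    "curl": [],
    "zlib": [],
    "python": [],
}


def install_order(package, requires):
    """Dependencies of `package`, each listed after its own dependencies."""
    order, seen = [], set()

    def visit(name):
        for dep in requires.get(name, []):
            if dep not in seen:
                seen.add(dep)
                visit(dep)          # install what `dep` needs first
                order.append(dep)   # then `dep` itself

    visit(package)
    return order


def split(requires, target):
    """Group top-level packages into jobs of at most `target` packages."""
    needed = {dep for deps in requires.values() for dep in deps}
    roots = [pkg for pkg in requires if pkg not in needed]
    # One job per top-level package: its dependencies, then the package.
    jobs = [install_order(root, requires) + [root] for root in roots]
    jobs.sort(key=len, reverse=True)  # largest trees first

    groups, pending = {}, []
    for job in jobs:
        merged = pending + [pkg for pkg in job if pkg not in pending]
        if pending and len(merged) > target:
            # Too big to merge: emit the pending job, keyed by its last package.
            groups[pending[-1]] = pending[:-1]
            pending = job
        else:
            pending = merged
    if pending:
        groups[pending[-1]] = pending[:-1]
    return groups


if __name__ == "__main__":
    import json
    print(json.dumps(split(REQUIRES, 4)))
    print(json.dumps(split(REQUIRES, 10)))
```

With a small target the two trees stay separate; with a larger one they collapse into a single job keyed by the last top-level package, with the earlier one (libnetcdf here) installed along the way.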

msarahan commented 8 years ago

I see. That makes good sense. I wonder if it is worthwhile to track build times for each package somewhere. Most packages are very quick, but some (Qt, for example) take 20-40 min. Knowing estimates of these might make your approach work better.

PeterDSteinberg commented 8 years ago

Yes, the build times are being logged, so after a few builds we can go through the logs where the times are printed and put them in the extra: dict of each package's meta.yaml. @groutr started work on that yesterday.

msarahan commented 8 years ago

Thanks!

ContinuumIO / ProtoCI

Psteinberg/split builds #10