conda-forge / google-cloud-cpp-feedstock

A conda-smithy repository for google-cloud-cpp.
BSD 3-Clause "New" or "Revised" License

Need to split this even more #164

Closed coryan closed 7 months ago

coryan commented 7 months ago

Comment:

#162 required multiple attempts, and the last one succeeded with only 20m (out of 360m) to spare. We need to split this build even more, or otherwise reduce the total build time.

coryan commented 7 months ago

I made some more measurements locally. The real time is in seconds:

monitoring real 364.12 user 3.91 sys 7.93
retail real 339.48 user 3.34 sys 6.37
appengine real 335.07 user 3.25 sys 6.29
contentwarehouse real 265.08 user 3.65 sys 5.78
asset real 256.28 user 4.37 sys 5.52
discoveryengine real 242.31 user 2.82 sys 4.98
resourcemanager real 235.46 user 2.14 sys 4.12
dataproc real 233.32 user 2.62 sys 4.09
sql real 231.45 user 1.92 sys 4.46
run real 196.32 user 2.21 sys 3.90
datacatalog real 177.49 user 2.39 sys 3.98
talent real 169.42 user 1.94 sys 3.84
notebooks real 157.60 user 2.34 sys 3.34
video real 149.06 user 2.34 sys 3.35
dataplex real 147.86 user 2.56 sys 3.20
osconfig real 140.82 user 2.18 sys 3.42
gkemulticloud real 133.87 user 1.95 sys 2.61
kms real 125.52 user 1.50 sys 2.67
channel real 118.81 user 2.04 sys 2.83
logging real 116.79 user 1.58 sys 2.69
binaryauthorization real 111.75 user 2.15 sys 3.95
automl real 111.37 user 1.91 sys 3.35
beyondcorp real 109.89 user 1.40 sys 2.44
securitycenter real 106.72 user 2.57 sys 4.44
support real 97.51 user 1.24 sys 2.63
servicecontrol real 96.25 user 1.43 sys 2.94
containeranalysis real 95.25 user 1.97 sys 3.52
billing real 92.59 user 1.21 sys 2.10
cloudbuild real 90.50 user 1.51 sys 1.92
artifactregistry real 87.13 user 1.86 sys 3.23
netapp real 85.49 user 1.63 sys 2.31
metastore real 85.39 user 1.29 sys 1.87
vision real 83.63 user 1.41 sys 2.14
baremetalsolution real 83.02 user 1.63 sys 2.33
eventarc real 82.69 user 1.32 sys 2.09
datastore real 82.48 user 1.34 sys 2.19
networkservices real 81.10 user 1.55 sys 2.26
functions real 75.81 user 1.22 sys 1.76
tpu real 75.32 user 1.20 sys 1.70
vmwareengine real 71.77 user 1.69 sys 1.51
connectors real 71.75 user 1.46 sys 2.12
redis real 71.10 user 1.07 sys 1.63
composer real 69.82 user 1.11 sys 1.85
gkebackup real 68.88 user 1.67 sys 2.25
servicedirectory real 68.70 user 1.13 sys 1.79
datamigration real 68.35 user 1.57 sys 1.65
deploy real 67.70 user 1.84 sys 2.20
documentai real 67.31 user 1.48 sys 2.22
workflows real 63.89 user 1.12 sys 1.79
migrationcenter real 62.72 user 1.42 sys 1.37
iap real 62.59 user 0.83 sys 1.52
vmmigration real 62.06 user 1.40 sys 1.40
accesscontextmanager real 60.75 user 1.28 sys 1.77
language real 60.01 user 0.91 sys 1.61
trace real 59.80 user 0.93 sys 1.70
contactcenterinsights real 57.66 user 1.29 sys 1.45
networksecurity real 55.00 user 1.07 sys 1.69
profiler real 53.80 user 0.75 sys 1.53
alloydb real 53.61 user 1.21 sys 1.41
datastream real 51.99 user 1.16 sys 1.37
privateca real 51.00 user 1.14 sys 1.33
container real 50.89 user 1.38 sys 1.20
networkconnectivity real 50.36 user 1.09 sys 1.44
certificatemanager real 50.11 user 1.10 sys 1.45
edgenetwork real 49.79 user 1.04 sys 1.40
translate real 49.24 user 1.08 sys 1.45
gkehub real 48.65 user 1.14 sys 1.53
recommender real 47.82 user 1.04 sys 1.58
servicemanagement real 47.71 user 0.94 sys 1.39
batch real 46.76 user 1.00 sys 1.52
telcoautomation real 46.36 user 1.04 sys 1.24
edgecontainer real 45.60 user 0.96 sys 1.31
apigateway real 45.15 user 0.95 sys 1.33
workstations real 43.54 user 0.99 sys 1.30
optimization real 43.21 user 0.98 sys 1.23
config real 43.04 user 1.04 sys 1.19
tasks real 42.68 user 0.92 sys 1.44
websecurityscanner real 42.62 user 1.24 sys 2.03
networkmanagement real 41.69 user 1.02 sys 1.36
apikeys real 41.28 user 0.77 sys 1.35
filestore real 41.09 user 0.89 sys 1.18
serviceusage real 41.06 user 0.90 sys 1.28
domains real 40.17 user 0.87 sys 1.17
managedidentities real 40.13 user 0.92 sys 1.19
rapidmigrationassessment real 40.12 user 0.89 sys 1.20
policysimulator real 38.16 user 0.81 sys 1.27
assuredworkloads real 38.06 user 0.81 sys 1.20
commerce real 38.01 user 0.80 sys 1.25
securesourcemanager real 37.98 user 0.88 sys 1.17
secretmanager real 37.68 user 0.80 sys 1.41
scheduler real 37.60 user 0.85 sys 1.26
memcache real 37.23 user 0.85 sys 1.14
recaptchaenterprise real 37.13 user 0.81 sys 1.27
datafusion real 36.39 user 0.82 sys 1.16
videointelligence real 35.74 user 0.87 sys 1.17
securitycentermanagement real 35.68 user 0.76 sys 1.20
shell real 35.47 user 0.86 sys 1.11
ids real 35.13 user 0.74 sys 1.19
webrisk real 35.09 user 0.76 sys 1.16
orgpolicy real 34.89 user 0.78 sys 1.20
cloudquotas real 34.59 user 0.74 sys 1.27
vpcaccess real 34.09 user 0.79 sys 1.11
servicehealth real 34.00 user 0.85 sys 1.28
accessapproval real 33.58 user 0.87 sys 1.10
apigeeconnect real 32.61 user 0.76 sys 1.17
oslogin real 32.38 user 0.73 sys 1.19
texttospeech real 31.86 user 0.82 sys 1.20
essentialcontacts real 31.35 user 0.70 sys 1.20
advisorynotifications real 30.96 user 0.78 sys 1.10
timeseriesinsights real 30.41 user 0.73 sys 1.17
confidentialcomputing real 29.59 user 0.73 sys 1.12
resourcesettings real 28.70 user 0.66 sys 1.11
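One way to turn these measurements into a sharding decision is a classic multiway-partition heuristic. The sketch below is a hypothetical illustration, not anything the feedstock actually uses: it applies the longest-processing-time (LPT) greedy rule to a subset of the timings above, assigning each feature to whichever shard currently has the least total build time. The shard count of 3 is an arbitrary assumption.

```python
# Hypothetical sketch: greedy LPT partitioning of per-feature build times
# into a fixed number of feedstock shards.
import heapq

# (feature, local "real" build time in seconds) -- subset of the table above
timings = [
    ("monitoring", 364.12), ("retail", 339.48), ("appengine", 335.07),
    ("contentwarehouse", 265.08), ("asset", 256.28), ("discoveryengine", 242.31),
    ("resourcemanager", 235.46), ("dataproc", 233.32), ("sql", 231.45),
    ("run", 196.32), ("datacatalog", 177.49), ("talent", 169.42),
]

def partition(timings, shards):
    """Assign each feature to the currently-lightest shard (LPT heuristic)."""
    heap = [(0.0, i, []) for i in range(shards)]  # (total_secs, shard_id, features)
    heapq.heapify(heap)
    # Place the most expensive features first, each onto the lightest shard.
    for name, secs in sorted(timings, key=lambda t: -t[1]):
        total, i, feats = heapq.heappop(heap)
        feats.append(name)
        heapq.heappush(heap, (total + secs, i, feats))
    return sorted(heap)

for total, i, feats in partition(timings, 3):
    print(f"shard {i}: {total:7.2f}s  {feats}")
```

With real sequential builds the shard totals are only a rough proxy for wall-clock CI time, but the heuristic gives a quick sanity check on how balanced a proposed split would be.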

No obvious grouping suggests itself, but maybe monitoring, trace, and logging could go into google-cloud-cpp-core-feedstock, as we are planning to add tracing to almost every library later this year:

monitoring real 364.12 user 3.91 sys 7.93
logging real 116.79 user 1.58 sys 2.69
trace real 59.80 user 0.93 sys 1.70

To compare, consider existing libraries that we have sharded already:

bigquery real 324.82 user 3.53 sys 6.29

That is comparable to this build:

https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=880141&view=results

I estimate that moving any of the libraries in the 300s range (e.g. retail or appengine) may save about 20m in the linux_aarch64 builds.
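The ~20m figure is plausible as a back-of-the-envelope estimate if the linux_aarch64 builds pay a large emulation penalty over a native local build. The slowdown factor below is an assumption for illustration, not a measured value:

```python
# Hypothetical back-of-the-envelope: a feature that builds in ~300s locally
# could cost far more in an emulated linux_aarch64 CI build.
local_seconds = 335        # e.g. appengine, from the table above
assumed_slowdown = 3.5     # assumed qemu-emulation penalty (not measured)
ci_minutes = local_seconds * assumed_slowdown / 60
print(f"estimated CI cost: ~{ci_minutes:.0f} minutes")  # ~20 minutes
```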

@dbolduc any thoughts on good sharding strategies here?

h-vetinari commented 7 months ago

Thanks for opening the issue & collecting the timings! In terms of strategies, we should also consider that not everything we peel off must go into a new feedstock; it can also go into any of the existing ones.

coryan commented 7 months ago

we should also consider that not everything we peel off must go into a new feedstock, it can also go into any of the existing ones:

Absolutely.

I did run into problems trying to build a lot of sub-packages in google-cloud-cpp-ai-feedstock. Eventually I just reduced the number of subpackages. See, for example, https://github.com/conda-forge/staged-recipes/pull/24843/commits/51e09f6c3de7d7342e56f4fe9e1b237f19a84a64

That may have been a limitation of the staged-recipes repository, or something I mistyped. I think it is worthwhile to grow the existing feedstocks, but we should be aware of this potential pitfall.

coryan commented 7 months ago

FWIW, #165 looks promising. The CI completed in under 5 hours. I expect the build time will go up again once the -core-feedstock gains more subpackages and I change #165 to depend on them, but hopefully not by another hour.

@h-vetinari any problems if we merge https://github.com/conda-forge/google-cloud-cpp-core-feedstock/pull/11 and then merge #165 (with the right additions)?

h-vetinari commented 7 months ago

@h-vetinari any problems if we merge conda-forge/google-cloud-cpp-core-feedstock#11 and then merge #165 (with the right additions)?

Sounds like a plan! Go for it!

coryan commented 7 months ago

After #165 the slowest build took 5h31m. Still fairly tight. The next step is to move retail and the other large features to google-cloud-cpp-ai-feedstock.

Interestingly, the slowest builds on #170 are not the same builds that are slow on #165.

h-vetinari commented 7 months ago

Interestingly, the slowest builds on #170 are not the same builds that are slow on #165.

Azure Pipelines has at least two different kinds of agents, and the weaker kind is roughly 50% slower than the faster one. On OSX there are also some rare agents that take 3-4x as long as the regular ones.

coryan commented 7 months ago

Azure Pipelines has at least two different kinds of agents, and the weaker kind is roughly 50% slower than the faster one. On OSX there are also some rare agents that take 3-4x as long as the regular ones.

I expected something like that. It just makes it hard to convince myself an improvement is real.

In any case, with #171 the build times are under 5 hours. I think we can close this bug once that gets merged. I do expect we will need to rebalance things from time to time.

@dbolduc any plans to enable OTel should budget some time for rebalancing the builds. We can create new feedstocks if needed, or use -compute for other compute services (e.g. appengine, cloud run, cloud batch, cloud TPU, and anything GKE-related, which are all arguably about configuring compute resources).
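The "route compute-flavored features to -compute" idea could be sketched as a simple keyword heuristic. Everything below is an assumption for illustration: the hint list, the helper name `target_feedstock`, and the `google-cloud-cpp-compute-feedstock` destination are hypothetical, not actual feedstock policy.

```python
# Hypothetical sketch of the keyword-based grouping suggested above: features
# whose names look compute-related get routed to a -compute feedstock.
COMPUTE_HINTS = ("appengine", "run", "batch", "tpu", "gke")

def target_feedstock(feature: str) -> str:
    """Return the feedstock a feature might land in under this heuristic."""
    if any(hint in feature for hint in COMPUTE_HINTS):
        return "google-cloud-cpp-compute-feedstock"
    return "google-cloud-cpp-feedstock"  # default: stay put

for f in ("appengine", "tpu", "gkehub", "gkebackup", "monitoring"):
    print(f, "->", target_feedstock(f))
```

A real split would need a curated list rather than substring matching (e.g. "run" would also match unrelated names), but this shows how cheaply a first-pass grouping could be generated from the feature list.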