Closed amanda11 closed 2 years ago
Successfully tested st2 3.5dev on EL7 & EL8 on vagrant, with the self-check and manual Web testing.
(I confused my Github accounts and deleted the comment from my other one, mweinberg-cm.)
Successfully tested st2 3.5dev on U18 & U20 in vagrant, and the one-line installer on U20.
Tested:
Problem found: the upgrade doesn't update the nginx st2.conf, so upgraded systems don't get the TLS changes. Generating a doc PR to mention the steps to configure this, and also that /etc st2.conf on EL won't get updated on package update, while on Ubuntu you get asked what to do.
Also finding a problem with aws pack when reload config after upgrade due to duplicate key "headers". After upgrade, running st2ctl reload --register-all will complain if any installed actions have duplicate parameters.
Will add a note to upgrade_notes (https://github.com/StackStorm/st2docs/pull/1077) Issue raised on AWS pack: https://github.com/StackStorm-Exchange/stackstorm-aws/issues/114
Also finding a problem with aws pack when reload config after upgrade due to duplicate key "headers". After upgrade, running: st2ctl reload --register-all will complain if any installed actions have duplicate parameters.
Background:
@blag mentioned that issue in slack (#exchange) on May 25th https://stackstorm-community.slack.com/archives/C01GD259JMP/p1621963277006700
A recent change in Orquesta has uncovered a few bugs in Exchange packs. In increasing order of difficulty…
The LastLine pack: https://github.com/StackStorm-Exchange/stackstorm-lastline/blob/master/actions/submit_url.yaml The verify key is duplicated in that file. The bugfix would be to remove it. That's an easy bug to fix if somebody would like to take it on.
The FreeIPA pack: https://github.com/StackStorm-Exchange/stackstorm-freeipa/blob/master/actions/env.yaml There are two server keys in that file, and we need to figure out which one to remove.
The AWS pack: https://github.com/StackStorm-Exchange/stackstorm-aws/blob/master/actions/apigateway_test_invoke_authorizer.yaml There are two headers keys in that file, and we need to figure out which one to remove. This could be as simple as regenerating the actions, but it can turn into a rabbit hole.
I believe he noticed these issues due to exchange CI failures, so only 3 exchange packs should have this issue: LastLine, FreeIPA, and AWS.
There are actually 2 PRs involved:
The orquesta change showed how to register the UniqueKeyLoader so that it is always used.
The st2 change used that same method so that all yaml files will be loaded with the UniqueKeyLoader. UniqueKeyLoader was only used for loading the openapi specs before the st2 change. So, failing to register actions was an unexpected behavioral change caused by that st2 PR.
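For anyone who wants to check their own packs before upgrading, a loader that rejects repeated mapping keys can be sketched with PyYAML roughly as follows. This is only an illustration of the idea, not the actual UniqueKeyLoader from orquesta or st2; the names UniqueKeySafeLoader, DuplicateKeyError and find_duplicate are hypothetical and chosen here for the example.

```python
from collections.abc import Hashable

import yaml


class DuplicateKeyError(yaml.constructor.ConstructorError):
    """Raised when a YAML mapping repeats a key (hypothetical name)."""


class UniqueKeySafeLoader(yaml.SafeLoader):
    """SafeLoader variant that refuses mappings with repeated keys."""

    def construct_mapping(self, node, deep=False):
        if isinstance(node, yaml.MappingNode):
            seen = set()
            for key_node, _value_node in node.value:
                key = self.construct_object(key_node, deep=deep)
                # Unhashable keys are left for the parent class to report.
                if not isinstance(key, Hashable):
                    continue
                if key in seen:
                    raise DuplicateKeyError(
                        None, None, f"duplicate key {key!r}", key_node.start_mark
                    )
                seen.add(key)
        return super().construct_mapping(node, deep=deep)


def find_duplicate(text):
    """Return the duplicate-key message for a YAML string, or None."""
    try:
        yaml.load(text, Loader=UniqueKeySafeLoader)
    except DuplicateKeyError as exc:
        return exc.problem
    return None


print(find_duplicate("headers: a\nheaders: b\n"))
```

Running find_duplicate over the contents of each action metadata or config file in a pack should point at the kind of repeated key described above, under the assumption that the files are plain YAML mappings.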
Thanks - I've raised issues on the 3 packs with "help wanted" label, and identifying them as incompatible at moment with 3.5.
On the upgraded EL8 system, tested editing workflows from the examples that had retry/delay in them, all OK. Also created new workflows with retry and delay, OK.
Issue found on attempting upgrade with migration script - /opt/stackstorm/st2/bin/st2-migrate-db-dict-field-values isn't present on system after upgrading packages or on install. We seem to be missing code that picks up the version specific migration scripts to include in the bin directory.
Seeing problems installing slack pack on focal, due to complaints about libxml. No issues on EL8 or bionic. Might be to do with hardcoding of version of libxml in pack, rather than anything to do with st2 itself. Other packs install fine.
Currently validating a manual install of single node on EL8. Install went fine, just a minor doc change PR raised as found mention of ewc: https://github.com/StackStorm/st2docs/pull/1078. Need to do some more validation checks but all good so far...
Issue found on attempting upgrade with migration script - /opt/stackstorm/st2/bin/st2-migrate-db-dict-field-values isn't present on system after upgrading packages or on install. We seem to be missing code that picks up the version specific migration scripts to include in the bin directory.
Good catch.
Could be that we need to update rules in st2-packages.
I thought we only need to do that if we want to install something into /usr/local/bin or similar, but should work out of the box for the virtual environment (since we just use setup.py metadata for that), but I could be wrong.
@Kami it looks like in past we've put st2common/bin/migrations/
@armab Edit, actually I believe the right path is /opt/stackstorm/st2/bin/st2common/bin/migrations/v3.5/st2-migrate-db-dict-field-values. Can you please verify? Maybe just the upgrade notes need to be updated.
EDIT 2: Yeah I'm also just looking through git history in st2-packages and don't see anything which would indicate how that ever worked in the past :)
Ah, so yeah, we utilized setup.py metadata, but looks like we removed old entries a while back (https://github.com/StackStorm/st2/commits/bfb506363f1bc3e6cdd4f20d012ad85b7fc166af).
Will open st2 pr.
From manual install:
# pwd
/opt/stackstorm
[root@el8-man-install stackstorm]# find . -name "st2*migrate*db*dict*"
[root@el8-man-install stackstorm]#
Upgrade on EL8 using the migration script successful - note - didn't have huge number of executions as was fresh system created just for testing upgrade. Multiple get command tested.
Some quick testing with Focal manual install and Bionic bash installer. Included integration with chatops setup to slack.
Fix for the Docker nginx config patch, based on recent nginx changes in stackstorm/st2: https://github.com/StackStorm/st2-dockerfiles/pull/48
Successfully tested:
st2
for example on focal Was able to test the fixes for the Slack pack successfully.
Will do some more tests regarding upgrades from 3.4 to 3.5dev tomorrow.
Just verified that the TLS 1.3 support works in Vagrant and via the bash installer.
Tested manual and bash installs on EL7, and upgrade on EL7.
Validated that on xenial, trying to install unstable complains about an unsupported version.
Manual U18 ok.
Myself and @winem both tested U18 upgrade
Thanks - I've raised issues on the 3 packs with "help wanted" label, and identifying them as incompatible at moment with 3.5.
I can also confirm that I bumped into the 'duplicate keys' problem with pack configuration files.
I was running st2 pack install .... to install a pack.
The duplicate key was in a config file.
It was a little difficult to diagnose because I had to dig through the logs and find the key that was the error.
The pack install command only returned a generic 500 Internal Server Error.
So we are NOT going to allow duplicate keys then? Fine with me, it is just a little hard to debug when one has to dig into the logs if it is the config file that has issues.
With pack config problems I have found that you get more error information on st2ctl reload --register-configs when you put in a 'bad config', compared to installing a pack where the config file is already present. I've found that for other errors in pack config, as opposed to duplicate keys, but I wonder if it's the same.
Rejecting duplicate keys is not good, and we do have mention of it in the upgrade notes - and I'll make sure it's in the release blog as well.
We're ready to prepare the StackStorm v3.5 release and start pre-release testing.
Release Process Preparation
Per Release Management Schedule @amanda11 is the Release Manager and @winem is Assisting for v3.5. They will freeze the master for the major repositories in StackStorm org, follow the StackStorm Release Process which is now available to public, accompanied by the Useful Info for Release managers. Communication is happening in #releasemgmt and #development Slack channels. The first step is pre-release manual user-acceptance testing for v3.5dev.
Why Manual testing?
StackStorm is very serious about testing and has a lot of it:
That's a perfect way to verify what we already know and codify expectations about how StackStorm should function.
However it's not enough. There are always new unknowns to discover, edge cases to experience and tests to add. Hence, manual Exploratory Testing is an exercise where entire team gathers together and starts trying (or breaking) new features before the new release. Because we're all different, perceive software differently and try different things we might find new bugs, improper design, oversights, edge cases and more.
This is how StackStorm previously managed to land fewer major/critical bugs into production.
TL;DR
See the Testing Process, where it will walk you through:
Additionally, try to use that StackStorm instance as you normally would, maybe try to break it in new and interesting ways that you haven't tried before, and report any regressions found comparing to v3.4.
At this stage, the following installation methods for 3.5 are available for testing:
Extra points for PR hotfixes, reporting entirely new bugs, and missing test cases!
Specific changes to test
If you have successful test results, please post a summary of what all you tested (OSes, what features you tested).
If you run into any bugs, please open them in the respective repositories and link to this issue from there. I will add them to the list at the bottom of this description.
If you have any issues running StackStorm or running the tests, please post down below.
Major changes
Full Changelog
Changes which are recommended to ack, explore, check and try in a random way.
st2
Added
Added web header settings for additional security hardening to nginx.conf: X-Frame-Options, Strict-Transport-Security, X-XSS-Protection and server-tokens. #5183
Contributed by @shital.
Added support for limit and offset argument to the list_values data store service method (#5097 and #5171).
Contributed by @anirudhbagri.
Various additional metrics have been added to the action runner service to provide for better operational visibility. (improvement) #4846
Contributed by @Kami.
Added sensor model to list of JSON schemas auto-generated by make schemasgen that can be used by development tools to validate pack contents. (improvement)
Added the command line utility st2-validate-pack that can be used by pack developers to validate pack contents. (improvement)
Fix a bug in the API and CLI code which would prevent users from being able to retrieve resources which contain non-ascii (utf-8) characters in the names / references. (bug fix) #5189
Contributed by @Kami.
Fix a bug in the API router code and make sure we return correct and user-friendly error to the user in case we fail to parse the request URL / path because it contains invalid or incorrectly URL encoded data.
Previously such errors weren't handled correctly which meant original exception with a stack trace got propagated to the user. (bug fix) #5189
Contributed by @Kami.
Make redis the default coordinator backend.
Fix a bug in the pack config loader so that objects covered by an additionalProperties schema can use encrypted datastore keys and have their default values applied correctly. #5225
Contributed by @cognifloyd.
Add new database.compressors and database.zlib_compression_level config options which specify the compression algorithms the client supports for network / transport level compression when talking to MongoDB.
Actual compression algorithm used will then be decided by the server and depends on the algorithms which are supported by the server + client.
Possible / valid values include: zstd, zlib. Keep in mind that zstandard (zstd) is only supported by MongoDB >= 4.2.
Our official Debian and RPM packages bundle zstandard dependency by default which means setting this value to zstd should work out of the box as long as the server runs MongoDB >= 4.2. #5177
Contributed by @Kami.
Add support for compressing the payloads which are sent over the message bus. Compression is disabled by default and user can enable it by setting messaging.compression config option to one of the following values: zstd, lzma, bz2, gzip.
In most cases we recommend using zstd (zstandard) since it offers best trade off between compression ratio and number of CPU cycles spent for compression and decompression.
How this will affect the deployment and throughput is very much user specific (workflow and resources available). It may make sense to enable it when generic action trigger is enabled and when working with executions with large textual results. #5241
Contributed by @Kami.
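To get a rough feel for the ratio versus CPU trade-off on your own data, something like the sketch below can be run against a representative payload. It only uses Python standard library codecs, so zstd is not covered here (it needs the third-party zstandard package), and it does not touch the StackStorm message bus at all. The compare_codecs helper and the sample payload are made up for illustration.

```python
import bz2
import gzip
import json
import lzma
import time

# Standard library codecs only; zstd would need the zstandard package.
CODECS = {"gzip": gzip.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def compare_codecs(payload):
    """Return compression ratio and elapsed seconds per codec."""
    results = {}
    for name, compress in CODECS.items():
        started = time.perf_counter()
        compressed = compress(payload)
        elapsed = time.perf_counter() - started
        results[name] = {
            "ratio": len(payload) / len(compressed),
            "seconds": elapsed,
        }
    return results


# Hypothetical payload resembling a large, repetitive textual result.
sample = json.dumps(
    {"stdout": "line of repetitive action output\n" * 20000}
).encode("utf-8")

for codec, stats in compare_codecs(sample).items():
    print(f"{codec}: ratio={stats['ratio']:.1f} seconds={stats['seconds']:.4f}")
```

Real execution results are usually less repetitive than this sample, so the numbers are best treated as a relative comparison rather than a prediction.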
Mask secrets in output of an action execution in the API if the action has an output schema defined and one or more output parameters are marked as secret. #5250
Contributed by @mahesh-orch.
Changed
All the code has been refactored using black and black style is automatically enforced and required for all the new code. (#5156)
Contributed by @Kami.
Default nginx config (conf/nginx/st2.conf) which is used by the installer and Docker images has been updated to only support TLS v1.2 and TLS v1.3 (support for TLS v1.0 and v1.1 has been removed).
Keep in mind that TLS v1.3 will only be used when nginx is running on more recent distros where nginx is compiled against OpenSSL v1.1.1 which supports TLS 1.3. #5183 #5216
Contributed by @Kami and @shital.
Add new -x argument to the st2 execution get command which allows result field to be excluded from the output. (improvement) #4846
Update st2 execution get <id> command to also display execution log attribute which includes execution state transition information.
By default end_timestamp attribute and duration attribute displayed in the command output only include the time it took action runner to finish running actual action, but it doesn't include the time it takes action runner container to fully finish running the execution - this includes persisting execution result in the database.
For actions which return large results, there could be a substantial discrepancy - e.g. action itself could finish in 0.5 seconds, but writing data to the database could take additional 5 seconds after the action code itself was executed.
For all purposes until the execution result is persisted to the database, execution is not considered as finished.
While writing result to the database action runner is also consuming CPU cycles since serialization of large results is a CPU intensive task.
This means that "elapsed" attribute and start_timestamp + end_timestamp will make it look like actual action completed in 0.5 seconds, but in reality it took 5.5 seconds (0.5 + 5 seconds).
Log attribute can be used to determine actual duration of the execution (from start to finish). (improvement) #4846
Contributed by @Kami.
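As a rough illustration of how the log attribute could be used for this, the sketch below computes the span between the first and last state transition. The entry shape (a list of dicts with timestamp and status keys, timestamps as ISO 8601 strings) is an assumption made for the example, as are the sample values and the total_duration_seconds helper; check the actual command output before relying on it.

```python
from datetime import datetime


def total_duration_seconds(log):
    """Seconds between the earliest and latest state transition."""
    stamps = [datetime.fromisoformat(entry["timestamp"]) for entry in log]
    return (max(stamps) - min(stamps)).total_seconds()


# Hypothetical log: the action runs for 0.5 s, then the result takes
# another 5 s to be persisted before the final transition is recorded.
EXAMPLE_LOG = [
    {"timestamp": "2021-05-25T10:00:00.000000+00:00", "status": "running"},
    {"timestamp": "2021-05-25T10:00:05.500000+00:00", "status": "succeeded"},
]

print(total_duration_seconds(EXAMPLE_LOG))
```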
Various internal improvements (reducing number of DB queries, speeding up YAML parsing, using DB object cache, etc.) which should speed up pack action registration by 15-30%. This is especially pronounced with packs which have a lot of actions (e.g. aws one). (improvement) #4846
Contributed by @Kami.
Underlying database field type and storage format for the Execution, LiveAction, WorkflowExecutionDB, TaskExecutionDB and TriggerInstanceDB database models has changed.
This new format is much faster and efficient than the previous one. Users with larger executions (executions with larger results) should see the biggest improvements, but the change also scales down so there should also be improvements when reading and writing executions with small and medium sized results.
Our micro and end-to-end benchmarks have shown improvements up to 15-20x for write path (storing model in the database) and up to 10x for the read path.
To put things into perspective - with previous version, running a Python runner action which returns 8 MB result would take around 18 seconds total, but with this new storage format, it takes around 2 seconds (in this context, duration means the time from when the execution was scheduled to when the execution model and result were written and available in the database).
The difference is even larger when working with Orquesta workflows.
Overall performance improvement doesn't just mean large decrease in those operation timings, but also large overall reduction of CPU usage - previously serializing large results was a CPU intensive task since it included tons of conversions and transformations back and forth.
The new format is also around 10-20% more storage efficient which means that it should allow for larger model values (MongoDB document size limit is 16 MB).
The actual change should be fully opaque and transparent to the end users - it's purely a field storage implementation detail and the code takes care of automatically handling both formats when working with those objects.
Same field data storage optimizations have also been applied to workflow related database models which should result in the same performance improvements for Orquesta workflows which pass larger data sets / execution results around.
Trigger instance payload field has also been updated to use this new field type which should result in lower CPU utilization and better throughput of rules engine service when working with triggers with larger payloads.
This should address a long standing issue where StackStorm was reported to be slow and CPU inefficient with handling large executions.
If you want to migrate existing database objects to utilize the new type, you can use st2common/bin/migrations/v3.5/st2-migrate-db-dict-field-values migration script. (improvement) #4846
Contributed by @Kami.
Add new result_size field to the ActionExecutionDB model. This field will only be populated for executions which utilize new field storage format.
It holds the size of serialized execution result field in bytes. This field will allow us to implement more efficient execution result retrieval and provide better UX since we will be able to avoid loading execution results in the WebUI for executions with very large results (which cause browser to freeze). (improvement) #4846
Contributed by @Kami.
Add new /v1/executions/<id>/result[?download=1&compress=1&pretty_format=1] API endpoint which can be used to retrieve or download raw execution result as (compressed) JSON file.
This endpoint will primarily be used by st2web when executions produce very large results so we can avoid loading, parsing and formatting those very large results as JSON in the browser which freezes the browser window / tab. (improvement) #4846
Contributed by @Kami.
Update jinja2 dependency to the latest stable version (2.11.3). #5195
Update pyyaml dependency to the latest stable version (5.4). #5207
Update various dependencies to latest stable versions (bcrypt, apscheduler, pytz, python-dateutil, psutil, passlib, gunicorn, flex, cryptography, eventlet, greenlet, webob, mongoengine, pymongo, requests, pyyaml, kombu, amqp, python-ldap).
#5215, https://github.com/StackStorm/st2-auth-ldap/pull/94
Contributed by @Kami.
Update code and dependencies so it supports Python 3.8 and Mongo DB 4.4 #5177
Contributed by @nzloshm @winem @Kami.
StackStorm Web UI (st2web) has been updated to not render and display execution results larger than 200 KB directly in the history panel in the right side bar by default anymore. Instead a link to view or download the raw result is displayed.
Execution result widget was never optimized to display very large results (especially for executions which return large nested dictionaries) so it would freeze and hang the whole browser tab / window when trying to render / display large results.
If for some reason you want to revert to the old behavior (this is almost never a good idea since it will cause browser to freeze when trying to display large results), you can do that by setting max_execution_result_size_for_render option in the config to a very large value (e.g. max_execution_result_size_for_render: 16 * 1024 * 1024).
https://github.com/StackStorm/st2web/pull/868
Contributed by @Kami.
Some of the config option registration code has been refactored to ignore "option already registered" errors. That was done as a work around for an occasional race in the tests and also to make all of the config option registration code expose the same consistent API. #5234
Contributed by @Kami.
Update pywinrm dependency to the latest stable version (0.4.1). #5212
Contributed by @chadpatt.
Monkey patch on st2stream earlier in flow #5240
Contributed by Amanda McGuinness (@amanda11 Ammeon Solutions)
Support % in CLI arguments by reading the ConfigParser() arguments with raw=True.
This removes support for '%' interpolations on the configuration arguments.
See https://docs.python.org/3.8/library/configparser.html#configparser.ConfigParser.get for further details. #5253
Contributed by @winem.
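The effect of raw=True can be seen with the standard library alone. The sketch below is not the st2 CLI code; the section name, option name and value are hypothetical. It shows that a value containing a bare % reads back unchanged with raw=True, while the default interpolating get rejects it.

```python
import configparser

# Hypothetical config content with a literal '%' in the value.
SAMPLE = """
[credentials]
password = s3cr%t
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# raw=True skips '%' interpolation and returns the value verbatim.
raw_value = parser.get("credentials", "password", raw=True)

# The default (interpolating) get treats '%' as the start of a
# substitution and fails on a bare '%'.
try:
    parser.get("credentials", "password")
    interpolation_failed = False
except configparser.InterpolationSyntaxError:
    interpolation_failed = True

print(raw_value, interpolation_failed)
```

The flip side, as noted above, is that values which previously relied on %(name)s style substitution are no longer expanded.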
Remove duplicate host header in the nginx config for the auth endpoint.
Update orquesta to v1.4.0.
Improvements
CLI has been updated to use orjson when parsing API response and C version of the YAML safe dumper when formatting execution result for display. This should result in speed up when displaying execution result (st2 execution get, etc.) for executions with large results.
When testing it locally, the difference for execution with 8 MB result was 18 seconds vs ~6 seconds. (improvement) #4846
Contributed by @Kami.
Update various Jinja functions to utilize C version of YAML safe_{load,dump} functions and orjson for better performance. (improvement) #4846
Contributed by @Kami.
For performance reasons, use udatetime library for parsing ISO8601 / RFC3339 date strings where possible. (improvement) #4846
Contributed by @Kami.
Speed up service start up time by speeding up runners registration on service start up by re-using existing stevedore ExtensionManager instance instead of instantiating new DriverManager instance per extension which is not necessary and it's slow since it requires disk / pkg resources scan for each extension. (improvement) #5198
Contributed by @Kami.
Add new ?max_result_size query parameter filter to the GET /v1/executions/<id> API endpoint.
This query parameter allows clients to implement conditional execution result retrieval and only retrieve the result field if it's smaller than the provided value.
This comes in handy in the various client scenarios (such as st2web) where we don't display and render very large results directly since it allows to speed things up and decrease amount of data retrieved and parsed. (improvement) #5197
Contributed by @Kami.
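For client authors, a request URL using this filter could be assembled along these lines. This is only a sketch: the base URL is a placeholder, the execution_url helper is hypothetical, and treating the value as a size in bytes is an assumption based on the result_size field description rather than something verified here.

```python
from urllib.parse import urlencode


def execution_url(base_url, execution_id, max_result_size=None):
    """Build the execution GET URL, optionally with max_result_size."""
    url = f"{base_url.rstrip('/')}/v1/executions/{execution_id}"
    if max_result_size is not None:
        url += "?" + urlencode({"max_result_size": max_result_size})
    return url


# Placeholder host and execution id; 200 KB mirrors the st2web render limit.
print(execution_url("https://st2.example.com/api", "abc123", 200 * 1024))
```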
Update default nginx config which is used for proxying API requests and serving static content to only allow HTTP methods which are actually used by the services (get, post, put, delete, options, head).
If a not-allowed method is used, nginx will abort the request early and return 405 status code. #5193
Contributed by @ashwini-orchestral
Update default nginx config which is used for proxying API requests and serving static content to not allow range requests. #5193
Contributed by @ashwini-orchestral
Drop unused python dependencies: prometheus_client, python-gnupg, more-itertools, zipp. #5228
Contributed by @cognifloyd.
Update majority of the "resource get" CLI commands (e.g. st2 execution get, st2 action get, st2 rule get, st2 pack get, st2 apikey get, st2 trace get, st2 key get, st2 webhook get, st2 timer get, etc.) so they allow for retrieval and printing of information for multiple resources using the following notation: st2 <resource> get <id 1> <id 2> <id n>, e.g. st2 action get pack.show packs.get packs.delete
This change is fully backward compatible when retrieving only a single resource (aka single id is passed to the command).
When retrieving a single resource the command will throw and exit with non-zero if a resource is not found, but when retrieving multiple resources, command will just print an error and continue with printing the details of any other found resources. (new feature) #4912
Contributed by @Kami.
Fixed
Refactor spec_loader util to use yaml.load with SafeLoader. (security) Contributed by @ashwini-orchestral
Import ABC from collections.abc for Python 3.10 compatibility. (#5007) Contributed by @tirkarthi
Updated to use virtualenv 20.4.0/PIP20.3.3 and fixate-requirements to work with PIP 20.3.3 #512 Contributed by Amanda McGuinness (@amanda11 Ammeon Solutions)
Fix st2 execution get --with-schema flag. (bug fix) #4846
Contributed by @Kami.
Fix SensorTypeAPI schema to use class_name instead of name since documentation for pack development uses class_name and registrar used to load sensor to database assign class_name to name in the database model. (bug fix)
Updated paramiko version to 2.7.2, to go with updated cryptography to prevent problems with ssh keys on remote actions. #5201
Contributed by Amanda McGuinness (@amanda11 Ammeon Solutions)
Update rpm package metadata and fix Provides section for RHEL / CentOS 8 packages.
In the previous versions, RPM metadata would incorrectly signal that the st2 package provides various Python libraries which it doesn't (those Python libraries are only used internally for the package local virtual environment).
https://github.com/StackStorm/st2-packages/pull/697
Contributed by @Kami.
Make sure st2common.util.green.shell.run_command() doesn't leave stray / zombie processes lying around in some command timeout scenarios. #5220
Contributed by @r0m4n-z.
Fix support for skipping notifications for workflow actions. Previously if action metadata specified an empty list for notify parameter value, that would be ignored / not handled correctly for workflow (orquesta, action chain) actions. #5221 #5227
Contributed by @khushboobhatia01.
Clean up to remove unused methods in the action execution concurrency policies. #5268
st2web
Changed
Added
Removed
orquesta 1.4.0
Changed
Fixed
st2chatops
Changed
Added
Removed
Conclusion
Please report findings here and bugs/regressions in respective repositories. Depending on severity and importance bugs might be fixed before the release or postponed to the next release if they're very minor and not a release blocker.
Issues Found During Release
PRs Merged for Release
TODOs