AbsaOSS / enceladus

Dynamic Conformance Engine
Apache License 2.0

Feature/ecs paths mapping script #2197

Closed dk1844 closed 7 months ago

dk1844 commented 8 months ago

This PR adds a script that remaps HDFS paths based on the response of a mapping service. Its primary use is the HDFS-to-ECS migration (the defaults are set for this purpose), but the script itself is general in nature.
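The remapping idea can be sketched roughly as follows. This is not the script's actual code: `remap_path` and `fake_lookup` are hypothetical names, the request/response format of the real mapping service is not shown in this PR, and the sketch assumes the service returns a bucket-relative path to which the mapping prefix is then prepended. The default prefix and skip prefixes are taken from the `--help` output.

```python
from typing import Callable, Sequence


def remap_path(
    path: str,
    lookup: Callable[[str], str],
    mapping_prefix: str = "s3a://",
    skip_prefixes: Sequence[str] = ("s3a://", "/tmp"),
) -> str:
    """Return the remapped path, or the original path if it is skipped.

    `lookup` stands in for the call to the mapping service (assumption:
    it returns the mapped path without the scheme prefix).
    """
    # Paths that are already mapped (or temporary) are left untouched.
    if path.startswith(tuple(skip_prefixes)):
        return path
    return mapping_prefix + lookup(path)


def fake_lookup(hdfs_path: str) -> str:
    """Hypothetical stand-in for the mapping service, for illustration only."""
    return hdfs_path.replace("/bigdatahdfs/datalake/", "example-bucket/", 1)
```

Passing the lookup in as a function keeps the skip/prefix logic testable without a network call; a real run would swap `fake_lookup` for an HTTP request to the configured service URL.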

Dev-tested (naively) against a local MongoDB.

If some parts seem unrelated to this script (e.g. migration_free_only=False), note that it reuses a lot of the sibling migrate_menas.py, which is where those parts come from.

Examples

Help with params overview:

> python dataset_paths_to_ecs.py --help                                                                                                                                                        
usage: dataset_paths_to_ecs [-h] [-n] [-v] [-t TARGETDB] [-u MAPPINGSERVICE] [-p MAPPINGPREFIX] [-s SKIP_PREFIX [SKIP_PREFIX ...]] [-f {hdfsPath,hdfsPublishPath,all}] [-d DATASET_NAME [DATASET_NAME ...]] [-o] TARGET

Menas MongoDB path changes to ECS

positional arguments:
  TARGET                connection string for target MongoDB

options:
  -h, --help            show this help message and exit
  -n, --dryrun          if specified, skip the actual changes, just print what would be done. (default: False)
  -v, --verbose         prints extra information while running. (default: False)
  -t TARGETDB, --target-database TARGETDB
                        Name of db on target to be affected. (default: menas)
  -u MAPPINGSERVICE, --mapping-service-url MAPPINGSERVICE
                        Service URL to use for path change mapping. (default: https://set-your-mapping-service-here.execute-api.af-south-1.amazonaws.com/dev/map)
  -p MAPPINGPREFIX, --mapping-prefix MAPPINGPREFIX
                        This prefix will be prepended to mapped path by the Mapping service (default: s3a://)
  -s SKIP_PREFIX [SKIP_PREFIX ...], --skip-prefixes SKIP_PREFIX [SKIP_PREFIX ...]
                        Path with these prefixes will be skipped from mapping (default: ['s3a://', '/tmp'])
  -f {hdfsPath,hdfsPublishPath,all}, --fields-to-map {hdfsPath,hdfsPublishPath,all}
                        Map either item's 'hdfsPath', 'hdfsPublishPath' or 'all' (default: all)
  -d DATASET_NAME [DATASET_NAME ...], --datasets DATASET_NAME [DATASET_NAME ...]
                        list datasets names to change paths in (default: [])
  -o, --only-datasets   if specified, mapping table changes will NOT be done. (default: False)
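For orientation, a command line like the one above could be declared with `argparse` roughly as follows. This is a reconstruction from the help text only, not the script's source: the `dest` names and the `build_parser` function are assumptions, while flags, choices, defaults and help strings are copied from the output above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # ArgumentDefaultsHelpFormatter appends "(default: ...)" to each help text.
    p = argparse.ArgumentParser(
        prog="dataset_paths_to_ecs",
        description="Menas MongoDB path changes to ECS",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    p.add_argument("-n", "--dryrun", action="store_true",
                   help="if specified, skip the actual changes, just print what would be done.")
    p.add_argument("-v", "--verbose", action="store_true",
                   help="prints extra information while running.")
    p.add_argument("-t", "--target-database", dest="targetdb", metavar="TARGETDB",
                   default="menas", help="Name of db on target to be affected.")
    p.add_argument("-u", "--mapping-service-url", dest="mappingservice",
                   metavar="MAPPINGSERVICE",
                   default="https://set-your-mapping-service-here.execute-api.af-south-1.amazonaws.com/dev/map",
                   help="Service URL to use for path change mapping.")
    p.add_argument("-p", "--mapping-prefix", dest="mappingprefix", metavar="MAPPINGPREFIX",
                   default="s3a://",
                   help="This prefix will be prepended to mapped path by the Mapping service")
    p.add_argument("-s", "--skip-prefixes", dest="skipprefixes", metavar="SKIP_PREFIX",
                   nargs="+", default=["s3a://", "/tmp"],
                   help="Path with these prefixes will be skipped from mapping")
    p.add_argument("-f", "--fields-to-map", dest="fieldstomap",
                   choices=["hdfsPath", "hdfsPublishPath", "all"], default="all",
                   help="Map either item's 'hdfsPath', 'hdfsPublishPath' or 'all'")
    p.add_argument("-d", "--datasets", dest="datasets", metavar="DATASET_NAME",
                   nargs="+", default=[], help="list datasets names to change paths in")
    p.add_argument("-o", "--only-datasets", dest="onlydatasets", action="store_true",
                   help="if specified, mapping table changes will NOT be done.")
    p.add_argument("target", metavar="TARGET",
                   help="connection string for target MongoDB")
    return p
```

Because `-d` and `-s` accept one or more values (`nargs="+"`), the positional `TARGET` is safest placed before them, as in the example runs in this PR.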

Example run for datasets DM9_actn_Cd and DM9_cnsmr_accnt_Sttlmnt

-d - dataset
-t - target db
-u - mapping service URL
-o - only map datasets, not related mapping tables
-f hdfsPublishPath - only hdfsPublishPath field will get path-changed (so hdfsPath will be kept as-is).

> python dataset_paths_to_ecs.py mongodb://localhost:27017/admin -d DM9_actn_Cd DM9_cnsmr_accnt_Sttlmnt -t menas_remap_test -u https://<redacted>.execute-api.af-south-1.amazonaws.com/test/map -f hdfsPublishPath -o
Menas mongo ECS paths mapping
Running with settings: dryrun=False, verbose=False
Using mapping service at: https://<redacted>.execute-api.af-south-1.amazonaws.com/test/map
  target connection-string: mongodb://localhost:27017/admin
  target DB: menas_remap_test
Dataset names to path change (actually found db): ['DM9_actn_Cd', 'DM9_cnsmr_accnt_Sttlmnt']
Configured *NOT* to path-change related mapping tables.

Path changing of collection dataset_v1 started
Found: 3 dataset documents for a potential path change. In progress ...
Successfully migrated 3 of 3 dataset entries, failed: 0

Done.
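The effect of `-f` in the run above can be illustrated with a small sketch. The helper names (`select_fields`, `build_update`) and the document shape are assumptions made for illustration; only the field names `hdfsPath` and `hdfsPublishPath` and the `all` choice come from the help output.

```python
from typing import Callable, Dict, List

PATH_FIELDS = ("hdfsPath", "hdfsPublishPath")


def select_fields(fields_to_map: str) -> List[str]:
    """Translate the -f value into the list of document fields to remap."""
    if fields_to_map == "all":
        return list(PATH_FIELDS)
    if fields_to_map in PATH_FIELDS:
        return [fields_to_map]
    raise ValueError(f"unsupported fields-to-map value: {fields_to_map!r}")


def build_update(doc: dict, fields_to_map: str,
                 remap: Callable[[str], str]) -> Dict[str, str]:
    """Return {field: new_path} for selected fields whose path actually changes."""
    changes = {}
    for field in select_fields(fields_to_map):
        old = doc.get(field)
        if old is None:
            # e.g. a document that carries only hdfsPath
            continue
        new = remap(old)
        if new != old:
            changes[field] = new
    return changes
```

With `-f hdfsPublishPath`, only that field ends up in the returned changes, so `hdfsPath` stays as-is.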

Example run for dataset XMSK083 - has mapping table ties:

-d - dataset
-t - target db
-u - mapping service URL
-n - dryrun (just print)
-v - verbose

> python dataset_paths_to_ecs.py mongodb://localhost:27017/admin -d XMSK083 -t menas_remap_test -v -u https://<redacted>.execute-api.af-south-1.amazonaws.com/test/map -n    
Menas mongo ECS paths mapping
Running with settings: dryrun=True, verbose=True
Using mapping service at: https://<redacted>.execute-api.af-south-1.amazonaws.com/test/map
  target connection-string: mongodb://localhost:27017/admin
  target DB: menas_remap_test
Dataset names given: ['XMSK083']
Dataset names to path change (actually found db): ['XMSK083']
MTs to path change: ['SourceSystemMappingTable']

Path changing of collection dataset_v1 started
Found: 1 dataset documents for a potential path change. In progress ...
Changing paths for dataset 'XMSK083' v5 (_id=5bbc544b2cdc7510a4930f1f).
  *would set* hdfsPath: /bigdatahdfs/datalake/raw/cpf/XMSK083/ -> s3a://<redacted>-prod-edla-cpf-za/raw/XMSK083/, hdfsPublishPath: /bigdatahdfs/datalake/publish/cpf/XMSK083/ -> s3a://<redacted>-prod-edla-cpf-za/publish/XMSK083/

Successfully migrated 0 of 1 dataset entries, failed: 0

Path changing of collection mapping_table_v1 started
Found: 2 mapping table documents for a potential path change. In progress ...
Changing paths for mapping table 'SourceSystemMappingTable' v5 (_id=5b6d732ba43a28a6151422aa).
  *would set* hdfsPath: /bigdatahdfs/datalake/common/mdrc/publish/LATEST/SourceSystemMapping -> s3a://<redacted>-prod-edla-mdrc-za/common/publish/LATEST/SourceSystemMapping/

Changing paths for mapping table 'SourceSystemMappingTable' v1 (_id=5abbaa1e8cdba293c9f0b5a3).
  *would set* hdfsPath: /bigdatahdfs/datalake/common/mdrc/publish/LATEST5/SourceSystemMapping -> s3a://<redacted>-prod-edla-mdrc-za/common/publish/LATEST5/SourceSystemMapping/

Successfully migrated 0 of 2 mapping table entries, failed: 0

Done.
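The dry-run behaviour seen above (changes are printed as *would set* and the migrated count stays at 0) could be structured along these lines. This is a sketch under assumptions, not the script's code: `apply_changes` is a hypothetical name, and plain dicts stand in for MongoDB documents, so the in-place `doc.update(...)` is where a real run would issue a database update.

```python
from typing import Callable, Dict, Iterable


def apply_changes(
    docs: Iterable[dict],
    changes_for: Callable[[dict], Dict[str, str]],
    dryrun: bool,
    log: Callable[[str], None] = print,
) -> int:
    """Apply path changes to each document; return how many were changed.

    In dry-run mode nothing is modified and the intended changes are only logged.
    """
    migrated = 0
    for doc in docs:
        changes = changes_for(doc)
        if not changes:
            continue
        summary = ", ".join(f"{field}: {doc.get(field)} -> {new}"
                            for field, new in changes.items())
        if dryrun:
            log(f"  *would set* {summary}")
            continue
        doc.update(changes)  # stand-in for the actual MongoDB update
        migrated += 1
    return migrated
```

Keeping the dry-run check at the single point where data would be written makes it easy to verify that `-n` cannot modify anything.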

dk1844 commented 8 months ago

Skip prefixes feature has been updated:

sonarcloud[bot] commented 7 months ago

SonarCloud Quality Gate failed.

0 Bugs (rating A)
0 Vulnerabilities (rating A)
1 Security Hotspot (rating E)
3 Code Smells (rating A)

No Coverage information
0.0% Duplication


dk1844 commented 7 months ago

Merging - tested internally. The Jenkins build bears no relevance here, as this is a separate Python migration script.