forcedotcom / SFDX-Data-Move-Utility

SFDMU is a cutting-edge Salesforce data migration tool for seamless org population from other orgs or CSV files. It handles all CRUD operations on multiple related objects in one go.
BSD 3-Clause "New" or "Revised" License
451 stars 76 forks source link

[FEATURE REQUEST] Add support for UNION in SELECT statements #614

Closed R0Wi closed 1 year ago

R0Wi commented 1 year ago

Is your feature request related to a problem? Please describe.

In our current usecase we're trying to migrate SBQQ__ProductOption__c objects (so we need their corresponding Product2 references SBQQ__ConfiguredSKU__c and SBQQ__OptionalSKU__c). Because we're using our own external id fields, we are currently running something like this in our export.json:

{
    "objects": [
            {
                "query": "SELECT Id, readonly_false FROM Product2",
        "operation": "Readonly",
                "externalId": "MyExternalId__c"
            },
            {
                "query": "SELECT Id, readonly_false FROM SBQQ__ProductOption__c",
                "operation": "Upsert",
                "externalId": "MyExternalId__c"
             }
       ]
}

This works but it's quite inefficient, since we're downloading all Product2 objects instead of only the ones we really need. So the query we'd like to use is:

SELECT Id, readonly_false FROM Product2 WHERE Id IN (SELECT SBQQ__ConfiguredSKU__c FROM SBQQ__ProductOption__c) UNION
SELECT Id, readonly_false FROM Product2 WHERE Id IN (SELECT SBQQ__OptionalSKU__c FROM SBQQ__ProductOption__c)

The problem here is, that Salesforce SOQL does not support UNION and UNION ALL. The workaround described in most of the threads is to combine the data on client side.

Describe the solution you'd like

Add support for a client side implementation of UNION and/or UNION ALL which executes each statement independently and combines the results afterwards.

Contribution

If you're interested in this feature, I'd be happy to support you in the implementation šŸ‘

hknokh2 commented 1 year ago

Hello, Thank you for reaching out to me. I'm grateful for your interest in the feature and your enthusiasm to contribute to its development. Indeed, it appears to be a significant adjustment, and I'll need some time to carefully consider the potential impacts on the plugin architecture and its overall functionality. Rest assured, I will keep you informed of any updates as I progress with the evaluation. Once again, thank you for your valuable contribution and understanding. Cheers,

hknokh2 commented 1 year ago

I've made some observations on this feature from a technical standpoint.

Salesforce doesn't have internal support for multiple semi-join sub-selects with the 'OR' operator. As a result, implementing this feature would require maintaining two independent queries across the entire process, which can be challenging and may lead to regressions.

At the moment, I have decided not to pursue it, but I will keep it pending for additional future review. Thank you again for bringing it up!

Meanwhile, I recommend exploring the "objectSets" feature, which might mimic the UNION functionality if you split the query into different object sets. You can find more information about it here: https://help.sfdmu.com/full-documentation/advanced-features/multiple-object-sets

Regards

R0Wi commented 1 year ago

Thanks for your comprehensive feedback @hknokh! Just a few additional notes from my end:

Regarding objectSets

Meanwhile, I recommend exploring the "objectSets" feature, which might mimic the UNION functionality if you split the query into different object sets.

I think in our usecase this is not possible. We are pulling in the data from an org into CSV files first, so we need the Product.csv file to contain all Products appearing as SBQQ__ConfiguredSKU__c or SBQQ__OptionalSKU__c to be able to resolve our external id references. In my understanding this has to be done in only one objectSet because otherwise we're talking about two different Product.csv files.

Regarding multiple queries

I completely understand your doubts that it might be a mess to maintain tow (or more queries) throughout the whole process. I was just wondering if it would help to introduce a separate property for "additional queries" like this:

{
    "objects": [
            {
                "query": "SELECT Id, readonly_false FROM Product2 WHERE Id IN (SELECT SBQQ__ConfiguredSKU__c FROM SBQQ__ProductOption__c)",
        "operation": "Readonly",
                "externalId": "MyExternalId__c",
                "additionalQueries": [
                    "SELECT Id, readonly_false FROM Product2 WHERE Id IN (SELECT SBQQ__OptionalSKU__c FROM SBQQ__ProductOption__c)"
                ]
            },
            {
                "query": "SELECT Id, readonly_false FROM SBQQ__ProductOption__c",
                "operation": "Upsert",
                "externalId": "MyExternalId__c"
             }
       ]
}

That way we could leave the original query untouched and only add additional data to the result, if additionalQueries is set.

hknokh2 commented 1 year ago

Hello,

I understand your viewpoint. However, the complexity I referred to is not associated with parsing the union query. Parsing such a query is indeed simple and does not necessitate any additional queries. It's important to remember that each query still necessitates its own unique treatment.

The core issue lies within the plugin's extensive operations, such as field metadata detection and costruction of dependency schema, field mapping, as well as the construction of child 'where' clauses when the object is designated as a 'slave', not a 'master'. Now all operations should be done for each query independently. For now I dont see a way how it can be done all together.

At present, a significant part of the code is only designed to work with a single query. Adapting this to handle more scenarios isn't as simple as it might initially appear.

R0Wi commented 1 year ago

Got it, thanks for the explanation. It's rather a "nice to have" feature which would improve our loading performance. So currently, using SELECT Id, readonly_false FROM Product2 works fine, but the amount of data being fetched leads to quite bad performance in our case.

I will try to find some workaround and share my results here. Thanks again!

hknokh2 commented 1 year ago

Sounds good! Thanks!

hknokh commented 1 year ago

I'm closing this case for now. Please let me know if you have any further questions. Regards.

R0Wi commented 1 year ago

I think I found a workaround for my problem in the docs: ScriptObject.master. If I use the following export.json:

{
    "objects": [
            {
                "query": "SELECT Id, readonly_false FROM Product2",
        "operation": "Readonly",
                "externalId": "MyExternalId__c",
                "master": false
            },
            {
                "query": "SELECT Id, readonly_false FROM SBQQ__ProductOption__c",
                "operation": "Upsert",
                "externalId": "MyExternalId__c"
             }
       ]
}

only those Product2 records are pulled which are needed to resolve the SBQQ__ProductOption__c.SBQQ__ConfiguredSKU__c and SBQQ__ProductOption__c.SBQQ__OptionalSKU__c references.

false will enforce the Plugin to detect and process the minimum possible sub-set of records, which is required for keeping relationships between objects.

In my case the Product2.csv shrinks from ~100k records to only ~15k records, which improves the performance massively.