Azure / AzureDataLake

Samples and Docs for Azure Data Lake Store and Analytics
http://aka.ms/AzureDataLake
MIT License
139 stars 105 forks source link

USQL SUM aggregator returns null when used with user-defined aggregators #51

Open supernoodles opened 7 years ago

supernoodles commented 7 years ago

Consider the following USQL (reflecting the issue we've observed in our production development):

@data = 
    SELECT
        *
    FROM (VALUES
        ("A",(decimal?)2,   "LabelA","A:1"),
        ("A",(decimal?)null,"LabelA","A:2"),
        ("A",(decimal?)1,   "LabelA","A:3"),
        ("B",(decimal?)4,   "LabelB","B:1")) AS T(Name, Value, Type, Id);

@result = 
    SELECT 
        Name,
        SUM(Value) AS Sum,
        Type,
        AGG<AggTest.genericAggregator>(Name, string.Empty) AS RowId
    FROM @data
    GROUP BY Name, Type;

OUTPUT @result TO "/res.csv" USING Outputters.Csv(outputHeader:true);

The aggregator in this case is the the sample custom aggregator from the USQL reference doc (our production code is different, but the problem is demonstrable with the sample UDAGG code):

using Microsoft.Analytics.Interfaces;

namespace AggTest
{
    public class genericAggregator : IAggregate<string, string, string>
    {
        string AggregatedValue;

        public override void Init()
        {
            AggregatedValue = "";
        }

        public override void Accumulate(string ValueToAgg, string GroupByValue)
        {
            AggregatedValue += ValueToAgg + ",";
        }

        public override string Terminate()
        {
            // remove last comma
            return AggregatedValue.Substring(0, AggregatedValue.Length - 1);
        }
    }
}

When executed, either within an ADLA instance in Azure or using the USQL local run environment within Visual Studio 2017, the result is:

"Name","Sum","Type","RowId"
"A",,"LabelA","A,A,A"
"B",4,"LabelB","B"

The built-in USQL SUM aggregator has returned NULL rather than the expected output of 3 for the row with Name A. Removing the call to the user-defined aggregator returns a rowset with the expected SUM aggregation value of 3:

"Name","Sum","Type"
"A",3,"LabelA"
"B",4,"LabelB"

This is clearly inconsistent behaviour for the SUM aggregate which shouldn't care if a UDAGG is included in the processing of the same group. Interestingly, if the @result query is modified to:

@result = 
    SELECT 
        Name,
        AVG(Value) AS Avg,
        SUM(Value) AS Sum,
        Type,
        AGG<AggTest.genericAggregator>(Name, string.Empty) AS RowId
    FROM @data
    GROUP BY Name, Type;

Then the output is:

"Name","Avg","Sum","Type","RowId"
"A",1.5,3,"LabelA","A,A,A"
"B",4,4,"LabelB","B"

In this case the introduction of the AVG aggregator both produces the expected average as well as coaxing the SUM aggregator into also producing the correct answer! At present we've implemented two workarounds:

  1. Use the null coalescing operator within the SUM, i.e.:
    SUM(Value ?? 0.0m) AS Sum
  2. Filter the rowset to ensure that NULLs in the field to sum aren't included in the group (this has the disadvantage that if more than one SUM aggregator is used in a single query and each of the fields may have NULLs then this adds complication).

Obviously neither work around is ideal as we may care that the result of a SUM aggregate is actually NULL if there were no non-NULL values to sum in the group.

supernoodles commented 7 years ago

No response in 10 days from anyone... :-(

Now raising as a support request from within Azure Portal.

saveenr commented 7 years ago

We apologies for the late response :-( But thank you for raising this. The engineering team is investigating.