SDITools / adobeanalyticsr

R Client for Adobe Analytics API v2.0

Refactor freeform table function #99

Closed charlie-gallagher closed 2 years ago

charlie-gallagher commented 2 years ago

Description

I refactored aw_freeform_table pretty much from scratch. I say "refactored" because none of the user interface has changed, but the implementation has been completely overhauled.

I removed a number of functions and R files that were only used by this one report function, but I did my best to leave everything else alone. Other functions, such as aw_anomaly_report(), are largely unchanged.

I hope you don't mind such a large change. Starting from scratch was far from my first choice, but I couldn't make much progress with the existing function. I hope this will make development much easier and more reliable in the future.

Motivation and context

Features

New features

I added the following features, which as far as I can tell will not affect existing code but will make the package more reliable and predictable:

Changes

I changed the messaging because it is no longer possible to break up the queries by dimension.

UPDATE: I added a progress bar. It is shown only when at least 20 queries are planned, and it stays incomplete if not all planned queries are executed. It also adds the progress R package as a dependency, which may or may not be desirable, and which may affect the minimum version of R.
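A minimal sketch of the threshold behaviour described above. To stay dependency-free it uses base R's utils::txtProgressBar as a stand-in for the progress package, and make_progress, run_queries, and run_one are hypothetical names rather than functions from this PR.

```r
# Hypothetical helper: create a bar only when enough queries are planned.
# The threshold of 20 comes from the description above.
make_progress <- function(n_planned, min_queries = 20) {
  if (n_planned < min_queries) {
    return(NULL)
  }
  utils::txtProgressBar(min = 0, max = n_planned, style = 3)
}

# Hypothetical runner: advance the bar once per executed query.
# If the loop stops early, the bar simply stays incomplete.
run_queries <- function(queries, run_one) {
  pb <- make_progress(length(queries))
  results <- vector("list", length(queries))
  for (i in seq_along(queries)) {
    results[[i]] <- run_one(queries[[i]])
    if (!is.null(pb)) utils::setTxtProgressBar(pb, i)
  }
  if (!is.null(pb)) close(pb)
  results
}

# 25 planned queries is above the threshold, so a bar is drawn.
results <- run_queries(1:25, function(q) q * 2)
```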

Structure

There are three parts to the query:

  1. Convert user's inputs into a consistent format
  2. Construct individual requests
  3. Make all requests necessary to build the requested freeform table

The aw_freeform_table function is only responsible for the first part, preparing user inputs. There is a suite of functions for constructing requests, which add layers of abstraction to make them simple and predictable. Making the requests is handled by get_req_data, a function which is called recursively as needed to build the table.

The goal was to relieve the programmer of unnecessary burdens by restricting what each level is responsible for. Constructing metric containers is a good example of this. When making a new request, the programmer only has to call metric_container() with the proper arguments. The metric_container function handles the problem of lining up the metric filter IDs and the proper metrics, but it doesn't have to worry about the structure of either field within the container. That's the job of metric_elems and metric_filters. And so on.
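The layering can be sketched as follows. Only the function names metric_container, metric_elems, and metric_filters come from the description above. The arguments, the field names, and the example IDs are assumptions made for illustration and may not match the package's actual signatures.

```r
# Assumed signatures throughout; this is a sketch, not the package's code.

# Knows only the structure of a single filter entry.
metric_filters <- function(dimensions, item_ids) {
  stopifnot(length(dimensions) == length(item_ids))
  lapply(seq_along(dimensions), function(i) {
    list(
      id = as.character(i - 1),
      type = "breakdown",
      dimension = dimensions[[i]],
      itemId = item_ids[[i]]
    )
  })
}

# Knows only the structure of a single metric entry.
metric_elems <- function(metrics, filter_ids) {
  lapply(seq_along(metrics), function(i) {
    elem <- list(columnId = as.character(i - 1), id = metrics[[i]])
    if (length(filter_ids) > 0) elem$filters <- as.list(filter_ids)
    elem
  })
}

# Lines up filter IDs with metrics; delegates the field structure.
metric_container <- function(metrics,
                             dimensions = character(),
                             item_ids = character()) {
  filters <- metric_filters(dimensions, item_ids)
  filter_ids <- vapply(filters, function(f) f$id, character(1))
  container <- list(metrics = metric_elems(metrics, filter_ids))
  if (length(filters) > 0) container$metricFilters <- filters
  container
}

# Example values are made up.
mc <- metric_container(
  metrics = c("metrics/visits", "metrics/pageviews"),
  dimensions = "variables/page",
  item_ids = "12345"
)
```

The caller passes plain vectors and never touches the nested list structure, which is the division of responsibility the paragraph above describes.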

Querying the Data

The new flow is different from the old one. The old structure was like a breadth-first search: it gathered all of the dimension values at one level before it started querying the next level. The new structure is like a depth-first search, because it gathers the data for one combination of dimension values completely before moving on to the next one.

For example, if you have dim1, dim2, and dim3, the old version is like this:

# Gather all values of dim1
dim1
 |_val1
 |_val2
 |_val3

# Gather all values of dim2 based on dim1
dim1
 |_val1
   |_dim2
    |_val1
    |_val2
    |_val3
 |_val2
   |_dim2
    |_val1
    |_val2
    |_val3
 |_val3
   |_dim2
    |_val1
    |_val2
    |_val3

# Collect all values of dim2 and get values of dim3 with metrics
dim1
 |_val1
   |_dim2
    |_val1         metric
      |_dim3       ------
        |_val1     xx,xxx
        |_val2     xx,xxx
        |_val3     xx,xxx
    |_val2
      |_dim3
        |_val1     xx,xxx
        |_val2     xx,xxx
        |_val3     xx,xxx
    |_val3
      |_dim3
        |_val1     xx,xxx
        |_val2     xx,xxx
        |_val3     xx,xxx
etc...

The new version works one dimension level combination at a time:

# Gather all values of dim1
dim1
 |_val1
 |_val2
 |_val3

# Gather all values of dim2 for dim1:val1
dim1
 |_val1
  |_dim2
    |_val1
    |_val2
    |_val3
 |_val2
 |_val3

# Gather all values of dim3 for (dim1:val1, dim2:val1)
dim1
 |_val1
  |_dim2
    |_val1
      |_dim3       metric
        |_val1     xx,xxx
        |_val2     xx,xxx
        |_val3     xx,xxx
    |_val2
    |_val3
 |_val2
 |_val3

Each level is also responsible for tacking on the name of the dimension it was filtered by. This is a nice bit of encapsulation that simplifies post-processing the data.
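A minimal sketch of the depth-first flow, assuming a stand-in query function instead of real API calls. fake_query and get_rows are hypothetical names (the PR's recursive function is get_req_data, whose signature is not shown here), and the sketch reads "tacking on" as adding a column named after the dimension.

```r
# Hypothetical stand-in for the API: returns two values and a metric for one
# dimension. A real query would use `parents` to build breakdown filters.
fake_query <- function(dimension, parents) {
  data.frame(
    value = paste0(dimension, ":val", 1:2),
    metric = c(10, 20),
    stringsAsFactors = FALSE
  )
}

# Depth-first: finish one combination of parent values before the next.
get_rows <- function(dimensions, parents = list(), query = fake_query) {
  current <- dimensions[[1]]
  res <- query(current, parents)
  names(res)[names(res) == "value"] <- current

  # Deepest level: keep the metrics.
  if (length(dimensions) == 1) {
    return(res)
  }

  pieces <- lapply(res[[current]], function(val) {
    child <- get_rows(
      dimensions[-1],
      c(parents, setNames(list(val), current)),
      query
    )
    # This level adds the value it filtered by, under its own dimension name.
    child[[current]] <- val
    child[c(current, setdiff(names(child), current))]
  })
  do.call(rbind, pieces)
}

tbl <- get_rows(c("dim1", "dim2", "dim3"))
```

Because every level attaches its own column before returning, the final table needs no separate pass to work out which parent values each row belongs to.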

Related Issue

Issue #100

How Has This Been Tested?

I tested this thoroughly, but not exhaustively (what would it mean to exhaustively test this anyway?).

I ran a battery of generated queries with every combination of four dimensions, four metrics, and three segments, which covered:

I also tried queries with multiple types of date ranges, including dates, numerics, character strings, and POSIXcts with timespans specific to a few hours.

Throughout the whole testing process I compared the results to Adobe Workspace.

Types of changes

Checklist: