datacleaner / DataCleaner

The premier open source Data Quality solution
GNU Lesser General Public License v3.0

Change .analysis.xml format to allow "multi-stream" components and independent output streams. #689

Open kaspersorensen opened 9 years ago

kaspersorensen commented 9 years ago

With the prospect of adding a "multi stream" component in #620 we encounter the problem that we get "diamond shaped" component graphs and our current .analysis.xml format only supports "hierarchy shaped" graphs.

Today every output-data-stream is represented as a job that is a child of a component, and a component is a child of a job. Instead, we would need to allow some components to exist in multiple streams. And in the future I can very well see that we would also need to represent multiple streams at the outermost level (for instance two datastores, or, one rainy day, to properly model two tables from the same datastore as separate streams in the same job).

My suggestion (off the top of my head) is to preserve compatibility of the existing format but keep it only as a shorthand version of a format like this:

<job>
 <stream id="...">
  <source>...</source>
  <transformation>...</transformation>
  <analysis>...</analysis>
 </stream>
 <stream id="...">
  <source>...</source>
  <transformation>
    ...
    <transformer ref="foo" />
    ...
  </transformation>
  <analysis>...</analysis>
 </stream>
 <components>
  <component id="foo">...</component>
 </components>
</job>

Quite a big change ... Maybe we start by adding only the lower part, the "components" tag, to allow shared components. The stream wrapping we can maybe make a separate story.

LosD commented 9 years ago

This seems to be a bit of an incomplete idea, and it seems the hard part is actually what goes where. I'm not sure what you want in the component part? The more I think about it, the more questions I see:

  1. How do we avoid endless chains? We can do it programmatically, but it would be really nice if the schema explicitly disallowed it (unfortunately, I'm not sure that is possible without a hugely complex schema... Or at all).
  2. If the definition is in the <components> part, and the transformation is just a reference (as <transformer ref="foo" /> seems to suggest), how would you define which columns of which stream should be consumed by the transformer? If you do it on the transformer, the definition of the stream's inputs and outputs is not contained within the stream, which would be rather confusing.
  3. What should the stream source contain? How does it reference an output?

My suggestion would be a simpler approach: Make a new fusers element that contains components which are allowed to reference output data streams, but otherwise behave like normal components. It will require source IDs to be unique across the job, but I think that problem is already there; at least I remember an issue that was solved by adding some form of counting IDs to columns.
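A uniqueness check of that kind can be sketched roughly as follows. This is an illustration in Python rather than the project's Java; the function name `duplicate_ids` and the sample job are hypothetical, and it assumes that IDs live on `column` and `output` elements as in the example job in this thread.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}column'."""
    return tag.rsplit("}", 1)[-1]


def duplicate_ids(job_xml):
    """Return the sorted column/output IDs that occur more than once
    anywhere in the job, including nested output-data-stream jobs."""
    root = ET.fromstring(job_xml)
    counts = Counter(
        el.get("id")
        for el in root.iter()
        if local(el.tag) in ("column", "output") and el.get("id") is not None
    )
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical sample: the nested stream reuses the ID "col_fn".
SAMPLE = """
<job>
  <source>
    <columns>
      <column id="col_fn" path="PUBLIC.EMPLOYEES.FIRSTNAME"/>
      <column id="col_ln" path="PUBLIC.EMPLOYEES.LASTNAME"/>
    </columns>
  </source>
  <transformation>
    <transformer>
      <input ref="col_fn"/>
      <input ref="col_ln"/>
      <output id="col_fullname0"/>
      <output-data-stream name="Complete rows">
        <job>
          <source>
            <columns>
              <column id="col_fn" path="FIRSTNAME"/>
            </columns>
          </source>
        </job>
      </output-data-stream>
    </transformer>
  </transformation>
</job>
"""

print(duplicate_ids(SAMPLE))  # ['col_fn']
```

Because the check walks the whole tree, it treats the job and all nested stream jobs as one ID space, which is what cross-stream references would need.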

In theory, we could allow all components to reference all output streams (schema-wise), but I like that it's limited to the special fuse cases, as it makes them more visible.

An example (extension of example-job-output-dataset.analysis.xml). Sorry about the size, but it needs to be reasonably complete to make sense. The new element is at the end of the file:

<?xml version="1.0" encoding="UTF-8"?>
<job xmlns="http://eobjects.org/analyzerbeans/job/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <source>
        <data-context ref="my database" />
        <columns>
            <column id="col_fn" path="PUBLIC.EMPLOYEES.FIRSTNAME" />
            <column id="col_ln" path="PUBLIC.EMPLOYEES.LASTNAME" />
            <column id="col_supervisor0" path="PUBLIC.EMPLOYEES.REPORTSTO" />
        </columns>
    </source>

    <transformation>
        <transformer>
            <descriptor ref="Concatenator" />
            <properties>
                <property name="Separator" value=" " />
            </properties>
            <input ref="col_fn" />
            <input ref="col_ln" />
            <output id="col_fullname0" />
        </transformer>
    </transformation>

    <analysis>
        <analyzer requires="_any_">
            <descriptor ref="Completeness analyzer" />
            <properties>
                <property name="Conditions" value="[NOT_BLANK_OR_NULL,NOT_BLANK_OR_NULL,NOT_BLANK_OR_NULL]" />
                <property name="Evaluation mode" value="ALL_FIELDS"/>
            </properties>
            <input ref="col_fullname0" />
            <input ref="col_supervisor0" />
            <output-data-stream name="Complete rows">
                <job>
                    <source>
                        <columns>
                            <column id="col_supervisor1" path="REPORTSTO"/>
                            <column id="col_fullname1" path="Concat of FIRSTNAME,LASTNAME"/>
                        </columns>
                    </source>
                    <analysis>
                        <analyzer>
                            <descriptor ref="String analyzer"/>
                            <input ref="col_fullname1" />
                        </analyzer>
                        <analyzer>
                            <descriptor ref="Number analyzer"/>
                            <input ref="col_supervisor1" />
                        </analyzer>
                    </analysis>
                </job>
            </output-data-stream>
            <output-data-stream name="Incomplete rows">
                <job>
                    <source>
                        <columns>
                            <column id="col_supervisor2" path="REPORTSTO"/>
                            <column id="col_fullname2" path="Concat of FIRSTNAME,LASTNAME"/>
                        </columns>
                    </source>
                    <analysis>
                        <analyzer>
                            <descriptor ref="String analyzer"/>
                            <input ref="col_fullname2" />
                        </analyzer>
                        <analyzer>
                            <descriptor ref="Number analyzer"/>
                            <input ref="col_supervisor2" />
                        </analyzer>
                    </analysis>
                </job>
            </output-data-stream>
        </analyzer>
    </analysis>

    <fusers>
        <transformer>
            <descriptor ref="Coalesce multiple fields"/>
            <properties>
                <property value="true" name="Consider empty string as null"/>
                <property value="[[REPORTSTO,REPORTSTO],[Concat of FIRSTNAME%2CLASTNAME,Concat of FIRSTNAME%2CLASTNAME]]" name="Units"/>
            </properties>
            <input ref="col_supervisor1"/>
            <input ref="col_supervisor2"/>
            <input ref="col_fullname1"/>
            <input ref="col_fullname2"/>
            <output id="col_reportsto" name="__reportsto_or_reportsto"/>
            <output id="col_concat" name="__concat_of_firstname_lastname_or_concat_of_firstname_lastname"/>
            <output-data-stream name="Fused rows">
                <job>
                    <source>
                        <columns>
                            <column id="col_reportsto" path="__reportsto_or_reportsto"/>
                            <column id="col_concat" path="__concat_of_firstname_lastname_or_concat_of_firstname_lastname"/>
                        </columns>
                    </source>
                    <analysis>
                        <analyzer>
                            <descriptor ref="String analyzer"/>
                            <input ref="col_concat" />
                        </analyzer>
                        <analyzer>
                            <descriptor ref="Number analyzer"/>
                            <input ref="col_reportsto" />
                        </analyzer>
                    </analysis>
                </job>
            </output-data-stream>
        </transformer>
    </fusers>
</job>

Another alternative could be a more GraphML-like approach with edges (connections between input and output columns) and nodes/ports (components/columns), but that would be a much bigger change from what we have now... Although I think that I like this approach best, for its clarity and extensibility. However, it might be even harder to control.

kaspersorensen commented 8 years ago

Quite valid points by @LosD. For the moment I don't think we can include this in DC 4.5, since the changes are too fundamental and there is still too much discussion on how to do it.

But that said, I will add to the discussion (a little bit):

Having now worked a bit more on #620 I realized that I can actually persist the fuse component in XML as long as it only works on source tables. It will end up approximately like this (notice the column paths are from two different tables):

<job>
<source>
 <column id="col_a" path="customers.a" />
 <column id="col_b" path="employees.b" />
</source>
<transformation>
  <transformer>
    ...
    <input ref="col_a" />
    <input ref="col_b" />
  </transformer>
  ...
</transformation>
</job>

In the example, "col_a" and "col_b" are referred to regardless of their availability in a particular record format and stream. But the fact that they are both referenced means that this component is shared by both processing streams. Maybe we can do something similar. I don't have the magic bullet yet, but just wondering: Could we simply always place such a component in the root <job> element, and then refer to column IDs that may also originate from output data streams? That way we could retain the current XSD for the jobs.
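To see which streams a root-level component would end up belonging to under that idea, a reader of the XML could resolve each referenced column ID back to the stream that declares it. The following is a rough Python sketch, not DataCleaner code: `index_columns`, `stream_membership`, the `(root)` label, and the sample job are all hypothetical, and it assumes column IDs are unique across the job.

```python
import xml.etree.ElementTree as ET

ROOT = "(root)"  # hypothetical label for the outermost job's stream


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}column'."""
    return tag.rsplit("}", 1)[-1]


def index_columns(el, stream, index):
    """Map every column/output ID to the stream in which it is declared."""
    for child in el:
        name = local(child.tag)
        if name == "output-data-stream":
            # Everything below belongs to the named output data stream.
            index_columns(child, child.get("name"), index)
        else:
            if name in ("column", "output") and child.get("id"):
                index[child.get("id")] = stream
            index_columns(child, stream, index)
    return index


def stream_membership(job_xml):
    """For each root-level component, list the streams its inputs come from."""
    root = ET.fromstring(job_xml)
    index = index_columns(root, ROOT, {})
    result = []
    for section in root:
        if local(section.tag) not in ("transformation", "analysis"):
            continue
        for comp in section:
            descriptor = next(
                (c.get("ref") for c in comp if local(c.tag) == "descriptor"), None
            )
            refs = [c.get("ref") for c in comp if local(c.tag) == "input"]
            streams = sorted({index[r] for r in refs if r in index})
            result.append((descriptor, streams))
    return result


# Hypothetical sample: a root-level coalesce that consumes columns
# from two output data streams of a root-level analyzer.
SAMPLE = """
<job>
  <source>
    <columns>
      <column id="col_supervisor0" path="PUBLIC.EMPLOYEES.REPORTSTO"/>
    </columns>
  </source>
  <transformation>
    <transformer>
      <descriptor ref="Coalesce multiple fields"/>
      <input ref="col_supervisor1"/>
      <input ref="col_supervisor2"/>
      <output id="col_reportsto"/>
    </transformer>
  </transformation>
  <analysis>
    <analyzer>
      <descriptor ref="Completeness analyzer"/>
      <input ref="col_supervisor0"/>
      <output-data-stream name="Complete rows">
        <job><source><columns>
          <column id="col_supervisor1" path="REPORTSTO"/>
        </columns></source></job>
      </output-data-stream>
      <output-data-stream name="Incomplete rows">
        <job><source><columns>
          <column id="col_supervisor2" path="REPORTSTO"/>
        </columns></source></job>
      </output-data-stream>
    </analyzer>
  </analysis>
</job>
"""

for descriptor, streams in stream_membership(SAMPLE):
    print(descriptor, streams)
```

A component whose inputs resolve to more than one stream, or to a stream other than `(root)`, is one that only looks like it belongs to the outermost job.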

It wouldn't solve the "graph syntax" concerns, but honestly we already have that problem in other cases, where you can create a cyclic job from a few transformer output/input combinations. I think such conflicts can more easily be detected in the Java code that reads the XML.
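Such a programmatic check could look roughly like the sketch below, written in Python for brevity rather than in the Java reader itself. The function `has_cycle` and both sample jobs are hypothetical; it assumes a component depends on whichever component declares the `output` ID that one of its `input` refs points to, and it only looks at `transformer` and `analyzer` elements.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}input'."""
    return tag.rsplit("}", 1)[-1]


def has_cycle(job_xml):
    """Return True if components feed each other in a loop via input/output IDs."""
    root = ET.fromstring(job_xml)
    comps = [el for el in root.iter() if local(el.tag) in ("transformer", "analyzer")]

    # Which component produces each output column ID (direct children only).
    producer = {}
    for n, comp in enumerate(comps):
        for child in comp:
            if local(child.tag) == "output" and child.get("id"):
                producer[child.get("id")] = n

    # deps[n] = components whose outputs component n consumes.
    deps = {n: set() for n in range(len(comps))}
    for n, comp in enumerate(comps):
        for child in comp:
            if local(child.tag) == "input" and child.get("ref") in producer:
                deps[n].add(producer[child.get("ref")])

    # Depth-first search with three colours; reaching a grey node means a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    color = [WHITE] * len(comps)

    def visit(n):
        color[n] = GREY
        for m in deps[n]:
            if color[m] == GREY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in range(len(comps)))


# Hypothetical cyclic job: each transformer consumes the other's output.
CYCLIC = """
<job>
  <transformation>
    <transformer><input ref="col_b"/><output id="col_a"/></transformer>
    <transformer><input ref="col_a"/><output id="col_b"/></transformer>
  </transformation>
</job>
"""

# Hypothetical acyclic job: source column -> transformer -> analyzer.
ACYCLIC = """
<job>
  <source><columns><column id="col_x" path="t.x"/></columns></source>
  <transformation>
    <transformer><input ref="col_x"/><output id="col_y"/></transformer>
  </transformation>
  <analysis>
    <analyzer><input ref="col_y"/></analyzer>
  </analysis>
</job>
"""

print(has_cycle(CYCLIC), has_cycle(ACYCLIC))  # True False
```

Because the walk covers nested jobs too, the same check would apply unchanged if root-level components were allowed to reference columns from output data streams.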

LosD commented 8 years ago

I don't think it is a problem. My biggest worry is readability, i.e. you can't really tell whether a top-level transformer belongs to the outermost analysis job without studying IDs. But of course, that isn't really so different from many other reference-based approaches.