frictionlessdata / data-quality-dashboard

Data Quality Dashboards display statistics on a collection of published data.
Other
33 stars 10 forks source link

Bug with tables data and display #21

Closed hdurand closed 9 years ago

hdurand commented 9 years ago

On the main page with the publishers table and on the publisher page with the sources table, sometimes this happens with the tables:

Example with the publishers table on the main page: tablebug

hdurand commented 9 years ago

It comes from an error with CSV parsing here: https://github.com/okfn/spend-publishing-dashboard/blob/master/dashboard/src/scripts/utils/APIUtils.js#L20

Pull request #22 fixes it.

hdurand commented 9 years ago

Let's take a closer look at the CSV parsing.

The code parses 4 CSV files: publishers.csv, sources.csv, results.csv and runs.csv. Let's use those data for our example: https://github.com/okfn/spd-data-uk-cabinet-office/tree/master/data

The CSV files are parsed here: https://github.com/okfn/spend-publishing-dashboard/blob/master/dashboard/src/scripts/utils/APIUtils.js#L20

The code is as follows:

var parse = require('csv-parse');
var parseParams = {columns: true};

function getCSVEndpoint(endpoint, name) {
    request
        .get(endpoint)
        .end(function(error, response) {
            parse(response.text, parseParams, function(error, output) {
                var raw = output;
                ActionCreators.receiveAll(raw, name);
            });
        });
}

The csv-parse module is a parser converting CSV text input into arrays or objects (see documentation here ). With the option {columns: true} the parser autodiscovers the fields in the first CSV line and returns an array of objects.

Let's take a look at the array of objects returned by the parser by adding a console.log() in the code like this:

var parse = require('csv-parse');
var parseParams = {columns: true};

function getCSVEndpoint(endpoint, name) {
    request
        .get(endpoint)
        .end(function(error, response) {
            parse(response.text, parseParams, function(error, output) {
                console.log(output); // shows output in the web console
                var raw = output;
                ActionCreators.receiveAll(raw, name);
            });
        });
}

When we load the main page of our app we can now see the following elements in the web console:

bug1arrays

Those 4 arrays of objects correspond to the 4 parsed CSV files. Let's look at the two first objects of the first array:

bug1array1

This array corresponds to sources.csv here: https://github.com/okfn/spd-data-uk-cabinet-office/blob/master/data/sources.csv

Each object is a CSV row. There are 53 objects corresponding to 53 CSV rows. The keys are the fields in the first CSV line. If we compare this to the file everything seems to be well parsed.

Let's now look at the second array:

bug1array2

The keys of this object are the same as those from sources.csv (with an additional undefined key). The values correspond to the first line of results.csv here: https://github.com/okfn/spd-data-uk-cabinet-office/blob/master/data/results.csv

So the parser seems to use the fields of the first parsed file (here sources.csv) while parsing results.csv. Also, there is just one object here but 53 rows in results.csv.

Let's look at the third array:

bug1array3

It looks like the parser is parsing runs.csv: https://github.com/okfn/spd-data-uk-cabinet-office/blob/master/data/runs.csv

It uses again some keys from the first parsed file (sources.csv).

Let's look at the last array:

bug1array4

The same thing happens with publishers.csv: https://github.com/okfn/spd-data-uk-cabinet-office/blob/master/data/publishers.csv

Now we load the page again. This time we can see this in the console:

bug2arrays

The first array is like this:

bug2array1

This time results.csv is parsed first and everything seems to be good for this file.

Let's look at the second array:

bug2array2

The parser parses publishers.csv. Again, the parser seems to use the fields of the first parsed file (here results.csv) while parsing publishers.csv.

The same thing happens with the third array (sources.csv):

bug2array3

And again with the last array (runs.csv):

bug2array4

If we load the page again, we will see the same problem. The 4 CSV files can be parsed in a different order each time, so the arrays of objects returned can be different. But the problem remains the same.

Now let's change the code for this:

var parse = require('csv-parse');
// we don't use the parseParams variable anymore
// var parseParams = {columns: true};

function getCSVEndpoint(endpoint, name) {
    request
        .get(endpoint)
        .end(function(error, response) {
            // the option to autodiscover the fields in the first CSV line is now given directly as a parameter
            parse(response.text, {columns: true}, function(error, output) {
                console.log(output); // shows output in the web console
                var raw = output;
                ActionCreators.receiveAll(raw, name);
            });
        });
}

When we load the page we can see those elements in the web console:

fix1arrays

The first array is:

fix1array1

publishers.csv seems to be well parsed.

The second array is:

fix1array2

Now the second parsed file, runs.csv, seems to be well parsed. The keys correspond to the fields in runs.csv. The values are the good ones. The number of objects correspond to the number of rows.

The third array is:

fix1array3

The third parsed file, results.csv, seems to be well parsed too.

The last array is:

fix1array4

sources.csv seems to be well parsed.

If we load the page again, the files are always well parsed. So the problem seems to be fixed.

Note that when this CSV parsing bug occurs, the structure of the parsed data is broken. Several bugs result from it:

However, the first file is always well parsed. So when we look at the web page, we can sometimes think that the parsing went well because the page displays data from the file that have been parsed first. But the above tests show the parsing bug occurs each time.

pwalsh commented 9 years ago

Hi @hdurand

Nice investigation. Seeing your commit and following through the notes above, it still seems weird to me that the problem is solved (I'm not saying it is not: I just don't understand it).

But, I do see something else in your screenshots that could be related (or maybe not):

There message from react library there in your screenshots about child objects in an array having a key property. the warning from React gives this link: http://facebook.github.io/react/docs/multiple-components.html#dynamic-children.

I see I did set a key property here: https://github.com/okfn/spend-publishing-dashboard/blob/master/dashboard/src/scripts/components/tables/SourceTable.react.js#L29

and here: https://github.com/okfn/spend-publishing-dashboard/blob/master/dashboard/src/scripts/components/tables/PublisherTable.react.js#L28

But there may be somewhere else that I haven't set?

Anyway, good job diving into the code :).

hdurand commented 9 years ago

Hi @pwalsh

About the fix: I can't really explain why this fix works either. When I first found the bug, I removed the {columns: true} option. The parser then returned an array of arrays for each file and I converted it in an array of objects. It worked well.

Then I tried to replicate the bug with the original code except that I added directly the option in parameter. That's how I discovered it works that way (lucky change).

The last solution being simpler I dropped my first fix for this simpler one.

About the warning: Good point about the warning. I've just checked that and I think it's a consequence of the bad parsing.

The warning says: Each child in an array should have a **unique** key prop.

Here: https://github.com/okfn/spend-publishing-dashboard/blob/master/dashboard/src/scripts/components/tables/PublisherTable.react.js#L28

we have <tr key={obj.name}> but when the publishers file is not well parsed, there is not necessarily a name key for the object obj. This results in <tr> not having a key and its <td> children not having unique keys. However, sources.csv has a field name. So when the sources file is well parsed (and the publishers file is not) there is a name key for the object obj and <tr> and its <td> children have valid unique keys. So when the publishers file or the sources file are well parsed, there is no warning.

This explanation seems coherent with the above examples: