floledermann / mapmap.js

A data-driven API for interactive thematic maps
GNU Affero General Public License v3.0
113 stars 12 forks source link

Handle undefined, empty and invalid field values correctly #23

Closed wahlatlas closed 8 years ago

wahlatlas commented 9 years ago

In case data is not present for a given area, that area gets colored as if it belonged to the lowest class, which is wrong. This is different from missing value handling (#20). Imagine election results coming in during the vote count: The map must show where there is no data (yet), e.g. a grey or otherwise defined background color. "Missing value" on the other hand usually is an encoded value (e.g. -99) and indicates that for reasons within the data there will be no value.

See this example where I deliberately deleted a third of the data in the csv. http://vis.uell.net/mapmap/test_026incomplete.html The only way to find out is that the .hoverInfo tooltip stays empty.

floledermann commented 9 years ago

I have added two new metadata fields, undefinedLabel and undefinedColor to define textual and color representation of undefined values. By default, the color is transparent and the text value is "undefined" - if you want the previous behavior (i.e. value not showing up at all in hoverInfo) then you can set undefinedLabel to null.

This is in the master branch now and also in the documentation, and will be in the next release.

wahlatlas commented 8 years ago

The undefined metadata fields are a very welcome improvement, and show that the whole idea of how to set up the metadata was very well architected in the first place. However I'd like to point out why I think missing values are something different and that the undefined value handling doesn't yet cover the most common data issues. Consider this data set

id,city,data,moredata
xx,Wien,8,5
yy,Linz,,42
zz,Graz,23,-999

Now, as far as I can see, the undefinedcase as currently implemented would cover cities like Klagenfurt or else, that are in the map but not in the data. Yet here we have two other cases that are quite common in real world data: For Graz the data editor deliberately put in a numeric value (-999) that indicates something is wrong here.

On the other hand, in the Linz case, there is some data present (id, city, moredata), but vor the 'data' variable, no value is present. I can only guess that the way rowFileHandlerworks, is the reason, data gets the Value '0' in this case, when in fact it should become undefined.

One could argue that the empty data fields should be edited with undefined values, but in larger datasets (e.g. Germany's 11 thousand municipalities) it is easy to overlook such cases and mapmap.js is probably aimed at people who don't analyse their data with R before (map)mapping it.

floledermann commented 8 years ago

Thanks for going into the details of this and for the (undeserved) compliment - personally I am not very happy with just accumulating "special" stuff in the metadata, and I currently believe things like the color scale should actually go into options to the symbolization function(s). But that is a slightly different topic and would go into a refactoring of the symbolization pipeline scheduled for v0.3. (#25)

I absolutely agree with you that missing field values in the CSV file should be interpreted as undefined and not as zero. I will look into this and reopen the issue for that purpose. It is a limitation of the CSV file format that we cannot distinguish between the empty string and undefined, but in any case it should never be interpreted as 0. In case empty strings are needed (e.g. for ordinal scales), "" can be set as undefinedValue and if a distinction between empty strings and undefined would be needed this could not be mapped to the CSV file format anyway - in this case the workaround would be to use another special value like "-" for undefined values.

I am not sure about these "special" values for numeric fields however - it feels a lot like a "hack" to overcome the limitations of CSV files. But I already learned from studying real world map legends (and when refactoring the legend generation code in mapmap recently) that actually legends often use "nested" scales even for simple choropleth maps, with a numeric/quantizing scale as the main scale, but a "higher-level" ordinal scale for mapping undefined values and possibly other special cases ("not applicable" "under investigation" etc.) - this goes beyond the current capabilities of d3 scales and therefore would need a refactoring of the scale code and probably our own scales implementation.

wahlatlas commented 8 years ago

Ok, I see why that 'special' value idea doesn't fit the d3 scale model. Let me nevertheless extend the undefined issue here

id,city,data,moredata
xx,Wien,8,5
yy,Linz,,42
rr,Salzburg,-,23

In the case of Salzburg the value of data '-' is not taken into consideration by mapmap.js, the fill for it is transparent and it doesn't influence the legend. However undefinedColorwill not be respected in this case even if defined.

floledermann commented 8 years ago

I see the point. I am currently following this approach: attempt to convert strings that strictly look like a number (no extra non-numeric characters, no "", so more strict than parseInt() or +val) to numbers in the CSV reading code, and empty strings to null (but let the user override the parsing function for special cases). Later on, in the symbolization and legend generation phase, find out if we have a numeric scale and if this is the case, treat anything that is not strictly a number (including number-like strings like "0" etc.) as invalid values.

This way, we could theoretically distinguish even 4 special cases:

I am tending to treat all four cases as undefinedValue for the short time, and come up with a proper solution for feeding all these into an ordinal scale "in front" of the numeric scale to map to potentially different representations.

I would never have thought this will get so complicated, and neither JS's barely existing and terribly fragile type system nor D3's quite limited and opaque scale model are helping, unfortunately. In the future we may need to provide our own, extended scale framework (quite sure about this) and may even think about adding type information / field converters to the metadata (not sure if I want to go down that route, though...)

floledermann commented 8 years ago

OK with the last commit all 4 cases (missing-, undefined-, null-values or non-numeric values for numeric scales) should be correctly treated as undefined in the choropleth symbolizer, hoverInfo and the legend. I don't think it makes much sense to go beyond this with the current scale system, as I already have to resort to some hacks to accomplish that. Can you verify this with your data? (You should be able to use the build mapmap.js in the master branch as a drop-in replacement for your version.)

Please refer to #25 for future enhancements or open a new issue for more advance handling of special values.

wahlatlas commented 8 years ago

Yup, works like a charm now, just tested it. This is a very sensible solution, thanks so much and I can understand you want to devote your time on other aspects of mapmap. It might be nitpicking – and I am not even sure how important it is – but the legend for undefined values gets always drawn, once you've provided metadata for it, even if the current map doesn't contain undefined values.

Will keep an eye on the new legend customization.

floledermann commented 8 years ago

You can hide the undefined legend cell if no undefined values are present through CSS:

.legendCell.undefined[data-count="0"] {
    display: none;
}

:)

It was a conscious decision to always output the legend cell for all classes defined in the metadata, as for maps where you can switch the visualized field it may be nicer to have all legend items available at all times for a more stable layout.

This technique can be generally applied to legend cells btw., e.g.

.legendCell[data-count="0"] {
    opacity: 0.5;
}

to fade out cells with no corresponding values as we did on this map.

And don't stop the nitpicking, although, yes, it's a PITA to work on stuff like this I believe the basics are important for a successful library, and for me it's impossible to think about all use cases. Thanks for taking the time to test!

wahlatlas commented 8 years ago

This is very interesting and flexible what you can do here in css, I already successfully tried it out.

Upon testing the latest commits for #23 I see an issue that undefined values seem to harm the quantile scale. A formerly quantile scale with 5 colors still shows 5 colors but the scale differentiates only between the max value and everything else now when undefined values are present.

quantile

Furthermore the histogram count seems to be off for undefined values, see the last example in

/mapmap-examples-master/test/legends.html

where you have 3 districts with no data but the histogram count shows 0 although the histogram bar seems quite right

histocount

floledermann commented 8 years ago

OK the problem in the mapmap-examples seems to be that I forgot to commit the latest mapmap version to be incuded in the examples repo - can you please check, this should work now.

Regarding quantile scales, I consider quantile scales to be kind of unsupported, as they are intended to be fed some actual data as the domain: "the input domain is specified as a discrete set of values" (D3 docs), which will not have been loaded yet when you define the metadata. (This is also related to #12 - eventually we should support specifying the domain with a callback that will be called once the data is loaded to support such dynamic domains). But still I tried to set up a dummy quantile scale, passing it some hardcoded values in the domain, and it worked without problems even with the undefined values file, so can you check with the latest version from master?

If you indeed have a testcase where the quantile scale doesn't work as expected, can you please post a test case, possibly in a new issue as this one is already getting quite long... You can post your test case as a pull request against the examples-repo if you want.

wahlatlas commented 8 years ago

This is all very good and you've fixed everyhing that I could wish for:

The histogram count in test/legends.html indeed now works as expected.

Furthermore the quantile scale also works with undefined values now in my current test cases. This is indeed important as it is more or less the default value per your API Documentation (I can't quite figure out if quantile and quantize do make any difference)