WillSullivan / idownvotedbecause

I downvoted because...
http://idownvotedbecau.se
MIT License

suggest: images of data #100

Open r2evans opened 6 years ago

r2evans commented 6 years ago

Strongly related to "images of code": it is equally frustrating to have data (for testing or demonstration) included as an image instead of something copy/paste-able. The wording could be similar:

Why this is a problem

One of the first things we do when examining how code deals with or modifies data is copy and paste it into our console (or text editor). This might be a convenience (helpful to test for certain constraints in the requirements) or downright critical (e.g., when your code produces unexpected output).

This might be compounded by the difference between how data is represented and how data is stored internally. For instance, in some languages the representation on the console is simplified, perhaps even masking/hiding certain attributes that are critical to understanding how functions or code interact with it. In many cases, a string may look close enough to the language's representation of an object that a static string is mistaken for a native object.

For example, in Python, it is usually clear that an object is a string rather than a proper datetime object:

In [1]: import datetime
In [3]: a
Out[3]: datetime.datetime(2018, 9, 8, 21, 57, 36, 699530)
In [5]: b
Out[5]: 'datetime.datetime(2018, 9, 8, 21, 47, 7, 226554)'

made apparent by the enclosing single quotes (indicating a string). However, in R, it is less obvious:

a
# [1] "2018-09-08 21:48:16 PDT"
class(a)
# [1] "POSIXct" "POSIXt" 
b
# [1] "2018-09-08 21:48:07 PDT"
class(b)
# [1] "character"
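The same ambiguity can show up in Python as soon as values are printed rather than echoed at the prompt. A minimal sketch, using hypothetical values chosen to mirror the R output above: `print()` uses `str()`, which drops the quotes, so the two objects look identical even though their types differ.

```python
import datetime

# Hypothetical values for illustration only.
a = datetime.datetime(2018, 9, 8, 21, 48, 16)  # a real datetime object
b = "2018-09-08 21:48:16"                      # a plain string

# print() shows str(), which hides the difference between the two.
print(a)
print(b)

# repr() and type() reveal what each object really is.
print(repr(a), type(a).__name__)
print(repr(b), type(b).__name__)
```

A screenshot of the first two printed lines would not let a reader tell which object is which.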

What to do next

Edit the question. Please include sample data that can be easily copied and used. For most languages, this might be a fixed-width representation of the data, while some languages might have stricter requirements. At times the sample data can be created on the fly, meaning programmatically with the language's own functions for creating that data type, instead of being read in from a data file.
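As one possible way to produce copy-able sample data in Python, the sketch below uses the standard library's `pprint.pformat`, which builds on `repr()` and so keeps type information in the text. The rows and their field names are hypothetical, invented only for this illustration.

```python
import datetime
from pprint import pformat

# Hypothetical sample rows, small enough to paste into a question.
rows = [
    {"id": 1, "when": datetime.datetime(2018, 9, 8, 21, 48, 16), "label": "a"},
    {"id": 2, "when": datetime.datetime(2018, 9, 8, 21, 48, 7), "label": "b"},
]

# This text is what would be pasted into the question.
snippet = "rows = " + pformat(rows)
print(snippet)

# Round trip: running the pasted text rebuilds equal objects,
# provided the reader has imported datetime first.
namespace = {"datetime": datetime}
exec(snippet, namespace)
restored = namespace["rows"]
print(restored == rows)
```

Anyone answering can paste the printed snippet after `import datetime` and start from the same objects, types included.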

pinobatch commented 4 years ago

This can become tricky if the sample data exceeds the 30K question size limit, as can easily happen with database or statistics questions.

r2evans commented 4 years ago

@pinobatch I don't disagree that that corner case is problematic. But in my experience (which is likely just a small speck in the big scheme), it is also the exception.

But I don't think that that alone is a problem, for a couple of reasons:

  1. If your problem truly needs so much sample data to demonstrate the problem, then perhaps pastebin or some other large-file-sharing method can be used. But this should still be used to augment, not replace, having sample copy-able data in the question. I think it's a fair thing to say in the question something like:

    My data looks like *this*:
    #                mpg cyl disp  hp drat    wt  qsec vs am gear carb
    # Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    # Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    # Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    
    but since the data is 2B rows, the full dataset can be found *here* (some link).
  2. I think most rules have rational exceptions. I think leaning towards copy-able data as a default and first-attempt is much preferred. There are definitely questions where I find a picture of the data is sufficient to get the point across, but I find it rare that I can provide an answer to questions like that without having to generate my own data to replace that which I don't have readily available (lacking the .NORM file format).
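To illustrate why even a few rows of text are worth more than a picture, here is a rough Python sketch that turns the excerpt quoted above back into usable values. It assumes the layout shown there: a leading `#`, row names that may contain spaces, and purely numeric columns. The parsing approach is only one option.

```python
# The excerpt as it would be copied out of the question.
pasted = """\
#                mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
"""

# Drop the leading "# " and any trailing whitespace from each line.
lines = [line.lstrip("# ").rstrip() for line in pasted.splitlines()]
header = lines[0].split()

rows = {}
for line in lines[1:]:
    # Row names contain spaces, so split the numeric fields off from the right.
    name, *values = line.rsplit(maxsplit=len(header))
    rows[name.strip()] = dict(zip(header, map(float, values)))

print(rows["Mazda RX4 Wag"]["wt"])
```

With an image of the same table, every one of those values would have to be retyped by hand.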

joanise commented 3 years ago

I keep wanting to link here when a user posts a screen capture of a dataframe. +1 for adding this images of data page.