DeveloperLiberationFront / Spreadsheet-Corpus-Paper


Reprocess scantool metrics for Enron Spreadsheets w/ updated tool #13

Open slankas opened 10 years ago

slankas commented 10 years ago

Files sent to Felienne at 8:20 EST.

Felienne commented 10 years ago

Done! See email. Note that the filenames are different for this one and there are more files, as you had a different strategy for choosing unique files (not based on filenames, as I did).

barik commented 10 years ago

I'm reopening this issue. How difficult would it be to replicate the features that were used in the original EUSES paper? I'll send a longer e-mail with the justification for why this is desirable (one reason is to reduce attack surface, and the other is to replicate the EUSES experiment).

See the list of 14 metrics: http://cse.unl.edu/~grother/papers/weuse05.pdf

Felienne commented 10 years ago

@barik To ease the discussion, here is the list.

  1. Number of input cells (non-empty cells without formulas).
  2. Number of input cells with values of each of the following types: error, boolean, date, non-integer number, integer number, string.
  3. Number of input cells referenced by other cells.
  4. Number of input cells referenced by other cells with values of each of the following types: error, boolean, date, non-integer number, integer number, string.
  5. Number of formula cells.
  6. Number of formula cells that evaluate to a value of each of the following types: error, boolean, date, non-integer number, integer number, string, blank.
  7. Number of formula cells that contain references to other cells.
  8. Number of formula cells that are referenced by other cells.
  9. Number of formula cells that use each of the following functions: sumif, countif, choose, hlookup, index, indirect, lookup, match, offset, if.
  10. Number of formulas that occur only once in a spreadsheet (according to copy/paste semantics).
  11. Number of formulas that occur more than once in a spreadsheet (according to copy/paste semantics).
  12. Number of times the most frequently occurring formula occurs in spreadsheet.
  13. Whether the spreadsheet includes any charts.
  14. Whether the spreadsheet includes any VBA macros.

Felienne commented 10 years ago

What we already have: 5

What I can obtain easily with the existing output: 7, 8, 10, 11, 12

What I can obtain less easily with the existing output: 6 (I need to look into how to distinguish the types and how exactly the EUSES paper did it) and 9 (I need to do a more detailed function analysis)

For which we need a new run: 1, 2, 3, 4 (we do not make the distinction between "input" cells and other cells) and 13, 14 (I have never analyzed these, so it might take a long time to work out how to do it)

Let me know how you want me to proceed based on this info.
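For reference, here is a minimal sketch of how metrics 10 to 12 could be computed, assuming "copy/paste semantics" means that two formulas are the same when they are identical after rewriting every reference relative to the cell that holds it (R1C1-style). The names `to_relative` and `formula_uniqueness` and the `{(row, col): formula}` input shape are hypothetical and not part of the existing scantool output. The sketch also assumes that 10 and 11 count distinct formulas, and it does not handle named ranges or references that appear inside string literals.

```python
import re
from collections import Counter

# A1-style references such as B2, $B2, B$2, $B$2. The lookarounds keep
# function names like LOG10( from being mistaken for cell references.
_REF = re.compile(
    r"(?<![A-Za-z0-9_.])(\$?)([A-Z]{1,3})(\$?)(\d+)(?![A-Za-z0-9_(])"
)


def col_to_index(letters):
    """Convert a column label (A, B, ..., AA) to a 1-based index."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def to_relative(formula, row, col):
    """Rewrite references relative to the cell at (row, col), both 1-based."""
    def repl(match):
        abs_col, letters, abs_row, digits = match.groups()
        ref_col, ref_row = col_to_index(letters), int(digits)
        r = f"R{ref_row}" if abs_row else f"R[{ref_row - row}]"
        c = f"C{ref_col}" if abs_col else f"C[{ref_col - col}]"
        return r + c
    return _REF.sub(repl, formula)


def formula_uniqueness(cells):
    """cells maps (row, col) to an A1-style formula string."""
    counts = Counter(to_relative(f, r, c) for (r, c), f in cells.items())
    return {
        "unique": sum(1 for n in counts.values() if n == 1),      # metric 10
        "repeated": sum(1 for n in counts.values() if n > 1),     # metric 11
        "max_occurrences": max(counts.values(), default=0),       # metric 12
    }
```

With this normalization, `=A1+B1` in C1 and `=A2+B2` in C2 both become `=R[0]C[-2]+R[0]C[-1]` and are counted as one formula occurring twice.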

barik commented 10 years ago

Great. Let's proceed by trying to do the ones that we can do (7, 8, 10, 11, 12). For the metrics that we can't do (maybe because it takes too much effort), we'll just say so in the paper ("Our tool does not make a distinction between input cells and other cells...").

Doing 6 and 9 would be really useful. 9 particularly.
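A rough sketch of what 9 could look like, assuming the formulas are available as plain strings. The function name `function_usage` is hypothetical, and this is a regex approximation rather than a real formula parser: text such as `IF(` inside a string literal would be miscounted. It counts cells, not occurrences, so a formula with nested `IF`s counts once.

```python
import re
from collections import Counter

# The ten functions listed in metric 9.
EUSES_FUNCTIONS = (
    "sumif", "countif", "choose", "hlookup", "index",
    "indirect", "lookup", "match", "offset", "if",
)

# An identifier directly followed by "(" is treated as a function call.
# The lookbehind forces a match to start at the beginning of the name,
# so VLOOKUP( is not counted as LOOKUP( and SUMIF( is not counted as IF(.
_CALL = re.compile(r"(?<![A-Za-z0-9_.])([A-Za-z][A-Za-z0-9.]*)\s*\(")


def function_usage(formulas):
    """Return, per function, the number of formulas that call it at least once."""
    tracked = set(EUSES_FUNCTIONS)
    counts = Counter({name: 0 for name in EUSES_FUNCTIONS})
    for formula in formulas:
        used = {m.group(1).lower() for m in _CALL.finditer(formula)}
        for name in used & tracked:
            counts[name] += 1
    return dict(counts)
```

For example, `function_usage(["=IF(A1,IF(B1,1,2),3)"])` reports `if` as 1 and every other function as 0.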

I'll take a stab at 13 and 14 and we can just join the results. I think I might have a hacky way to detect those by creating an Excel automation object directly in .NET.
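As a point of comparison (this is not the .NET automation approach described above), a sketch for 13 and 14 that works only on Office Open XML files (.xlsx/.xlsm), which are ZIP archives. It assumes the part names Excel conventionally writes: `xl/charts/chartN.xml` for charts and `xl/vbaProject.bin` for VBA. Legacy binary .xls files are not ZIP archives, so the function returns `None` for them and they would still need another route such as automation. The function name `charts_and_macros` is hypothetical.

```python
import zipfile


def charts_and_macros(source):
    """Return (has_chart, has_vba) for an OOXML workbook, or None otherwise.

    source may be a file path or a seekable binary file object.
    """
    if not zipfile.is_zipfile(source):
        return None  # e.g. a legacy .xls file; not handled by this sketch
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
    # Assumption: chart parts live under xl/charts/ with names chartN.xml.
    has_chart = any(
        n.startswith("xl/charts/chart") and n.endswith(".xml") for n in names
    )
    # Assumption: the VBA project is stored at its conventional location.
    has_vba = "xl/vbaProject.bin" in names
    return has_chart, has_vba
```

The two booleans could then be joined with the other metrics on the file name.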

Thanks a lot.

Felienne commented 10 years ago

Okay, I'll try to make a new analysis with 7, 8, 10 to 12 and I'll see about 6 and 9. Probably will be (my) tomorrow.

barik commented 10 years ago

I can do 6 now.
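For the record, a minimal sketch of the kind of type bucketing 6 calls for, assuming the evaluated cell values have already been read into Python objects. The function name `classify_value` is hypothetical. Treating a float such as 3.0 as an "integer number" and recognizing errors by their literal text are assumptions; how the EUSES paper drew these lines is not confirmed here.

```python
import datetime

# Standard Excel error literals, assumed to arrive as plain strings.
ERROR_LITERALS = {
    "#NULL!", "#DIV/0!", "#VALUE!", "#REF!", "#NAME?", "#NUM!", "#N/A",
}


def classify_value(value):
    """Map an evaluated cell value to one of the metric 6 type names."""
    if value is None or value == "":
        return "blank"
    if isinstance(value, bool):  # must come before int: bool subclasses int
        return "boolean"
    if isinstance(value, datetime.date):  # also covers datetime.datetime
        return "date"
    if isinstance(value, int):
        return "integer number"
    if isinstance(value, float):
        # Assumption: 3.0 counts as an integer, 2.5 as a non-integer.
        return "integer number" if value.is_integer() else "non-integer number"
    if isinstance(value, str) and value in ERROR_LITERALS:
        return "error"
    return "string"
```

Counting the results per spreadsheet, for example with `collections.Counter`, gives the seven per-type totals.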

Felienne commented 10 years ago

Awesome! I have 7, 8, 10, 11, 12 and 9. Emailing it now for the Enron set.