DeveloperLiberationFront / Spreadsheet-Corpus-Paper


Create scantool metrics for full Open Source dataset #3

Closed CaptainEmerson closed 10 years ago

CaptainEmerson commented 10 years ago

All 1.5 million of 'em! This bug's important!

Felienne commented 10 years ago

The results in their current form are too big for me to analyze. For starters, PowerPivot cannot open files bigger than 2 gigs.

Can someone split the analyses of this set into (for example) 3 separate groups of files? This has to be done in such a way that each set is self-contained (has spreadsheets and the cells that go with them).

Felienne commented 10 years ago

An alternative, by the way, is to run the scantool again on the files. I have changed the output slightly so that all IDs are unique by construction, which enables me to load all the results into Neo4J, which can process large files. But I do not know how feasible that is.
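For concreteness, the IDs in the excerpts later in this thread have the form spreadsheetID-cellID (e.g., 588751075-2049821326), which makes them unique by construction. A minimal sketch of that scheme; the function name is hypothetical:

```python
# Minimal sketch of IDs that are unique by construction: prefix each
# cell's ID with its spreadsheet's ID, matching the
# "588751075-2049821326" style visible in the cells.csv excerpts below.
# The function name is hypothetical, not the scantool's actual code.
def unique_cell_id(spreadsheet_id: str, cell_id: str) -> str:
    return f"{spreadsheet_id}-{cell_id}"
```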

slankas commented 10 years ago

I can re-process the open source corpus, but that will take 2-3 days. If that's the best direction, I can start this morning.

John


CaptainEmerson commented 10 years ago

Cutting down the result files is easier, no? I know @slankas at least started with a bunch of smaller files, so aggregating them into several ~1GB files shouldn't be hard. Can you easily do that, @slankas?

Otherwise, I can just take the big file and divide it up with Unix split.

slankas commented 10 years ago

I can do the split of the result files. We'd need to be a little smart about the splits to ensure that all rows sharing an ID end up together (i.e., we can't just divide the files into 3 or 4 equal chunks).

Felienne commented 10 years ago

A few files of 1 gig each should be fine, if we can just make sure the keys match (the cells belonging to a spreadsheet are in the same subgroup). So not just cutting all the files into four or something, or we will miscount.
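A minimal sketch of the kind of key-aware split being discussed, assuming one record per line with the spreadsheet ID in the first column (as in the cells.csv excerpts below); file names and shard count are illustrative:

```python
import zlib

N_SHARDS = 3  # "3 separate groups of files", as requested above

def shard_for(spreadsheet_id: str) -> int:
    # crc32 is stable across runs (unlike Python's salted hash()),
    # so every file is routed identically by the same key
    return zlib.crc32(spreadsheet_id.encode()) % N_SHARDS

def split_by_key(in_path: str, out_prefix: str) -> None:
    # assumes one record per line and the spreadsheet ID in column 0,
    # as in the cells.csv excerpts later in this thread
    outs = [open(f"{out_prefix}.{i}.csv", "w") for i in range(N_SHARDS)]
    try:
        with open(in_path) as f:
            for line in f:
                key = line.split(",", 1)[0]
                outs[shard_for(key)].write(line)
    finally:
        for o in outs:
            o.close()

# Routing all result files through the same key keeps each shard
# self-contained (a spreadsheet and its cells land in the same subgroup):
# split_by_key("spreadsheets.csv", "spreadsheets")  # hypothetical file name
# split_by_key("cells.csv", "cells")
```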

CaptainEmerson commented 10 years ago

Go ahead, @slankas

slankas commented 10 years ago

Looks like there are issues in how characters are escaped in the cells.csv file.

The content field (third from the last) in this example has a \ but the next character after that is a 7.

Parsed records:

3 (17): [588751075, 588751075-2049821326, B2, False, , , 1, NUMBERSTRING(H6,1)&""&" \, \"#, ##0\")&\"\"", NUMBERSTRING(R4C6,1)&""&"\, \"#, ##0\")&\"\"", ?93,243, , 0, 0]
4 (17): [588751075, 588751075-2049886862, C2, False, , , 1, NUMBERSTRING(H6,1)&""&" \, \"#, ##0\")&\"\"", NUMBERSTRING(R4C5,1)&""&"\, \"#, ##0\")&\"\"", ?93,243, , 0, 0]
5 (17): [588751075, 588751075-2049428110, D2, False, , , 1, NUMBERSTRING(H6,1)&""&" \, \"#, ##0\")&\"\"", NUMBERSTRING(R4C4,1)&""&"\, \"#, ##0\")&\"\"", ?93,243, , 0, 0]

Original:

588751075,588751075-2049821326,B2,False,,,1,"NUMBERSTRING(H6,1)&\"\"&\" \"&TEXT(H6,\"#,##0\")&\"\"","NUMBERSTRING(R4C6,1)&\"\"&\"\"&TEXT(R4C6,\"#,##0\")&\"\""," \7,393,243",,0,0
588751075,588751075-2049886862,C2,False,,,1,"NUMBERSTRING(H6,1)&\"\"&\" \"&TEXT(H6,\"#,##0\")&\"\"","NUMBERSTRING(R4C5,1)&\"\"&\"\"&TEXT(R4C5,\"#,##0\")&\"\""," \7,393,243",,0,0
588751075,588751075-2049428110,D2,False,,,1,"NUMBERSTRING(H6,1)&\"\"&\" \"&TEXT(H6,\"#,##0\")&\"\"","NUMBERSTRING(R4C4,1)&\"\"&\"\"&TEXT(R4C4,\"#,##0\")&\"\""," \7,393,243",,0,0
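These records escape quotes as \" rather than doubling them ("") as a standard CSV reader expects, and a lone backslash in the data (as in " \7,393,243") is then indistinguishable from an escape. A minimal sketch of parsing the first record above with Python's csv module under that assumption:

```python
import csv, io

# The first "Original" record above, verbatim (a raw string so the
# backslashes survive).
raw = r'588751075,588751075-2049821326,B2,False,,,1,"NUMBERSTRING(H6,1)&\"\"&\" \"&TEXT(H6,\"#,##0\")&\"\"","NUMBERSTRING(R4C6,1)&\"\"&\"\"&TEXT(R4C6,\"#,##0\")&\"\""," \7,393,243",,0,0'

# The file escapes quotes as \" instead of doubling them, so we must
# declare the escape character explicitly.
row = next(csv.reader(io.StringIO(raw), doublequote=False, escapechar="\\"))

print(row[7])  # formula parses cleanly: NUMBERSTRING(H6,1)&""&" "&TEXT(H6,"#,##0")&""
print(row[9])  # content ' \7,393,243' comes back as ' 7,393,243': the lone
               # backslash is consumed as an escape, which is the bug described
```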

slankas commented 10 years ago

This record was also causing issues: "\"

I've also removed any non-ASCII characters from the cells.csv file.
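A minimal sketch of that cleanup step, assuming plain line-by-line filtering; not necessarily the exact approach used:

```python
# Illustrative one-liner for stripping non-ASCII characters from a line
# of cells.csv before parsing; errors="ignore" simply drops them.
def strip_non_ascii(line: str) -> str:
    return line.encode("ascii", errors="ignore").decode("ascii")
```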

slankas commented 10 years ago

I still have 1,141 bad cell-level records out of the ~35 million records in the cells.csv file.

At this point, I'm just going to drop all of the data for those records after the "nsiblings" field. This will remove the formula and the content.

I think the underlying issue is that if a \ appeared in the cell, it was not properly escaped in the results.
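A sketch of the described truncation; both constants are placeholders, since the exact cells.csv column layout isn't spelled out in this thread (the parser output above hints at 17 fields):

```python
# Hypothetical column positions -- substitute the real cells.csv layout.
NSIBLINGS_IDX = 6
N_FIELDS = 17

def truncate_bad_record(fields: list[str]) -> list[str]:
    # For a record that fails to parse, keep everything up through the
    # "nsiblings" field and blank out the rest (formula and content),
    # padding so every row still has the same number of columns.
    kept = fields[: NSIBLINGS_IDX + 1]
    return kept + [""] * (N_FIELDS - len(kept))
```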

slankas commented 10 years ago

Sent files to Felienne for processing.

slankas commented 10 years ago

Felienne had issues processing the files sent overnight. I've resent the files this morning without trying to fix the IDs on my side. If this doesn't work, we'll need to re-process the entire set over the weekend.

slankas commented 10 years ago

There were duplicates in the set Felienne processed in the morning.

I'm re-running the entire OpenSource set. Currently 1/15 of the set has been processed. I've just scaled up to 37 nodes for processing. I'll provide an update in the morning.

slankas commented 10 years ago

The process running now is 1/3 complete.

slankas commented 10 years ago

With the exception of 4 files, the processing is complete. I'm going to give that process a few more hours to finish before I stop it and send the current results (split only) to Felienne.

CaptainEmerson commented 10 years ago

@Felienne, could you send @barik and me the full CSV result when you get the chance? Thanks!

Felienne commented 10 years ago

I am downloading the files now and will work on this today.

F


Felienne commented 10 years ago

Done