Closed CaptainEmerson closed 10 years ago
The results, as they currently are, are too big for me to analyze. For starters, PowerPivot cannot open files larger than 2 GB.
Can someone split the analyses of this set into (for example) 3 separate groups of files? This has to be done in such a way that each group is self-contained (has the spreadsheets and the cells that go with them).
An alternative, by the way, is to run the scantool again on the files. I have changed the output slightly so that all IDs are unique by construction, which enables me to load all results into Neo4j, which can process large files. But I do not know how feasible that is.
I can re-process the open source corpus again, but that will take 2-3 days. If that's the best direction, I can start that this morning.
John
On Aug 27, 2014, at 8:12 AM, Felienne notifications@github.com wrote:
Cutting down the result files is easier, no? I know @slankas at least started with a bunch of smaller files, so aggregating them into several ~1 GB files shouldn't be hard. Can you easily do that, @slankas?
Otherwise, I can just take the big file and divide it up using Unix split.
I can do the split of the result files. We'd need to be a little smart with the splits to ensure all of the IDs appear together (i.e., we can't just cut the files into 3 or 4 equal pieces).
A few files of 1 GB each should be fine, as long as we make sure the keys match (the cells of a spreadsheet are in the same subgroup), so not just cutting all files into four pieces or something, or we will miscount.
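A minimal sketch of the key-aware split described above, assuming the spreadsheet ID is the first column of cells.csv (the function and file names here are hypothetical, not from the actual tooling). Hashing the ID to a bucket guarantees every cell of a spreadsheet lands in the same part:

```python
import csv
import os
import zlib

def split_by_spreadsheet(cells_path, out_dir, n_parts=3):
    """Split a cells CSV into n_parts files such that every row for a
    given spreadsheet (first column = spreadsheet id) lands in the
    same part, so cells stay with their spreadsheet."""
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, "cells_%d.csv" % i), "w", newline="")
            for i in range(n_parts)]
    writers = [csv.writer(f) for f in outs]
    with open(cells_path, newline="") as src:
        for row in csv.reader(src):
            # crc32 is stable across runs (unlike Python's built-in str
            # hash), so the same id always maps to the same part.
            writers[zlib.crc32(row[0].encode()) % n_parts].writerow(row)
    for f in outs:
        f.close()
```

The same bucketing function would need to be applied to the spreadsheets file too, so the spreadsheet-level and cell-level parts line up.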
Go ahead, @slankas
Looks like there are issues in how characters are escaped in the cells.csv file.
The content field (third from the last) in this example has a \ but the next character after that is a 7.

Parsed records:
3(17): [588751075, 588751075-2049821326, B2, False, , , 1, NUMBERSTRING(H6,1)&""&" \, \"#, ##0\")&\"\"", NUMBERSTRING(R4C6,1)&""&"\, \"#, ##0\")&\"\"", ?93,243, , 0, 0]
4(17): [588751075, 588751075-2049886862, C2, False, , , 1, NUMBERSTRING(H6,1)&""&" \, \"#, ##0\")&\"\"", NUMBERSTRING(R4C5,1)&""&"\, \"#, ##0\")&\"\"", ?93,243, , 0, 0]
5(17): [588751075, 588751075-2049428110, D2, False, , , 1, NUMBERSTRING(H6,1)&""&" \, \"#, ##0\")&\"\"", NUMBERSTRING(R4C4,1)&""&"\, \"#, ##0\")&\"\"", ?93,243, , 0, 0]
Original:
588751075,588751075-2049821326,B2,False,,,1,"NUMBERSTRING(H6,1)&\"\"&\" \"&TEXT(H6,\"#,##0\")&\"\"","NUMBERSTRING(R4C6,1)&\"\"&\"\"&TEXT(R4C6,\"#,##0\")&\"\""," \7,393,243",,0,0
588751075,588751075-2049886862,C2,False,,,1,"NUMBERSTRING(H6,1)&\"\"&\" \"&TEXT(H6,\"#,##0\")&\"\"","NUMBERSTRING(R4C5,1)&\"\"&\"\"&TEXT(R4C5,\"#,##0\")&\"\""," \7,393,243",,0,0
588751075,588751075-2049428110,D2,False,,,1,"NUMBERSTRING(H6,1)&\"\"&\" \"&TEXT(H6,\"#,##0\")&\"\"","NUMBERSTRING(R4C4,1)&\"\"&\"\"&TEXT(R4C4,\"#,##0\")&\"\""," \7,393,243",,0,0
This record was also causing issues "\"
I've also removed any non-ascii characters from the cells.csv file.
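One minimal way to do that non-ASCII stripping (a sketch of the idea, not necessarily the exact filter used) is to round-trip each line through an ASCII encode with `errors="ignore"`, which silently drops anything outside the 7-bit range:

```python
def strip_non_ascii(text):
    """Drop any characters outside the 7-bit ASCII range.

    encode(..., errors="ignore") discards unencodable characters
    instead of raising, so the result is pure ASCII.
    """
    return text.encode("ascii", errors="ignore").decode("ascii")
```

Note this shortens affected fields rather than substituting a placeholder, so values containing currency symbols or accented letters lose those characters entirely.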
I still have 1,141 bad cell-level records out of the ~35 million records in the cells.csv file.
At this point, I'm just going to drop all of the data for those records after the "nsiblings" field. This will remove the formula and the content.
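That cleanup could be sketched as follows; the column index of "nsiblings" here is a placeholder (the real position would come from the cells.csv header), and blanking rather than deleting the trailing fields keeps every row at the same arity so downstream parsers don't break:

```python
# Hypothetical column position of the "nsiblings" field; the actual
# index must be taken from the cells.csv header row.
NSIBLINGS_INDEX = 6

def truncate_bad_record(fields):
    """For an unparseable record, keep everything up to and including
    nsiblings and blank out the rest (formula, content, etc.),
    preserving the original number of columns."""
    kept = fields[:NSIBLINGS_INDEX + 1]
    return kept + [""] * (len(fields) - len(kept))
```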
I think the underlying issue is that when \ appeared in the cell, it was not properly escaped in the results.
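The original export appears to have backslash-escaped embedded quotes, which collides with cell values that contain a literal \. A sketch of how RFC 4180-style quoting (the default in Python's stdlib csv module) avoids the problem: embedded quotes are doubled rather than escaped, so a backslash in the data is never mistaken for an escape character:

```python
import csv
import io

def roundtrip(fields):
    """Write one CSV row with quote-doubling (no escape character),
    then parse it back with the same dialect."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(fields)
    buf.seek(0)
    return next(csv.reader(buf))

# Values like the ones that broke: a formula full of quotes and
# commas, and a content string with a literal backslash before a digit.
formula = 'NUMBERSTRING(H6,1)&""&" "&TEXT(H6,"#,##0")&""'
content = " \\7,393,243"
```

With this dialect both values survive a write/read cycle unchanged, whereas a backslash-escaping writer paired with a quote-doubling reader produces exactly the kind of mis-parsed records shown above.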
Sent files to Felienne for processing.
Felienne had issues processing the files sent overnight. I've resent the files this morning without trying to fix the IDs on my side. If this doesn't work, we'll need to re-process the entire set over the weekend.
There were duplications in the set Felienne processed in the morning.
I'm re-running the entire OpenSource set. Currently 1/15 of the set has been processed; I've just scaled up to 37 processing nodes. I'll provide an update in the morning.
The process is now 1/3 complete and still running.
With the exception of 4 files, the processing is complete. I'm going to give that process a few more hours to finish before I stop it and send the current results (with splitting only) to Felienne.
@Felienne, could you send @barik and me the full CSV result when you get the chance? Thanks!
I am downloading the files now and will work on this today.
F
On 1 September 2014 03:18, CaptainEmerson notifications@github.com wrote:
Done
All 1.5 million of 'em! This bug's important!