Closed pdurbin closed 5 years ago
0d9d467 is a complete mess but I'm starting to construct a TSV file with the 50 files currently in this repo.
Hi Phil, thanks for opening this issue. I think the outcome that we want for the end of next week is to have the data from this dataset (https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/U5DCAW) in the format described above.
Related to issue https://github.com/IQSS/dataverse/issues/5603. The plan after getting the formatted data to Jess is that she will work with it in Tableau and R. We will meet next in the design meeting, 10am, Oct. 30.
@TaniaSchlatter maybe I misunderstood but I thought the target for the end of next week was a tabular file for the sample files in this repo. I just pushed b6cdf06 and it's full of hacks but I can at least provide you, @erikbuunk and Jess the following tabular file to look at:
files.tsv.txt (I have to add ".txt" to attach it to this GitHub issue).
Here's how it looks in a spreadsheet program (LibreOffice because I'm on Linux today):
I don't know if Jess is on GitHub or not so I'll assign this to you and Erik for review. I'll email Jess separately. Please note that the current sample data does not go to the level of depth we talked about in the meeting. Also, I hard coded the dates to today but I figure Jess can change a few of them manually if she wants to since we're only talking about 50 files.
The next steps after this as I see them:
Thanks @pdurbin for this awesome step. The structure of the data looks to me like what we discussed. The sample data was fine for working out the structure. What we want for Jess by the end of next week is the same formatting applied to real-world Dataverse data from the dataset referenced above, either for 6 months or the full year, depending on the number of lines. We talked about roughly 75,000 lines being reasonable for Jess to work with. Maybe @erikbuunk can give the formatting a review as well to help confirm before moving forward to working with the larger dataset.
The structure looks good.
The data will probably not lead to anything super interesting yet, but it's probably enough to make a start. Every document set has 1 layer and the set with a 2nd layer is the only one in that specific tree (which means one extra circle).
Something like this:
@TaniaSchlatter I'm pretty sure Jess will get value out of the tabular file above. Yes, we'll get her more data. I may need to tap @scolapasta or @jggautier for their SQL-fu. :smile:
@djbrooke especially if Jess gets immediate value out of the file above or even just for fun, we might want to create some more datasets here in the sample data repo that are deeper down in a tree of dataverses. Or we could reorganize the datasets we have. That would resolve the problem of "level 3" columns being empty in the file above.
@erikbuunk thanks! Super helpful!
Oh, I did email the tabular file to Jess by the way. Have a good weekend, all!
@TaniaSchlatter @djbrooke @scolapasta @mheppler @jggautier and I talked about this during design standup this morning.
@djbrooke said he'll do the housekeeping in terms of creating new issues, etc. Thanks!
I just heard from Jess. Sounds like she got the file above and is having fun:
"Super impressive! It's working well. Can't wait to get a few more minutes to more fully explore
Feel free to give me a larger dataset at any point, maybe more representative? This one is very well-behaved."
We are all in 100% agreement that the next step is to give her more data, from production, no matter how ill-behaved it is.
The starting point will probably be the SQL scripts mentioned above. Here they are for safe keeping:
We might as well make them into an API endpoint, I'm thinking, so the issue should probably be created in https://github.com/IQSS/dataverse/issues
Thanks, added https://github.com/IQSS/dataverse/issues/6238
Given sample data files like these four...
We want a tabular file that looks like this:
Here's a downloadable version: data.tsv.txt
The task is to write a script to create this tabular file based on the latest sample data in this repo. (A future task will be to transform it into a nested JSON document to be compatible with the d3 code below.)
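For the future nested-JSON step, here is a minimal Python sketch of one way the flat rows could be folded into a hierarchy. It is only an illustration under assumptions: the column names (`level1`, `level2`, `level3`, `filename`, `size`) are hypothetical and may not match the real TSV headers, and the `name`/`children`/`size` shape is the one commonly seen in d3 hierarchy examples, not something confirmed for the visualization linked in this issue.

```python
import csv
import io
import json

# Hypothetical column names; the real TSV in this repo may use different headers.
LEVEL_COLUMNS = ["level1", "level2", "level3"]
NAME_COLUMN = "filename"
SIZE_COLUMN = "size"


def rows_to_tree(rows, root_name="root"):
    """Fold flat rows into a nested {"name", "children"} hierarchy."""
    root = {"name": root_name, "children": []}
    for row in rows:
        node = root
        for column in LEVEL_COLUMNS:
            label = (row.get(column) or "").strip()
            if not label:
                continue  # skip empty levels, e.g. an unused "level 3" column
            # Reuse an existing branch with this label, or create a new one.
            child = next(
                (c for c in node["children"]
                 if c["name"] == label and "children" in c),
                None,
            )
            if child is None:
                child = {"name": label, "children": []}
                node["children"].append(child)
            node = child
        # The file itself becomes a leaf under the deepest non-empty level.
        node["children"].append(
            {"name": row[NAME_COLUMN], "size": int(row[SIZE_COLUMN])}
        )
    return root


def tsv_to_tree(text, root_name="root"):
    """Parse tab-separated text with a header row and build the tree."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return rows_to_tree(reader, root_name)


if __name__ == "__main__":
    # Made-up rows purely for illustration; not taken from the sample data.
    sample = (
        "level1\tlevel2\tlevel3\tfilename\tsize\n"
        "A\tB\t\tone.txt\t10\n"
        "A\tC\tD\ttwo.txt\t20\n"
    )
    print(json.dumps(tsv_to_tree(sample), indent=2))
```

Skipping empty level columns means a dataset that sits only one or two dataverses deep simply produces a shallower branch, rather than a node with a blank name.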
Here's the first file to show the hierarchy and publication date:
The eventual goal is to come up with something like the Zoomable Circle Packing visualization at http://bl.ocks.org/nbremer/667e4df76848e72f250b and in the screenshot below.
@TaniaSchlatter @erikbuunk and Jess, please let me know if I have any of this wrong!