Currently looking to support three export formats initially:
@pfsj - how would you feel about having CSV and GraphML be our formats for R? It looks like iGraph can import either. ("as_edgelist, pajek, graphml, gml, ncol, lgl, dimacs and graphdb": http://igraph.org/r/doc/igraph.pdf). Are there reasons to provide an R specific format?
Just to add that we will want to rewrite our graphML generation, because the current one is basically hacked together from my prototype and is wonky as all hell.
CSV is definitely harder than you'd think to get to work with the main two packages. GraphML is easier but not something folks are entirely familiar with (and does not work for statnet/sna packages). The idea was to make it non-package dependent but I think it's a reasonable cost saver and we can still make it relatively easy with clear code/documentation of converting things.
I can't find any clear documentation on the formats supported by statnet. CSV (adjacency matrix + vertex attribute list) seems to be the way most tutorials use data. Similar with the sna package.
Seems like we need (in terms of CSV formats):
The above + graphML seems to cover everything. Or are there other formats better for statnet/sna? Can you confirm @pfsj?
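For reference, here is a rough sketch of the CSV pair described above (adjacency matrix plus vertex attribute list). The network shape (`nodes` carrying a `_uid`, `edges` carrying `from`/`to`) and both function names are assumptions for illustration, not the actual NC format.

```python
import csv
import io

def adjacency_matrix_csv(nodes, edges, directed=False):
    """Return an adjacency matrix as CSV text, with rows and columns labelled by _uid."""
    uids = [n["_uid"] for n in nodes]
    index = {uid: i for i, uid in enumerate(uids)}
    matrix = [[0] * len(uids) for _ in uids]
    for e in edges:
        i, j = index[e["from"]], index[e["to"]]
        matrix[i][j] = 1
        if not directed:
            # Undirected ties are mirrored across the diagonal.
            matrix[j][i] = 1
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([""] + uids)  # header row: empty corner cell, then column labels
    for uid, row in zip(uids, matrix):
        writer.writerow([uid] + row)
    return out.getvalue()

def attribute_list_csv(nodes):
    """Return one row per node; columns are the union of all attribute names."""
    columns = ["_uid"] + sorted({k for n in nodes for k in n if k != "_uid"})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(nodes)  # attributes a node lacks are written as empty cells
    return out.getvalue()
```

Keeping the attribute list as a separate file keyed on the same `_uid` labels is what most of the statnet/sna tutorials seem to expect, but that would need confirming against real data.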
Pat has confirmed the above. @bryfox - could you reach out to Pat or me with any questions you have about this?
- For adjacency matrices and lists, do we use network canvas _uids as identifiers?
Given the formats we're currently using, GraphML export relies on both the variable registry and network data. The variable registry can be updated by an interviewer, though protocols aren't versioned. Technically, this means we have no way of knowing (again, with today's formats) if the types defined in the variable registry were defined the same way when the network was produced in NC. And consistency between sessions isn't enforced. I could see this being unexpected, or a feature.
Are the existing formats sufficient, or do we need to define types more carefully in export from NC? If the latter, how do we handle conflicting types on export from Server?
(Edit: other export formats will also be affected by edge directionality, which is proposed as a protocol-level property; see https://github.com/codaco/Network-Canvas/issues/477.)
Sociogram edges have a direction internally, but that isn't really exposed in the UI there. Should I assume directed edges for export? Do we need an option here until that's present in NC?
We have talked about implementing directed edges later on in the development process so we definitely want some way to indicate that in the data export.
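For what it's worth, GraphML records directionality on the `<graph>` element through its `edgedefault` attribute, so an export option could map straight onto it. A minimal sketch, assuming a `directed` flag supplied by the user and a network shaped as `nodes` with `_uid` and `edges` with `from`/`to` (both assumptions, not the current NC format):

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def to_graphml(nodes, edges, directed=False):
    """Serialize a network to a GraphML string with an explicit edgedefault."""
    # Serialize GraphML elements without a namespace prefix.
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(f"{{{GRAPHML_NS}}}graphml")
    graph = ET.SubElement(root, f"{{{GRAPHML_NS}}}graph", {
        "id": "G",
        "edgedefault": "directed" if directed else "undirected",
    })
    for n in nodes:
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}node", {"id": n["_uid"]})
    for i, e in enumerate(edges):
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}edge", {
            "id": f"e{i}",
            "source": e["from"],
            "target": e["to"],
        })
    return ET.tostring(root, encoding="unicode")
```

The CSV formats have no equivalent flag, so there the choice would show up in the data itself (for example, whether the adjacency matrix is symmetric).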
My thoughts on a couple of these:
- Is one able to export a single file from a union of multiple sessions, or does 'single file' export just refer to exporting a single session?
I'm not clear what a "session" here refers to (I'm not up on the internal definitions) but I believe the reference to 'single file' just means exporting multiple interviews into a single dataset (although in the case of the CSV a "single dataset" would technically have at least two files - node attributes and edges). This is in contrast to how personal network data is sometimes exported as separate files for each interview.
- I assume exporting 'multiple files' means each session is written to a separate file, but the actual download is something like a single zip of multiple CSVs.
This makes sense to me. Although, I'm not sure if we even need to support this anymore, @jthrilly?
That makes sense, thanks. 'Session' is synonymous with 'interview' to me. I think the Server UI uses the word 'session' in a couple of places; I can change that to 'interview' if it makes more sense.
- I assume filtering & export is at the level of a protocol, and that one wouldn’t export data from different protocols as a single file
Yep. The protocol "workspace" (or whatever we are calling them) is the unit here.
- Are we still supporting ‘multiple file’ export?
I would like to do this, but I don't want to make the development too complicated. If just doing a single format at a time simplifies things, then please go ahead and do that.
I had imagined that a UI for this might consist of some sort of multi-select accordion, where selecting the top level file format for export would expand the panel to reveal any export options associated with that format.
- For adjacency matrices and lists, do we use network canvas _uids as identifiers?
Yes, I think this makes sense. However, this might be something that could be parameterised and specified in the options (see above)?
- Are there requirements or design mockups for the question flow?
There aren't. If it would be helpful to have them, I can definitely work with Pat to produce them for you.
- does NC export need to be strongly typed?
I think the strictness will depend on the file format for the export, and will definitely need to be tweaked after the fact. Unfortunately, some common network analysis applications behave in unusual ways with regard to certain data types, even if the underlying file format is unambiguous about their representation.
Let's start with enforcing the data types as closely as possible, and then relax this as we learn more.
Let's start with enforcing the data types as closely as possible
I see two possible ways of doing this:
"age": 23
could become "age": { "value": 23, "type": "integer" }
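To make the inline-typed shape concrete, here is a rough sketch of a helper that wraps each value with a type tag. Only `"integer"` comes from the example above; the other type names (`"boolean"`, `"number"`, `"string"`) and the function names are placeholders I made up.

```python
def tag_value(value):
    """Wrap a raw value as {"value": ..., "type": ...}."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        type_name = "boolean"
    elif isinstance(value, int):
        type_name = "integer"
    elif isinstance(value, float):
        type_name = "number"
    elif isinstance(value, str):
        type_name = "string"
    else:
        raise TypeError(f"no type tag defined for {type(value).__name__}")
    return {"value": value, "type": type_name}

def tag_attributes(attributes):
    """Apply tag_value to every attribute of a node or edge."""
    return {name: tag_value(value) for name, value in attributes.items()}
```

Inferring the tag from the value only works if NC stores values with the right runtime type in the first place; otherwise the tag would have to come from the variable registry at interview time.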
‘multiple file’ export
I had assumed this referred to exporting multiple individual interviews; is there no way to do that, then? Is a user only able to export the union of all interviews?
re: specifying the adjacency labels in the options (i.e., instead of "_uid"), how do we treat values that may not be unique? Best effort? Error? Warn but deliver output?
- export the variable registry along with each session. This mimics behavior of NC, but would result in a lot of overhead/waste, and probably hinder scalability on Server.
Doesn't server already have the variable registry, since it has the protocol file? 🤔
I had assumed this referred to exporting multiple individual interviews; is there no way to do that, then? Is a user only able to export the union of all interviews?
I'm sorry, I thought you were referring to multiple file types! Whoops.
You are correct that we want to be able to choose between a monolithic network export, and individual sessions in individual files.
re: specifying the adjacency labels in the options (i.e., instead of "_uid"), how do we treat values that may not be unique? Best effort? Error? Warn but deliver output?
I hadn't thought about this. It seems complex, so let's abandon that for now and just use _uid. Users can tweak things after the fact.
Doesn't server already have the variable registry, since it has the protocol file?
It has a registry, but it need not match the version that was used for the interviews. In the original question, I said this might be a feature or a bug, since it's likely going to gloss over any type changes or mismatches.
The current approach is to assume that the version Server has is the same as (or close enough to) the variable registry that was used when the interview took place. That is: Server and NC work with the current network format, which does not encapsulate these 'extended' data types.
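If we ever want to surface mismatches rather than gloss over them, one option would be to check each value against the registry Server holds at export time. A sketch under assumptions: the registry shape (`{variable: {"type": ...}}`), the type names, and the function name are all hypothetical.

```python
# Hypothetical mapping from registry type names to value checks.
TYPE_CHECKS = {
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}

def find_type_conflicts(registry, nodes):
    """Return (uid, variable, declared_type, value) for each value that
    does not match the type declared in the registry."""
    conflicts = []
    for node in nodes:
        for name, value in node.items():
            if name == "_uid":
                continue
            declared = registry.get(name, {}).get("type")
            check = TYPE_CHECKS.get(declared)
            # A variable missing from the registry (declared is None) also counts.
            if check is None or not check(value):
                conflicts.append((node["_uid"], name, declared, value))
    return conflicts
```

Export could then warn and continue, which would keep the current "close enough" behaviour while making the mismatch visible.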
The data export process should guide users through a series of questions that lead to the selection of the type/location of data export.
1) Export in single or multiple files
2) Filter data based on criteria (e.g., only certain edges)
3) Select file type (based on possible types that apply given previous steps) and download location
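The first two steps could be sketched roughly as below; step 3 (choosing a file type and location) would then serialize each returned network. The `sessions` shape (a mapping of session id to `{"nodes": [...], "edges": [...]}`), the `keep_edge` predicate, and the `all_sessions` name are assumptions, and the union case assumes `_uid`s are unique across sessions.

```python
def plan_export(sessions, single_file, keep_edge=lambda edge: True):
    """Return a {file_stem: network} mapping for the chosen options."""
    def filtered(network):
        # Step 2: keep only the edges that match the filter criteria.
        kept = [e for e in network["edges"] if keep_edge(e)]
        return {"nodes": list(network["nodes"]), "edges": kept}

    # Step 1 (multiple files): one network per session.
    if not single_file:
        return {sid: filtered(net) for sid, net in sessions.items()}

    # Step 1 (single file): union of all sessions.
    union = {"nodes": [], "edges": []}
    for net in sessions.values():
        part = filtered(net)
        union["nodes"].extend(part["nodes"])
        union["edges"].extend(part["edges"])
    return {"all_sessions": union}
```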
Discussion of file types can be found in #2 and #7