Data export interface - Githubissues

pfsj commented 6 years ago

The data export process should guide users through a series of questions that lead to the selection of the type/location of data export.

1) Export in single or multiple files
2) Filter data based on criteria (e.g., only certain edges)
3) Select file type (based on possible types that apply given previous steps) and download location

Discussion of file types can be found in #2 and #7

bryfox commented 5 years ago

We're looking to support three export formats initially:

- GraphML (already implemented in NC)
- CSV
- an intermediate format (TBD) for better integration with R

jthrilly commented 5 years ago

@pfsj - how would you feel about having CSV and GraphML be our formats for R? It looks like iGraph can import either. ("as_edgelist, pajek, graphml, gml, ncol, lgl, dimacs and graphdb": http://igraph.org/r/doc/igraph.pdf). Are there reasons to provide an R specific format?

jthrilly commented 5 years ago

Just to add that we will want to rewrite our GraphML generation, because the current one is basically hacked together from my prototype and is wonky as all hell.

pfsj commented 5 years ago

> @pfsj - how would you feel about having CSV and GraphML be our formats for R? It looks like iGraph can import either. ("as_edgelist, pajek, graphml, gml, ncol, lgl, dimacs and graphdb": http://igraph.org/r/doc/igraph.pdf). Are there reasons to provide an R specific format?

CSV is definitely harder than you'd think to get to work with the main two packages. GraphML is easier but not something folks are entirely familiar with (and does not work for statnet/sna packages). The idea was to make it non-package dependent but I think it's a reasonable cost saver and we can still make it relatively easy with clear code/documentation of converting things.

jthrilly commented 5 years ago

I can't find any clear documentation on the formats supported by statnet. CSV (adjacency matrix + vertex attribute list) seems to be the way most tutorials use data. Similar with the sna package.

Seems like we need (in terms of CSV formats):

- Adjacency matrix
- Edge list (different format from matrix?)
- Attributes list

The above + GraphML seems to cover everything. Or are there other formats better for statnet/sna? Can you confirm @pfsj?
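For illustration, here is a minimal sketch of how the three CSV shapes listed above could be produced from one network. The `network` structure and its field names (`from`, `to`, `type`) are assumptions made for the example, not the actual NC or Server schema.

```python
import csv
import io

# Hypothetical in-memory network; field names are assumptions for illustration.
network = {
    "nodes": [
        {"_uid": "a", "name": "Ann", "age": 23},
        {"_uid": "b", "name": "Bob", "age": 31},
        {"_uid": "c", "name": "Cat", "age": 40},
    ],
    "edges": [
        {"from": "a", "to": "b", "type": "friend"},
        {"from": "b", "to": "c", "type": "friend"},
    ],
}


def to_csv(rows):
    """Serialize a list of rows (lists of values) to a CSV string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def adjacency_matrix(net, directed=False):
    """Square 0/1 matrix with node ids as row and column labels."""
    ids = [node["_uid"] for node in net["nodes"]]
    index = {uid: i for i, uid in enumerate(ids)}
    cells = [[0] * len(ids) for _ in ids]
    for edge in net["edges"]:
        i, j = index[edge["from"]], index[edge["to"]]
        cells[i][j] = 1
        if not directed:
            cells[j][i] = 1  # mirror the tie for undirected networks
    header = [""] + ids
    return to_csv([header] + [[uid] + row for uid, row in zip(ids, cells)])


def edge_list(net):
    """One row per edge, unlike the matrix, which has one cell per node pair."""
    rows = [["from", "to", "type"]]
    rows += [[e["from"], e["to"], e["type"]] for e in net["edges"]]
    return to_csv(rows)


def attribute_list(net):
    """One row per node, one column per attribute seen on any node."""
    columns = sorted({key for node in net["nodes"] for key in node})
    rows = [columns] + [[node.get(c, "") for c in columns] for node in net["nodes"]]
    return to_csv(rows)
```

The edge list and the matrix carry the same ties. The edge list also has room for per-edge attributes such as `type`, which the matrix cannot hold.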

jthrilly commented 5 years ago

Pat has confirmed the above. @bryfox - could you reach out to Pat or me with any questions you have about this?

bryfox commented 5 years ago

- I assume filtering & export is at the level of a protocol, and that one wouldn't export data from different protocols as a single file.
- Is one able to export a single file from a union of multiple sessions, or does 'single file' export just refer to exporting a single session?
- I assume exporting 'multiple files' means each session is written to a separate file, but the actual download is something like a single zip of multiple CSVs.
- For adjacency matrices and lists, do we use network canvas _uids as identifiers?
- Are there requirements or design mockups for the question flow?

bryfox commented 5 years ago

Given the formats we're currently using, GraphML export relies on both the variable registry and network data. The variable registry can be updated by an interviewer, though protocols aren't versioned. Technically, this means we have no way of knowing (again, with today's formats) if the types defined in the variable registry were defined the same way when the network was produced in NC. And consistency between sessions isn't enforced. I could see this being unexpected, or a feature.

Are the existing formats sufficient, or do we need to define types more carefully in export from NC? If the latter, how do we handle conflicting types on export from Server?

(Edit: other export formats will also be affected by edge directionality, which is proposed as a protocol-level property; see https://github.com/codaco/Network-Canvas/issues/477.)

bryfox commented 5 years ago

Sociogram edges have a direction internally, but that isn't really exposed in the UI there. Should I assume directed edges for export? Do we need an option here until that's present in NC?

pfsj commented 5 years ago

> Sociogram edges have a direction internally, but that isn't really exposed in the UI there. Should I assume directed edges for export? Do we need an option here until that's present in NC?

We have talked about implementing directed edges later on in the development process so we definitely want some way to indicate that in the data export.
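For the GraphML case, one way to indicate directionality is the standard `edgedefault` attribute on the `<graph>` element. The sketch below assumes a hypothetical `build_graphml` helper and a boolean export option; it is not the existing NC generator.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def build_graphml(nodes, edges, directed):
    """Build a minimal GraphML string; `directed` selects the edgedefault."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    graph = ET.SubElement(
        root, "graph", {"edgedefault": "directed" if directed else "undirected"}
    )
    for uid in nodes:
        ET.SubElement(graph, "node", {"id": uid})
    for source, target in edges:
        # source/target are always written; edgedefault tells readers
        # whether that ordering is meaningful.
        ET.SubElement(graph, "edge", {"source": source, "target": target})
    return ET.tostring(root, encoding="unicode")
```

CSV exports have no equivalent header, so there the choice would have to be reflected in the data itself (for example, a symmetric versus asymmetric adjacency matrix).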

pfsj commented 5 years ago

My thoughts on a couple of these:

> Is one able to export a single file from a union of multiple sessions, or does 'single file' export just refer to exporting a single session?

I'm not clear what a "session" here refers to (I'm not up on the internal definitions) but I believe the reference to 'single file' just means exporting multiple interviews into a single dataset (although in the case of the CSV a "single dataset" would technically have at least two files - node attributes and edges). This is in contrast to how personal network data is sometimes exported as separate files for each interview.

> I assume exporting 'multiple files' means each session is written to a separate file, but the actual download is something like a single zip of multiple CSVs.

This makes sense to me. Although, I'm not sure if we even need to support this anymore, @jthrilly?

bryfox commented 5 years ago

That makes sense, thanks. 'Session' is synonymous with 'interview' to me. I think the Server UI uses the word 'session' in a couple of places; I can change that to 'interview' if it makes more sense.

jthrilly commented 5 years ago

> I assume filtering & export is at the level of a protocol, and that one wouldn’t export data from different protocols as a single file

Yep. The protocol "workspace" (or whatever we are calling them) is the unit here.

> Are we still supporting ‘multiple file’ export?

I would like to do this, but I don't want to make the development too complicated. If just doing a single format at a time simplifies things, then please go ahead and do that.

I had imagined that a UI for this might consist of some sort of multi-select accordion, where selecting the top level file format for export would expand the panel to reveal any export options associated with that format.

> For adjacency matrices and lists, do we use network canvas _uids as identifiers?

Yes, I think this makes sense. However, this might be something that could be parameterised and specified in the options (see above)?

> Are there requirements or design mockups for the question flow?

There aren't. If it would be helpful to have them, I can definitely work with Pat to produce them for you.

> Does NC export need to be strongly typed?

I think the strictness will depend on the file format for the export, and will definitely need to be tweaked after the fact. Unfortunately, some common network analysis applications behave in unusual ways with regard to certain data types, even if the underlying file format is unambiguous about their representation.

Let's start with enforcing the data types as closely as possible, and then relax this as we learn more.

bryfox commented 5 years ago

> Let's start with enforcing the data types as closely as possible

I see two possible ways of doing this:

1) export the variable registry along with each session. This mimics behavior of NC, but would result in a lot of overhead/waste, and probably hinder scalability on Server.
2) change the server export format (or at least db representation) entirely. e.g., "age": 23 could become "age": { "value": 23, "type": "integer" }
   - biggest downside: we can no longer share networkQuery functionality for filtering (at least, as it is; we could have conditionals for different network formats.)
   - ...unless we change the network format internal to NC as well
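To make the second option concrete, here is a rough sketch of wrapping flat attribute values with their declared types, plus the inverse step that code expecting plain values (such as shared filtering) would need. The `registry` shape and both function names are hypothetical.

```python
# Hypothetical variable registry: maps variable names to declared types.
registry = {"age": {"type": "integer"}, "name": {"type": "text"}}


def annotate(attributes, registry):
    """Wrap each value with its declared type, e.g. 23 -> {"value": 23, "type": "integer"}."""
    return {
        key: {"value": value, "type": registry.get(key, {}).get("type", "unknown")}
        for key, value in attributes.items()
    }


def strip_types(annotated):
    """Recover the flat form for code that expects plain values."""
    return {key: entry["value"] for key, entry in annotated.items()}
```

Under these assumptions, `annotate({"age": 23}, registry)` gives `{"age": {"value": 23, "type": "integer"}}`, and `strip_types` reverses it. Any consumer of the flat form would need that extra step, which is the downside noted above.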

bryfox commented 5 years ago

> ‘multiple file’ export

I had assumed this referred to exporting multiple individual interviews; is there no way to do that, then? Is a user only able to export the union of all interviews?

bryfox commented 5 years ago

re: specifying the adjacency labels in the options (i.e., instead of "_uid"), how do we treat values that may not be unique? Best effort? Error? Warn but deliver output?
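The three policies could look roughly like this. It is only a sketch: `check_labels` and its `on_duplicate` parameter are made-up names, and "best effort" is interpreted here as silently passing duplicates through.

```python
from collections import Counter
import warnings


def check_labels(nodes, label_key, on_duplicate="warn"):
    """Return the labels for `nodes`, handling duplicates per `on_duplicate`.

    on_duplicate: "error" raises, "warn" warns but still returns the labels,
    anything else (e.g. "ignore") returns the labels silently.
    """
    labels = [node.get(label_key) for node in nodes]
    duplicates = sorted(
        str(label) for label, count in Counter(labels).items() if count > 1
    )
    if duplicates:
        message = f"non-unique values for {label_key!r}: {', '.join(duplicates)}"
        if on_duplicate == "error":
            raise ValueError(message)
        if on_duplicate == "warn":
            warnings.warn(message)
    return labels
```

With non-unique labels, an adjacency matrix would have ambiguous rows and columns, so "warn but deliver" still leaves the user to disambiguate afterwards.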

jthrilly commented 5 years ago

> export the variable registry along with each session. This mimics behavior of NC, but would result in a lot of overhead/waste, and probably hinder scalability on Server.

Doesn't server already have the variable registry, since it has the protocol file? 🤔

> I had assumed this referred to exporting multiple individual interviews; is there no way to do that, then? Is a user only able to export the union of all interviews?

I'm sorry, I thought you were referring to multiple file types! Whoops.

You are correct that we want to be able to choose between a monolithic network export, and individual sessions in individual files.
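A minimal sketch of that choice, assuming each session is just a list of `(from, to)` edge pairs keyed by a session id (hypothetical shapes and filenames, and ids without commas or quotes so plain string joining is safe):

```python
import io
import zipfile


def export_sessions(sessions, as_single_file):
    """Return {filename: csv_text}; `sessions` maps session id -> edge pairs."""
    header = "from,to\n"
    if as_single_file:
        # Monolithic export: union of all interviews, with a column
        # recording which session each edge came from.
        body = "".join(
            f"{sid},{a},{b}\n" for sid, rows in sessions.items() for a, b in rows
        )
        return {"edges.csv": "session," + header + body}
    # Individual export: one file per session.
    return {
        f"{sid}_edges.csv": header + "".join(f"{a},{b}\n" for a, b in rows)
        for sid, rows in sessions.items()
    }


def zip_files(files):
    """Bundle the files into a single in-memory zip for download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()
```

Either way the user downloads one artifact; only the contents differ.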

> re: specifying the adjacency labels in the options (i.e., instead of "_uid"), how do we treat values that may not be unique? Best effort? Error? Warn but deliver output?

I hadn't thought about this. It seems complex, so let's abandon that for now and just use _uid. Users can tweak things after the fact.

bryfox commented 5 years ago

> Doesn't server already have the variable registry, since it has the protocol file?

It has a registry, but it need not match the version that was used for the interviews. In the original question, I said this might be a feature or a bug, since it's likely going to gloss over any type changes or mismatches.

The current approach is to assume that the version Server has is the same as (or close enough to) the variable registry that was used when the interview took place. That is: Server and NC work with the current network format, which does not encapsulate these 'extended' data types.
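If that assumption ever needs checking, a pass like the following could flag values that disagree with the registry Server currently holds. This is a sketch under assumed shapes: the `PYTHON_TYPES` mapping, the registry layout, and the function name are all hypothetical.

```python
# Hypothetical mapping from registry type names to Python types.
PYTHON_TYPES = {"integer": int, "text": str, "boolean": bool}


def find_type_mismatches(nodes, registry):
    """List (node id, variable, declared type, actual type) for each mismatch."""
    mismatches = []
    for node in nodes:
        for key, value in node.items():
            declared = registry.get(key, {}).get("type")
            expected = PYTHON_TYPES.get(declared)
            # Exact type comparison, so that True is not accepted as an integer.
            if expected is not None and type(value) is not expected:
                mismatches.append(
                    (node.get("_uid"), key, declared, type(value).__name__)
                )
    return mismatches
```

Variables missing from the registry are skipped rather than reported, which matches the "close enough" approach described above.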

complexdatacollective / Server

Data export interface #22