complexdatacollective / Interviewer

Simplifying complex network data collection.
https://networkcanvas.com
GNU General Public License v3.0

Preload/wrangle as much external data as possible #890

Open jthrilly opened 5 years ago

jthrilly commented 5 years ago

Following from #887

The long-term solution to the performance issues with loading external data is to do this work once, at a non-performance-critical moment (such as when an interview session starts, or when the protocol is installed), and then store the decoded and restructured data. It can then be streamed in as we do with media assets.
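
To make the idea concrete, here is a minimal sketch of a "wrangle on install" step. The file layout and `csvToNetwork` are hypothetical stand-ins, not the real implementation:

```js
// Rough sketch only: wrangle an external data asset once, at protocol
// install time, and cache the result next to the original asset.
// `csvToNetwork` stands in for whatever decode/restructure step we settle on.
const fs = require('fs');
const path = require('path');

const wrangleAsset = (protocolDir, assetName) => {
  const sourcePath = path.join(protocolDir, 'assets', assetName);
  const cachePath = `${sourcePath}.network.json`;

  // Already wrangled on a previous install: nothing to do.
  if (fs.existsSync(cachePath)) return cachePath;

  const raw = fs.readFileSync(sourcePath, 'utf8');
  const network = csvToNetwork(raw); // -> { nodes: [...], edges: [...] }

  fs.writeFileSync(cachePath, JSON.stringify(network));
  return cachePath;
};
```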

We need to think about how this works with "true" remote assets, which are loaded by URL (see #847).

jthrilly commented 5 years ago

@wwqrd said:

I think it would be great to have this generated at interview start. Do we have any way of determining which bits are slow? If it's due to having large files/objects in memory then we may need to think of alternative approaches, like a RESTful API, since we can't stream JSON.

Just to add: this issue doesn't impact JSON so badly. Perhaps we can stream CSV by doing line-by-line conversion.

Either way, our main resource limit will be memory.
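
For reference, the line-by-line conversion might look something like this sketch. It assumes a naive CSV with no quoted commas; a real implementation would want a proper CSV parser:

```js
const fs = require('fs');
const readline = require('readline');

// Stream the CSV one line at a time, so the raw file never has to sit in
// memory as a single string. Note that the decoded nodes still accumulate
// in memory, which is the resource limit mentioned above.
const csvToNodes = (filePath) => new Promise((resolve, reject) => {
  const nodes = [];
  let header = null;

  const input = fs.createReadStream(filePath);
  input.on('error', reject);

  const lines = readline.createInterface({ input, crlfDelay: Infinity });

  lines.on('line', (line) => {
    const values = line.split(',');
    if (!header) {
      header = values; // first line is the header row
      return;
    }
    const attributes = {};
    header.forEach((key, i) => { attributes[key] = values[i]; });
    nodes.push({ attributes });
  });

  lines.on('close', () => resolve(nodes));
});
```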

wwqrd commented 5 years ago

I think line-by-line conversion will help. If we can also move the file reading to the worker script, that would help too; is that possible? Will we still be able to limit the worker script's Node access elsewhere, like in the labeling script?

I'm concerned that once large CSVs are converted to JSON, we will see this issue with JSON too.
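
On the worker question, something like the following sketch would keep all of the iteration off the main thread. The file name, message shape, and `appendToNodeList` are made up for illustration; a plain web worker like this has no Node access unless we explicitly enable it, which speaks to the sandboxing concern:

```js
// csv-worker.js (hypothetical): fetch and parse the CSV entirely off the
// main thread, posting nodes back in batches so the UI stays responsive.
self.onmessage = async ({ data: { url } }) => {
  const text = await (await fetch(url)).text();
  const [headerLine, ...rows] = text.split('\n');
  const header = headerLine.split(',');

  const BATCH_SIZE = 500;
  for (let i = 0; i < rows.length; i += BATCH_SIZE) {
    const nodes = rows.slice(i, i + BATCH_SIZE).map((row) => {
      const values = row.split(',');
      const attributes = {};
      header.forEach((key, j) => { attributes[key] = values[j]; });
      return { attributes };
    });
    self.postMessage({ nodes, done: i + BATCH_SIZE >= rows.length });
  }
};

// Main thread: hand off the URL and append batches as they arrive.
// const worker = new Worker('csv-worker.js');
// worker.onmessage = ({ data: { nodes } }) => appendToNodeList(nodes);
// worker.postMessage({ url: assetUrl });
```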

jthrilly commented 5 years ago

I think it's the CPU intensiveness of all the iteration that kills the CSV performance, coupled with the fact that it happens on the main thread, which is what causes the UI stuttering.

You are right that there are probably also some React and rendering changes that would improve loading JSON data too. I know Bryan was able to find some spurious re-renders in the node list component a while ago, and I wonder if some of those might have come back in.

This is kind of why we framed this issue in terms of "do all the data processing at a moment that is not performance critical (such as the start of the interview)".

Regardless of that, and just to make it really explicit, the problem we are trying to solve with this issue is: locally stored external data in CSV format causes interfaces that use it (name generator, right now) to stutter while loading, without providing any user feedback. We believe the CSV loading operation may simply be too CPU intensive to do in real time.

jthrilly commented 5 years ago

Just to speak to the comment in my initial post about needing to reconcile this with true "remote assets" (those specified as URLs): I think this is where optimizing the rendering of node lists becomes more important, coupled with the asynchronous rendering and loading techniques we have discussed elsewhere, which will allow the rest of the app to continue feeling responsive.

wwqrd commented 4 years ago

This issue may be steered by a related feature: https://github.com/codaco/protocol-validation/issues/38. We currently have the ability to migrate protocols to upgraded schema versions, but not external network files. If network data were added to the protocol at protocol build time, it could use that functionality to support migrations.
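
As a sketch of why embedding helps here (all shapes hypothetical): once external networks live inside the protocol object, a schema migration can rewrite them in the same pass as everything else:

```js
// Hypothetical migration: because externalData would just be part of the
// protocol object, a schema upgrade can transform embedded networks
// alongside the rest of the protocol. `migrateNetwork` is a stand-in.
const migrateProtocol = (protocol) => ({
  ...protocol,
  schemaVersion: protocol.schemaVersion + 1,
  externalData: Object.fromEntries(
    Object.entries(protocol.externalData || {}).map(
      ([name, network]) => [name, migrateNetwork(network)],
    ),
  ),
});
```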

An argument against incorporating this information in the protocol file is that it could be slow for large networks. However, due to the limitations of Cordova (or at least, the resources available to build features specifically for that platform), data is currently loaded into memory all at once anyway, so the potential optimisation there is already lost. One workaround might be to insist that large data sets are driven by URLs.

Once networks are incorporated into the protocol, a further question arises: if the other assets could also be incorporated into this file, then the additional complication of the zip format could be dropped. One place to look might be some kind of blob-based storage format, e.g. https://github.com/mafintosh/content-addressable-blob-store (this looks like it may be Node-only, but the Blob/File API itself is a web API and generic in theory).

jthrilly commented 4 years ago

Excellent write-up, Steve.

The resolution to this issue looks like being the following:

This is actually how external data used to work in the very first protocol file format spec. In the new system we are proposing here, Architect would take network data in whatever format is supplied (we would support CSV, GraphML, and possibly straight JSON, like that used in sigma/d3), and convert it into the NC network JSON format: { nodes: [{}], edges: [{}] }. It would then add this to a property of the protocol object (possibly just under externalData).
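
A sketch of the Architect-side import, where the converter names are placeholders for whatever we end up writing:

```js
// Normalise whatever format was supplied into the NC network shape, then
// embed it in the protocol object. csvToNetwork / graphmlToNetwork /
// jsonToNetwork are placeholders for the actual converters.
const converters = {
  csv: csvToNetwork,
  graphml: graphmlToNetwork,
  json: jsonToNetwork, // e.g. sigma/d3-style { nodes, edges } input
};

const importExternalData = (protocol, name, format, contents) => ({
  ...protocol,
  externalData: {
    ...protocol.externalData,
    [name]: converters[format](contents), // -> { nodes: [{}], edges: [{}] }
  },
});
```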

This process could also work with the codebook transposition step. If the external data has entity attributes that exist in the codebook (matched by name), these could be mapped to the appropriate variable UUIDs, and attributes that do not match could prompt the user to create variables for them automatically (specifying only the variable type).
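
The transposition could be something like this sketch (the codebook shape here is assumed, not checked against the real one):

```js
// Re-key attributes whose names match a codebook variable to that
// variable's UUID; collect the rest so the user can be prompted to
// create variables for them.
const transposeAttributes = (attributes, codebookVariables) => {
  const uuidsByName = Object.fromEntries(
    Object.entries(codebookVariables).map(
      ([uuid, variable]) => [variable.name, uuid],
    ),
  );

  const mapped = {};
  const unmatched = [];

  Object.entries(attributes).forEach(([name, value]) => {
    if (uuidsByName[name]) {
      mapped[uuidsByName[name]] = value;
    } else {
      unmatched.push(name); // candidates for auto-created variables
    }
  });

  return { mapped, unmatched };
};
```
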

We should think about how this might work with remote assets. Perhaps we could structure the externalData structure in such a way as to have a URI if the asset is remote, and have that piece of state resolve to the actual data once it is fetched. We could then have a generic async selector that returns either (1) the data if it is locally embedded or already available, or (2) triggers a fetch action that updates the state with the data, and then returns it.