dondi / GRNsight

Web app and service for modeling and visualizing gene regulatory networks.
http://dondi.github.io/GRNsight
BSD 3-Clause "New" or "Revised" License
17 stars 8 forks source link

Give a warning or error if adjacency matrix is too large #165

Closed dondi closed 9 years ago

dondi commented 9 years ago

When a file upload is taking a while, the entire app actually remains active and the user can interact with it. This can cause issues when the upload does finish, including when the upload is too large and the server responds with error 413. Thus it is probably better to display a UI-blocking modal showing that an upload is in progress. Most of the time, this will be up very briefly. But at least, for longer uploads, the user can wait. The modal should show an indeterminate progress bar with a cancel button. Cancelling the upload can be acknowledged, but because we cannot actually cancel a network request, the code would wait for the response to come back, then simply ignore it.

Revision: Do not show the modal until after 1 second has elapsed; this will avoid a "flash" effect when the upload turns out to be quick. Requires coordination between setTimeout, clearTimeout, and the callback.

kdahlquist commented 9 years ago

Now that incorrect file types can no longer be uploaded, the issue is narrowed to very large .xlsx files. @anuvarsh to fabricate and test such a file first.

anuvarsh commented 9 years ago

The following .xlsx test files can be found in test-files>graph-tests>different-sized-networks

1-gene-0-edges.xlsx: functional and appears instantly 1-gene-1-edges.xlsx: functional and appears instantly 5-genes-max-edges.xlsx: functional and appears instantly 7-genes-max-edges.xlsx: functional and appears instantly 9-genes-max-edges.xlsx: functional and appears instantly 10-genes-0-edges.xlsx: This spreadsheet returned an error similar to the missing value error in #189. After copy and pasting the matrix into a different sheet, everything is functional. 10-genes-50-edges.xlsx: functional and appears instantly 10-genes-90-edges.xlsx: functional and appears instantly 10-genes-max-edges.xlsx: functional and appears instantly 20-genes-max-edges.xlsx: functional and appears instantly 25-genes-max-edges.xlsx: functional and appears instantly 30-genes-max-edges.xlsx: functional and appears instantly 35-genes-max-edges.xlsx: functional but appears with slight lag 40-genes-max-edges.xlsx: functional but appears with slight lag 42-genes-max-edges.xlsx: functional but appears with slight lag 45-genes-max-edges.xlsx: functional but appears with slight lag 50-genes-max-edges.xlsx: functional but appears with lag 52-genes-0-edges.xlsx: functional 52-genes-max-edges.xlsx: functional but appears with lag 55-genes-0-edges.xlsx: functional 55-genes-max-edges.xlsx: Ran for ~5 min but nothing appeared 65-genes-0-edges.xlsx: functional 70-genes-0-edges.xlsx: functional 80-genes-0-edges.xlsx: functional 90-genes-0-edges.xlsx: functional 100-genes-0-edges.xlsx: functional 110-genes-0-edges.xlsx: functional 120-genes-0-edges.xlsx: functional 130-genes-0-edges.xlsx: functional 140-genes-0-edges.xlsx: This spreadsheet returned a duplicate gene error, but there was not a duplicate of the gene mentioned in the error message (YSA1). regulation-matrix-documented-20140709-AIITF-all-targets.xlsx (129 nodes, 4154 edges): Ran for ~5 min but nothing appeared

52-genes-max-edges.xlsx holds the max number of edges that the server can handle.

It appears as if the last two files are just too big, and that trying to upload them stalls the entire server. In the case of such files, should there be an error message explaining that the file is too large?

dondi commented 9 years ago

Files that uncovered previously unanticipated errors to be wrapped into unit tests (which temporarily fail—red hat). Debugging can then be iterated until the tests succeed (green hat). Committing the tests will preserve our knowledge about this bug for future generations ;-)

dondi commented 9 years ago

https://www.youtube.com/watch?v=D9GQ9nBHhIc&feature=youtu.be&list=PL06ytJZ4Ak1quT7gKXcNKD3lj

anuvarsh commented 9 years ago

Unit tests for 10-genes-0-edges and 140-genes-0-edges were created. They both fail as of right now.

NAnguiano commented 9 years ago

For the ones returning duplicate gene errors: I two separate variables (i and j) for choosing which value of the array was the duplicate: one for source genes (i) and one for target genes (j). I accidentally used the exact same variable, i, for displaying which gene was the duplicate for both source and target. As a result, the source gene duplicates would show up correctly, but the target gene duplicates would always say they were whatever was in row i, which usually was the last gene in the list. That bug has been fixed, and the duplicates are now showing up correctly for both.

anuvarsh commented 9 years ago

The max number of edges that we can upload to GRNsight while maintaining (laggy) functionality is 53 nodes and 2704 edges. At 53n and 2705e, the console returns the following error: image @NAnguiano hypothesized that this is because the javascript failed because there are just too many edges.

In this case, almost every node has a max number of connections. Because it is unlikely that any actual data set would replicate this level of connectivity, it might be beneficial to see how many edges can be handled at a higher number of nodes. For example, it is possible that our users might try to upload a network with 150-200 nodes. If we could determine the max number of edges for a network of that size, we could more realistically create a node/edge maximum.

anuvarsh commented 9 years ago

140-genes-0-edges.xlsx is now a working file. @NAnguiano found that the variables for source vs. target were switched, so the error message was returning the wrong gene name for the duplicate. Once this was fixed, it was found that MSI1 was actually a duplicate gene in the file, and the error message returned that gene name as it should.

10-genes-0-edges.xlsx is now working as well. More details on this fix in #189.

anuvarsh commented 9 years ago

With the fixes that @NAnguiano made for the files in #189 and for the files in this issue, 53-genes-2705-edges.xlsx now uploads with a lag, so a new number of max edges needs to be determined.

anuvarsh commented 9 years ago

Currently, all of the files load up except regulation-matrix-documented-20140709-AIITF-all-targets.xlsx, including 150-genes-max-edges.

When updating the spreadsheet controller to make the files in this issue and issue #189 upload without errors, @NAnguiano changed some of the code, and as a side effect, some of these extremely large files are now loading up without problem. The spreadsheet controller looks at the whole file because there is no way to define what is part of the matrix and what is not, so the controller just looks at all of the rows and columns where the is data. When there is data associated with a source AND a target gene, the data appears as an edge on the network, whereas any data not associated with both a source and target gene is treated as extraneous data (and appears in our new warning system). Not completely sure how this change affected the website's ability to handle larger files quite yet.

NAnguiano commented 9 years ago

Considering it's graph.js that controls the drawing of the graphs, I'm genuinely not sure why beta is operating better than master does. I haven't changed graph.js in any significant way (I've simply added a little jQuery that adds the warnings modal text), so they should theoretically be working the same way. I wonder if this may be something to do with the beta server vs the master server, as the spreadsheet controller theoretically shouldn't have anything to do with the inability for the graphs to be drawn...

NAnguiano commented 9 years ago

Annnd I found exactly what causes it. There is actually physically something wrong with the 53-genes-max-edges file that is not being caught by the master server.

Back before I did all the shenanigans with adding in the checks for NaN's and undefineds, I found a seeming random little bug that was a copy-paste error. When error checking, all genes are made upper case in order to make it so that case doesn't matter. However, there was one spot where I'd failed to make the genes upper case. I fixed this bug and moved on with life.

The 53-genes-max-edges file does not work on beta before this bug is fixed. However, it does work immediately after it's fixed. As a result, I think what we were seeing was not an upper limit, but rather finding the one gene that managed to not get it's case changed. Since this gene doesn't have any associated data, it's effectively undefined, rendering the js unable to work with it, and in thus causing the graph drawing to fail.

Edit: To further prove this point, I tried to upload the 150-genes-max-edges spreadsheet to master. It's slow, but it loads just fine. This further confirms that the file simply had a bug that stopped it from working at that point, and it resolved itself once the bug was fixed.

kdahlquist commented 9 years ago

As @anuvarsh and I discussed this morning, we already know that GRNsight can display a larger number of nodes and edges than it is practical to display in our bounding box. So, yes, we can display large networks, but once you go over a certain number of nodes and edges, it's not really worth visualizing with GRNsight.

So, I propose that @anuvarsh simply figure out what is the maximum number of nodes that can be evenly spread out in our bounding box and we'll call that our upper limit for the number of nodes and the max edges for those nodes can be the max edges.

We'll actually set two limits:

For example if the first limit was 50 nodes and 100 edges, the absolute limit might be 75 nodes and 150 edges, or something like this.

@anuvarsh please comment on what you've found out regarding this.

I'm also changing the name of this issue to better reflect what it is talking about.

anuvarsh commented 9 years ago

I think it would be fair to put a soft max at about 40-50 genes and 80-100 edges, and a hard limit at about 70 genes and 150-200 edges.

At 40 genes, the screen looks something like this: image

At 50 genes: image

At 60 genes: image

And at 80 genes: image In this graph, I tried to make 7 columns of 10 genes each, but even without the extra genes it was too much.

For reference, I created a couple networks with 34 genes to see how readable graphs could be with a varying number of edges. 34 genes with 90 edges: image

34 genes with 65 edges: image

34 genes with 40 edges: image

The graphs weren't completely clear until 34 genes and 40 edges, though that could be due to poor organization. Either way, pushing the soft max past 50 won't lend to coherent graphs unless there are a minimal number of connections.

anuvarsh commented 9 years ago

After talking to @kdahlquist, we decided that in documentation we would recommend 35 nodes and 70 edges, warn users that their networks are too large at 50 nodes and 100 edges, and then return an error and not produce a graph at 75 nodes and 150 edges. These limits will apply if either nodes or edges surpasses the limits that we create.

This means that there cannot be more than 35, 50 or 75 unique nodes that participate in the network at any given time. Considering that we now allow asymmetrical networks, we need to count the number of unique labels as opposed to the number of rows or columns.

At the point we need to:

anuvarsh commented 9 years ago

@dondi, I started work on putting in an error message for large networks, but I'm not sure where to put this down in the spreadsheet-controller? I already have the error in the error list, but I'm not sure where to actually put the if statement in.

dondi commented 9 years ago

I don’t know that part of the code cold, but I would start at the point where the file is parsed up then follow the code to where the nodes get pulled out of the parsed object. That is where you will know how many there are, and thus that is where you can insert an if statement that adds the error message. Tallying up of edges probably happens in the same general vicinity.

anuvarsh commented 9 years ago

Warning and error messages are both live on beta.

anuvarsh commented 9 years ago

On the Documentations page, under Input Spreadsheet, the second to last bullet point used to say: "Note that GRNsight is designed to visualize “medium-scale” gene regulatory networks, not the entire gene regulatory network for an organism. Currently, the bounding box for display of the graph has a fixed size. Adjacency matrices with a large number of nodes or edges may not load, or if they do, may just look like “spaghetti”."

But I updated it to say: "Note that GRNsight is designed to visualize “medium-scale” gene regulatory networks, not the entire gene regulatory network for an organism. Currently, it is recommended that you upload networks with no more than 35 unique genes or 150 edges."

kdahlquist commented 9 years ago

Please replace the text you have with the following:

Note that GRNsight is designed to visualize small- to medium-scale gene regulatory networks, not the entire gene regulatory network for an organism. Currently, the bounding box for display of the graph has a fixed size. Adjacency matrices with too large a number of nodes or edges will just look like “spaghetti". We recommend that you visualize networks with no more than 35 unique genes and no more than 70 edges. If you upload a file with an adjacency matrix with 50 unique genes and/or 100 edges, you will get a warning, and the graph will still display. If you upload a file with 75 or more unique genes and/or 150 edges, you will get an error and the graph will not display.

dondi commented 9 years ago

Sheets are present and unit tests have been coded up.