Read species name/taxon ID from GRNmap file and load into GRNstate

kdahlquist commented 5 years ago

This task is being separated from #698. GRNsight needs to read the species name and/or taxon ID from a GRNmap file and load it into GRNstate.

@kdahlquist needs to do a quick test to make sure that this won't mess up GRNmap. Then she will provide a sample file (probably will alter demos 1 and 2.

Note that we need to tackle other issues before moving on to this one.

dondi commented 5 years ago

@mihirsamdarshi will move to this after finishing #475 and #798.

mihirsamdarshi commented 5 years ago

@kdahlquist do you have an example file for me to use while working on this?

kdahlquist commented 5 years ago

I've made one quickly, but haven't had a chance to test it with GRNmap yet. I added a row for "species" in the "optimization_parameters" sheet and put "Saccharomyces cerevisiae" in there. I added a row called "taxon_id" and put 559292 in the field. Note that on a previous issue, we discussed that for yeast, Uniprot expects taxon ID 559292 for yeast, while JASPAR expects taxon ID 4932 for yeast and @jlopez616 did something to resolve this, but I don't remember what. I think we will need to discuss this further at the meeting tomorrow and yeast might not be the best species to use to test this.

15-genes_28-edges_db5_Dahlquist-data_input_with-species.xlsx

mihirsamdarshi commented 5 years ago

we actually parse the entire sheet, and add it to the network metadata, so this was relatively simple to add, and is a part of the #796 PR

kdahlquist commented 5 years ago

The complicated part is the testing. I'm torn as to whether to require the user to just look up the taxon ID or to let them use the genus/species name. There is always the problem of misspelling or choosing the wrong ID.

Kevboe commented 4 years ago

I spent a lot of time going through the code, and documenting the pieces that I need with both comments and more info with notes on my iPad. I created an object to serve as the name/taxon/id list. It's similar to a dictionary with a key: value where the key is the simple english name and the value is another object "dictionary" where the key is variable name and the value is the respective value for the var name and english name. What I have done to work around the issue with Jaspar and Uniprot to have two different variables, which can be the same or different depending on what is needed. I also wrote a method for reading in and setting the values for the GRNstate. It runs for both the taxon id or species name, which allows for some user error in the input from the workbook. The workbook can also be expanded to allow for a Jaspar and Uniprot id. This is where tests will need to be written and I will need help doing that. I also ran into a problem with the current test. I can leave the GRNstate vars set to yeast to allow the tests to pass, but I don't think this is proper technique. What I want to discuss is how we should update our tests and possible sample data when we have the ability to read in the species. Does it become mandatory, or a feature that may or may not be used. This will all change how the tests are written and handled. Currently, using the sample data from this thread, I can say that my method for setting worked.

kdahlquist commented 4 years ago

To answer one question, uploading the species name/id is not mandatory.

One thing we talked about was the form of the species name that is supplied by the user in the GRNmap workbook. Since the species name has a space character in between the genus and species name and the API queries fill this space with an underscore, we talked about where the underscore needs to be added.

@dondi suggests that the conversion be done at the point we call the API, in case we encounter API's that want a different format.

kdahlquist commented 4 years ago

Let's create an info box right about the layout box on the left hand side of the interface so there can be a passive indicator to the user which species is in use. We talk about when to actively notify the user later.

Discussed having a "Data" menu item for both @afiller1 and @Kevboe 's features, but it is something that is accessed more rarely so it should not take up prime real estate.

For now, let's stick with defaulting to yeast initially.

dondi commented 4 years ago

@Kevboe please aim to issue a pull request showing this change by Monday. As a stretch goal, you can explore a possible UI treatment for how the user might change that value. (e.g., a modal dialog where the user just enters the species name and taxon ID, for starters) We can iterate over this design next.

dondi commented 4 years ago

@Kevboe Merge beta into your branch if you haven't done so recently; a few fixes have landed there which you may find helpful.

dondi commented 4 years ago

@Kevboe If you do accomplish the stretch goal, this will allow us to verify whether the current grnState species and taxon are being correctly passed along to the gene page APIs.

dondi commented 4 years ago

@Kevboe For testing, you will want to follow the pattern where:

You prepare a minimal GRNmap workbook file with the species information (possibly at least 2, one that has the info, and another that doesn't, so that you can test default behavior)
Write a test that opens this file, then checks the resulting GRNstate value to see if it received the correct species information
You can pattern this test on something like this: https://github.com/dondi/GRNsight/blob/beta/test/errors-adjacency-matrix-modifications.js —there are some helper functions involved here that you may also need to read through

Kevboe commented 4 years ago

I was able to accomplish the goal of having a menu on the sidebar that both displays the identified species' Latin name, as well have an indicator of whether the species was able to be identified. Along with the display I added a drop down menu that lets the user select a species if the identified one was incorrect, or just to change it if they want to. The code I wrote works, but I feel that it could be more dynamic than it currently is, and think that I will rewrite it next week. The menu is also not aesthetically pleasing, but I think that this can be solved with some simple CSS changes. The biggest issue that I ran into was merging beta into my working menu branch. I resolved all merge conflicts, but when I went back to test my code I was running into errors launching it. The errors seemed to come from code which I had not changed. Once the errors are resolved, and the visuals are up to standard I will be ready for the pull request.

Kevboe commented 4 years ago

There is a commit on my menu branch with a working version of the menu pre beta merge, so you can see what work was done, and how it works.

dondi commented 4 years ago

I’m seeing mostly “Module not found” errors there, which leads me to think that a fresh npm install will do the trick. If a plain npm install doesn’t do it, exercise the “nuclear option” and delete node_modules first, and then npm install.

The permission error up top is less clear. I’ll have to think of that one a bit more.

Kevboe commented 4 years ago

Ran the npm install, here is what I have now. We can talk about this tomorrow.

dondi commented 4 years ago

The follow-up error shown indicates that npm install did work, but now the issue is an unresolved merge conflict. @Kevboe was given some tips on how to ensure that all merge conflicts are resolved:

Monitor git diff to make sure all conflict markers have been resolved
Search across files for >>>>> since that occurs pretty much only when git has marked a conflict

Goals @Kevboe for the next two weeks:

Resolve all conflicts so that current work can be issued a PR which @dondi can place in the beta branch
Write some tests for code written where feasible
See how much further you can go with the follow-up issues along this track
Collaborate with @afiller1 on UGRS presentation slides for a review and dry-run on the March 16 meeting

kdahlquist commented 4 years ago

I have reviewed beta 4.0.12 and have the following comments:

The species identification is not actually working right now. I'm not sure if it is meant to be, but after I load a demo and right click on a gene, the species on the gene page is Yersinia enterocolitica (type O:3) and I thought we were going to default to Saccharomyces cerevisiae.
Now that I see it, I don't think the right position for the species information is the top left. I think we can create it as a collapsible section below the view menu. It would be a section called "Species". and would open up when someone clicked on it.
Inside the section, there would be a drop down menu that looks like the one for node coloring that has the species options. It would say "Species" above the drop-down menu in the same font size as used for "Top Dataset". What species is showing in the drop-down would be the one that was selected. Then we would not need to have the message "Species Identified" there.
The species names should not be centered.
The binomial nomenclature for species is actually two names, the first of which is the genus and the second is the species name. The genus name is capitalized and the species name is not. So the proper way to write yeast is "Saccharomyces cerevisiae" with cerevisiae with a lower-case "c". These names should also be italicized, but that is less important if it's hard to do.
Let's put the names in alphabetical order.
Homo sapiens is missing an "s" at the end of "sapiens".

Kevboe commented 4 years ago

All of the proposed changes to the design have been made. The species menu is now in line with the other menus. It is a collapsible section titled Species and currently only populated with a single species selection drop down menu. The menu allows the user to select the species of the spreadsheet they uploaded. The drop down menu also updates to reflect the species that the program reads in, if it can. I also made the grammatical changes. Finally, I started to add my changes to the GRNstate to the pass through for the gene pages. I have only started, but from what I can tell it looks like I was able to have all the GRNstate data port through, and even thought I see an error in the console, it looks like all of the APIs are correctly queried and return data. All this work is on the same branch that was merged.

dondi commented 4 years ago

User interface looks good upon initial review but we can go deeper after it goes on beta.

^^^^^ To get this on beta, we need a pull request so that this can be merged into beta for review

Next steps given this progress:

Ensure that species name matches what the APIs expect; for example, "S. cerevisiae" should have an underscore instead of a space
Check that the API code now uses the new taxon* parameter names; master version currently uses just taxon (no suffix)
The newly-discovered NCBI rate limit will be a separate issue

kdahlquist commented 4 years ago

Here is some feedback on the UI currently on beta (v4.0.13):

Instead of saying "Identified Species", it should say "Current Species"
Please remove the colon after "Species"
Please Bold "Current Species"
We are planning on defaulting to Saccharomyces cerevisiae, so that one should appear in the drop-down menu as the default, unless something else is chosen (via the uploaded workbook or otherwise)
We need to mirror this panel in the menu options in the menu bar. Since @afiller1's stuff is going to go in the Node menu, then we should add a "Species" menu in between the View and Help menus.

Kevboe commented 4 years ago

All suggested changes have been made, the pull requests have been issued and accepted. It was more work than I initially expected to add the mirrored panel to the top, but I think it turned out phenomenal, and is both visually pleasing and very functional. In addition to that I was able to go through my code, and,with some help from @dondi for the regexes, fix the API code. From what I can see I was able to successfully pass in all the variables and all of the APIs are queried properly.

dondi commented 4 years ago

Now that UI is mostly settled (with just one font change request to make it more consistent with other sidebar panels) and the API calls restored, the next immediate step for @Kevboe is to go through a cycle of real-world testing, particularly factoring in that the species might now change.

Logic for switching from YeastMine to other InterMine databases will also need to be done next, with the switch being based on the chosen species. The content in these database is expected to vary depending on the species (see issue #724).

dondi commented 4 years ago

@Kevboe should look up a small number genes from species other than S. cerevisiae and create test workbooks with those genes and test to make sure that calls involving other species now still work. The general theme is overall testing of new functionality to make sure things work.

The error message for missing/incorrect parameters should also be updated to cover the possibility that the wrong species is being queried.

kdahlquist commented 4 years ago

The current error message is this:

No gene information was retrieved for YHP1.

This could have happened because either GRNsight could not access the gene information from one of the source databases or because no information exists for the gene in the source databases.

_You can check back later to see if gene information can be retrieved or submit an issue to https://github.com/dondi/GRNsight.__

Please change it to this:

No gene information was retrieved for YHP1.

This could have happened because:

The wrong species is selected, or
GRNsight could not access the gene information from one of the source databases, or
No information exists for the gene in the source databases.

You can check back later to see if gene information can be retrieved or submit an issue to https://github.com/dondi/GRNsight._

Kevboe commented 4 years ago

Made the changes that Dr. Dahlquist asked and issued a PR to beta. Also worked on and presented for the URS last week which (I think) went well. For the creation of small species specific networks I ran into issues that I need help with, and was not sure how to move on. When I was creating the networks, and uploading them I was getting errors from, but only sometimes. I couldn't tell if it was a problem with the code, or that the gene that I picked didn't exist in all databases. I believe the problem is with the data, which needs to be verified as existing, but it could be an issue with the code.

kdahlquist commented 4 years ago

Verified that the new error message is displaying in beta 4.0.16

This needs to be tested with all six species that we support.

I.e., you need to create six input workbooks that have what is expected for each species.

Then you need to handle the issue when a user uploads a species that we don't expect.

kdahlquist commented 4 years ago

Cases:

Six worbooks for Arabidopsis, fly, nematode, yeast, mouse, human
workbook with no species data gives a warning that no species data was detected and that GRNsight defaults to yeast
situation where something is there, but is not readable

dondi commented 4 years ago

When all sheets are finished and manually verifiable, they can be added to our test set and unit tests can be written around them.

kdahlquist commented 4 years ago

Warnings to user:

"No species information was detected in your input file. GRNsight defaults to Saccharomyces cerevisiae. You can change the species selection in the Species menu or panel."

"GRNsight detected the species , in your input file. This is not one of the supported species. You can change the species selection in the Species menu or panel."

For now, let's not get fancy about detecting species we don't support. The user can figure it out if they made a typo or something.

We need to document on the Documents webpage what we expect the formatting of the species information to be in an uploaded workbook.

kdahlquist commented 4 years ago

These are the species/taxon pairings that need to be tested.

Homo sapiens, taxon ID 9606 (human)
Drosophila melanogaster, taxon ID 7227 (fruit fly)
Caenorhabditis elegans, taxon ID 6239 (nematode worm)
Mus musculus, taxon ID 10090 (house mouse)
Arabidopsis thaliana, taxon ID 3702 (thale cress)

Kevboe commented 4 years ago

Error message has been made, and the test file have been created. I was able to manually test, bugfix, and verify it works, but did not make tests yet. I have issued a pull request to beta with these changes.

kdahlquist commented 1 year ago

backlogging all gene page issues

dondi / GRNsight

Read species name/taxon ID from GRNmap file and load into GRNstate #720