adkinsrs commented 2 years ago

This is a continuation of https://github.com/nemoarchive/analytics/issues/104 and https://github.com/nemoarchive/analytics/issues/111

Tagging carlocolantuoni to notify him of the new ticket, and I will close the existing tickets

Will make some subtasks for each of the diagrammed areas in the screenshot.

Currently a basic demo exists on umgear.org/projection.html but it only works for specific datasets. But we have the display and gene list architecture in place to integrate this into the front page/gene results page.

jorvis commented 2 years ago

Here's the recording from our overview meeting on Friday:

https://zoom.us/rec/share/sTRLpBzpVTR_Oa_1wslLL2XwozGWqPFqViomY5tpbfWbxLWHJQCDCMMOWVMxQ8Q.8cf1SuBuoyQ336iN

Passcode: #v=Db34c

adkinsrs commented 2 years ago

ProjectR scratch notes

NeMO Archive

Holds the weighted gene lists
- Can grab some from other DB sources (believe this is lower priority)
Holds tab_counts files (ROWmeta, COLmeta, DataMTX) which can be converted to H5AD
- Should we still bundle and transfer to NeMO Analytics

NeMO Analytics

Keep in mind this needs to ideally work on any gEAR portal

API

Run ProjectR
- Loading input (all dataframe rows are genes)
  - Past analyses within datasets
  - Col - PCA, tSNE, UMAP
  - Stored in adata.obs if user uploaded or adata.obsm with "X_" prefix if performed in sc-RNAseq analysis workbench
  - ? Are there other keywords to look for? Should these be selectable before calling the API (to display a specific combo)?
  - ? Does projectR even need to be run? All stored PCA/tSNE/UMAP values are with respect to the observations, like in COLmeta_DIMRED files
  - Pattern repository
  - Col - PCs, Patterns, etc.
  - Currently I have patterns saved at /var/www/patterns
  - Weighted Gene carts
  - Col - Unnamed Weights
  - These are stored in the MySQL DB
- Target dataset Input
  - Genes as rows
  - Observations (adata.obs) as cols
- Projection patterns output dataframe
  - Rows are patterns
  - Cols are observations
  - Use adata.obs to tack on conditions, then can use those various conditions as params in the plot
Create plots
- ? What type of plots (scatter, heatmap, etc.)?
  - ? Single-pattern
  - ? Multi-pattern
- ? Run through API calls or use lib/gear code? Will I need to refactor?

Gene aliases

MySQL db table entry (or revival of previous one)
Needed for cross-species projection
Also useful for GCID work

? Projection curation page

Currently building API code on projection.html but will this page be fleshed out for actual saved curations?
This could get complicated if single- and multi-pattern curations have to be considered, especially if they have to go through the plotly_data and multigene_dash_data.py API calls
Save these to database
- Save pattern source (dataset/pattern/genecart)
- ? Do we save the projection pattern dataframe in the target dataset's AnnData file for future quick-reference?
- ? Can we toggle these configs back to "gene" mode (where genes are subbed for patterns in the config)

Front page displays

Toggle to switch between "genes" mode and "projection" mode
? I guess we add the "loadings" source and pattern selection options from projection.html into the sidepanel on this page?
I'm guessing we would use the same default display configs for each dataset but substitute patterns for genes
- This only applies if a default display exists, and the config is a plotly-based config (though I guess we can use Scanpy for tSNE/UMAP... just more complex)
Separate displays for single-pattern and multi-pattern
This UI may be in flux if the page gets redesigned

Compare tool (lower priority)

Need more information here

Crons

Perform Dimension Reduction under various analyses
- PCA
- CoGAPS
? Does this even belong here... should it go on NeMO Archive instead?

adkinsrs commented 2 years ago

Talking points 3/23/22

We do not need to worry about making "projectR" into its own curation page to curate datasets specifically for projectR use. Taking the existing plot configurations and subbing in patterns for genes is more than enough.
Eventually, we should be pulling patterns and probably some weighted gene carts from a central source, probably hosted on the IGS servers in the NeMO Archive project space. May have to chat with @victor73 about potential issues, or the possibility of syncing these up to a GCP bucket.
I can probably ignore the "existing analysis" button on the prototype page, for now.

Next Steps

Modify the plotting API calls to bypass gene validation if the scope of the plotting call is for "projections" and not "genes"
Have a working prototype of the gene search display page where the user can select a "projections" tab, select a pattern source and number of patterns in place of genes, and it plots using the same display curations available
Modify projectR API call to determine if pattern source is PCA, CoGAPS, etc, and transform "loadings" option in projectR function to be that type. This way, projectR runs these projections correctly.
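
A rough sketch of the first step above (skipping gene validation when the plotting scope is "projections"). The function name, arguments, and error text are hypothetical and are not taken from the existing plotting API code.

```python
def validate_plot_request(scope, features, dataset_genes):
    """Return the features to plot, validating them only in "genes" scope.

    Hypothetical helper: `features` are gene symbols or pattern names,
    `dataset_genes` are the gene symbols present in the target dataset.
    """
    if scope == "projections":
        # Pattern names (PC1, Pattern_3, ...) are not genes, so skip the lookup.
        return list(features)
    if scope != "genes":
        raise ValueError(f"Unknown scope: {scope!r}")

    known = {gene.lower() for gene in dataset_genes}
    missing = [f for f in features if f.lower() not in known]
    if missing:
        raise ValueError(f"Genes not found in dataset: {', '.join(missing)}")
    return list(features)
```

With this shape the same display config can be reused for both modes, and only the validation step differs.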

adkinsrs commented 2 years ago

Talking points 3/30/22

Prototypes of the pattern results displays are done and look good, though they are still buggy.
Had a discussion about saving projectR output. I currently have the projectR projection patterns output saved at /tmp/_.csv but we were weighing the merits of keeping it as a physical file or adding output to the "uns" (or other structure) within the dataset's h5ad file. I expressed concern that if enough output from different pattern source projections was included in the file, it could bloat the h5ad file and make it slow to use (even for users who do not care about projections). Of course I cannot verify this claim. Maybe @jorvis can weigh in.
Discussed when to begin linking to other pattern sources, such as those in NeMO Archive. Felt now is not the right time.... let's get things working internally first
Discussed how to infer if pattern is PCA or CoGAPS.
- One suggestion was to have users indicate the inference based on a box in the search results display (in "projections" tab mode), but the users may not know this information.
- On the flip side it may be difficult to infer from all pattern sources unless some metadata was provided (like the DIMREDmeta file from Carlo's data).
- Could be possible to store in the database, but that doesn't feel right if the data is coming from outside the gEAR server. And if the plan is to have all gEAR portal flavors access the same patterns, then this database info would have to be synced to all of them, which is probably a no-go.

Next Steps

Flesh out state history code so that if projection info is in URL, it auto searches projections instead of genes.
@carlocolantuoni is going to provide me with a NeMO dataset that I will load locally and test validity of the projectR projections and plots
Fix bugs in general

adkinsrs commented 2 years ago

Talking points 5/4/22

Working on getting weighted gene carts operational. ProjectR works, and can generate projection plots.
- However, loading a weighted gene cart source by URL is not working. Loading a projection source from URL works, but uses a different scope (which lets me know where the tabfiles are stored)
Discussed making the focus more on the weighted gene lists than the pattern sources.
- Weighted gene lists, upon creation, require organism to be specified, which can help with mapping gene symbols to those in a dataset provided the organisms match... eventually for orthologues too.
- Perhaps use the description of the gene list as information to help the user decide what list to use in the JSTree, or add it as an information div.
- Would like to move away from storing projectR output in a file located in temp. Initially discussed adding projectR output to the "uns" section of the dataset's h5ad file. But it also may make sense to just store the projections in a "projection" directory with dataset subdirectories.
Need some sort of indicator (loading symbol, warning, etc) that the projection will have to be created if projectR has not run on this combination of patterns and dataset before (no saved output file). May require having another CGI script to check for this, and maybe we pass that file (if exists) as input to the projectR API call instead of checking in the API call.

carlocolantuoni commented 2 years ago

some detail about saving the projections: maybe we could have a log file in the same "projection" dir that lists the details of every projection ever done and where it is so we know whats there, and if a newly requested one has to be run or just loaded.

adkinsrs commented 2 years ago

some detail about saving the projections: maybe we could have a log file in the same "projection" dir that lists the details of every projection ever done and where it is so we know whats there, and if a newly requested one has to be run or just loaded.

May not be necessary to have a log of all the previous runs. Currently I just check to see if the expected output file exists on the filesystem. If it doesn't, projectR will run fully and the output will be saved in that location. If it does, we just use that file. With the dataset being in the directory path and the pattern source or gene cart share ID being in the filename, it's easy to determine if that combo has been run in projectR yet.
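
A minimal sketch of that check, assuming a layout of <root>/<dataset_id>/<source_id>.csv. The function names and the `run_projection` callable (which stands in for the real projectR call and returns CSV text) are hypothetical.

```python
from pathlib import Path


def projection_output_path(root, dataset_id, source_id):
    """Dataset goes in the directory path, pattern source or cart share ID in the filename."""
    return Path(root) / dataset_id / f"{source_id}.csv"


def get_or_run_projection(root, dataset_id, source_id, run_projection):
    """Return (output_path, ran), where `ran` says whether projectR had to run."""
    output = projection_output_path(root, dataset_id, source_id)
    if output.exists():
        # This dataset/source combination was projected before: reuse the file.
        return output, False

    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(run_projection())
    return output, True
```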

carlocolantuoni commented 2 years ago

If there is enough info in the filename that works, but we may want to have a good bit of info on how the projection was run, so not sure if that will be enough

adkinsrs commented 2 years ago

If there is enough info in the filename that works, but we may want to have a good bit of info on how the projection was run, so not sure if that will be enough

What other information do we need besides the name of the weights file, the dataset to project on, and if it requires the PCA algorithm in projectR? We can add all that information in the filename

jorvis commented 2 years ago

I'd avoid using the filename unless we are really sure we can elucidate all the current/future parameters we'll want to store, especially when it's so easy to just drop a JSON/YAML/whatever file next to it.

adkinsrs commented 2 years ago

That's a fair point. I was trying to minimize the amount of checking in order to assess if this particular dataset-projection combination has been run. I'll change it to check for a JSON file and read/write to that, in addition to creating the projection output file (which will use a UUID filename).

jorvis commented 2 years ago

Hmm, it's an interesting use case for sure, and the os.path.exists() call is a lot less work than reading in a series of config files. I'll leave it to you to decide. It just gets tricky if parameters are ever added later.

adkinsrs commented 2 years ago

I was thinking of having one "projections.json" file per dataset, and this single JSON file would store a dict of lists of configurations, where the pattern source would be the key. Each configuration would include the UUID filename and any set parameters, like whether the PCA algorithm was used in projectR. That way, we only need two os.path.exists() calls at most (projections.json, and the UUID output file). The only potential downside here is information retrieval speed if the JSON got huge.

{ 
  <pattern_source>: [
    configuration options dict * N configs
    ],
  <pattern_source 2>: [
    configuration_options dict * N configs
    ]
}
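
To make that structure concrete, here is a small sketch of reading, searching, and appending to such a file. The key names ("uuid", "is_pca") and the helper names are assumptions for illustration only.

```python
import json
import uuid
from pathlib import Path


def load_projections(json_path):
    """Return the dict of pattern_source -> list of config dicts (empty if no file)."""
    path = Path(json_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def find_projection(projections, pattern_source, **params):
    """Return the first stored config for this source whose parameters all match."""
    for config in projections.get(pattern_source, []):
        if all(config.get(key) == value for key, value in params.items()):
            return config
    return None


def add_projection(json_path, pattern_source, **params):
    """Append a new config with a fresh UUID (used as the output filename stem)."""
    projections = load_projections(json_path)
    config = {"uuid": str(uuid.uuid4()), **params}
    projections.setdefault(pattern_source, []).append(config)
    Path(json_path).write_text(json.dumps(projections, indent=2))
    return config
```

Because new parameters just become extra keys in each config dict, adding one later does not change the filenames.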

jorvis commented 2 years ago

That seems safe. If doing projections on any individual datasets gets so popular that we have to worry about the size of that projections.json file we'll probably have to be refactoring anyway.

adkinsrs commented 2 years ago

Talking points 5/11/22

See image for things to improve for the coming week.

Other talking points

Carlo feels it's fine to not be able to access projections from the front page (search genes, then click projections tab while in results page). My gut feeling is that people will want to avoid that 2-step process but I'm content with leaving as-is for now
I believe that @jorvis will be working on the ortholog stuff, so for the time being projecting and plotting will only work within the same organism. Once that is done, I can write code to do interspecies checks.

Future steps

Put on gear-devel or nemoanalytics-devel (along with patterns) for Carlo to test. Can only test with human data given the patterns right now.
Eventually would like to save weighted gene carts from a couple of tools:
- Genes vs LogFC using the compare tool
- Genes vs PCs using the PCA output from the scRNA-seq analysis workbench.
Can add some buttons to save those gene carts (I guess for the current user).

adkinsrs commented 2 years ago

Regarding the "clear old search results" comment in the image, it is worth noting that they are cleared much, much faster on the servers than on my Docker instance. But there is a brief moment, particularly if you have to run projectR, where having the old results displayed can generate confusion

adkinsrs commented 2 years ago

May 25, 2022 Notes

ProjectR works on nemo-devel with some minor things that do not plot for various reasons
I recently realized that uploaded weighted gene carts are saved in the same format they are uploaded in (excel, csv, tab). However I sort-of made the assumption that they would be in tab format. Need to edit save_new_genecart_form.cgi to convert the input format to tab format when saved in the "carts" directory.
Another step to work on is to attempt to map gene symbols to ensembl IDs if from the same species. We can make use of some of the "search_genes.py" code to accomplish this. It is fine and perhaps expected that not every gene symbol from a weighted cart would map to an ensembl ID from a dataset (or to the mysql database).
- Potentially when uploading the weighted gene cart, we can add an extra column and provide a mapping column in the saved cart as opposed to mapping when running projectR. I need to weigh out the pros and cons of implementing one vs the other.
- Another idea was to save mappings either in the projections.json or somewhere related, for quicker access.
Need to rope in @jorvis to discuss eventual cross-species ortholog mapping.
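
A minimal sketch of the same-species symbol mapping described above, assuming a plain symbol-to-ID lookup is already available (it is not the "search_genes.py" code, and case-insensitive matching is my assumption). Unmapped symbols are returned rather than treated as errors, since not every symbol is expected to map.

```python
def map_symbols_to_ids(cart_symbols, symbol_to_id):
    """Map gene symbols from a weighted cart to dataset identifiers.

    Returns (mapped, unmapped): a dict of symbol -> identifier, and a list
    of the symbols that had no match.
    """
    lookup = {symbol.lower(): ens_id for symbol, ens_id in symbol_to_id.items()}
    mapped = {}
    unmapped = []
    for symbol in cart_symbols:
        ens_id = lookup.get(symbol.lower())
        if ens_id is None:
            unmapped.append(symbol)
        else:
            mapped[symbol] = ens_id
    return mapped, unmapped


# "ID_A" and "ID_B" are placeholders, not real Ensembl IDs.
mapped, unmapped = map_symbols_to_ids(
    ["SOX2", "pax6", "NOT_A_GENE"], {"Sox2": "ID_A", "PAX6": "ID_B"}
)
```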

Next meeting will be on June 15 due to various unavailabilities between @carlocolantuoni and myself.

adkinsrs commented 2 years ago

@jorvis and I had a brief Slack chat, and feel it's best to require any uploaded weighted gene cart to have both a column of identifiers and a column of gene symbols. Any weighted gene carts without both columns present will not be accepted. This would also help with cross-species mapping, since it can be based off of the identifier. I think the identifier should be the first column, the gene symbol should be the second column, and any subsequent columns can be the weights. Any previously uploaded gene carts can be modified with existing scripts.

adkinsrs commented 2 years ago

Note to myself - I will need to modify the "save weighted gene cart" processes in the compare tool and analysis workbench to account for the identifier/gene_sym column pair.

carlocolantuoni commented 2 years ago

Sounds good

adkinsrs commented 2 years ago

June 15, 2022 Notes

Action Items

@carlocolantuoni reports that things previously working in nemo-devel do not work now. Maybe I updated some code by mistake? Anyways I am going to roll out my most recent changes (since projections work on my Docker instance), and see if the nemo-devel ones are still broken.
- @carlocolantuoni, can you generate some 2-column weighted gene carts (uniq IDs, and gene symbols) for me to test with? The current ones on nemo-devel are just a single column and weights. @jorvis mentioned previously he has scripts that can retrieve the Ensembl ID or the gene symbols
Would like to add @jorvis to next Wednesday's 10:30a (EDT) meeting to discuss cross-species ortholog mapping if he can attend.

Other

There was discussion on having a tool in the gene cart manager page to map the identifiers or the gene symbols, provided the other + the organism is known. This would give flexibility to users who may not have the ID-to-gene symbol mapping, or do not know how to acquire it. I am ok with the idea, but am a bit wary that what we generate may differ from what the user expects (perhaps we put up a warning message).

carlocolantuoni commented 2 years ago

yes will make some 2 col gene carts to upload

adkinsrs commented 2 years ago

Another suggestion by @carlocolantuoni

We currently have projections.json files in each dataset directory. We can also write reciprocal projections.json files on a per-genecart basis, where the dataset ID would be the key. We could also symlink the projection output file into that directory.
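
One way this reciprocal bookkeeping could look, as a sketch only. The "by_dataset" / "by_genecart" directory layout and the function names are assumptions, not the layout on the server.

```python
import json
import os
from pathlib import Path


def _append_config(json_path, key, config):
    """Append `config` to the list stored under `key` in a projections.json file."""
    path = Path(json_path)
    data = json.loads(path.read_text()) if path.exists() else {}
    data.setdefault(key, []).append(config)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2))


def record_projection(root, dataset_id, genecart_id, output_name, config):
    """Record one projection under both scopes and symlink the output file."""
    root = Path(root)
    dataset_dir = root / "by_dataset" / dataset_id
    genecart_dir = root / "by_genecart" / genecart_id

    # Dataset scope is keyed by gene cart; gene cart scope is keyed by dataset.
    _append_config(dataset_dir / "projections.json", genecart_id, config)
    _append_config(genecart_dir / "projections.json", dataset_id, config)

    # The real output lives in the dataset directory; the cart directory links to it.
    link = genecart_dir / output_name
    if not link.is_symlink() and not link.exists():
        os.symlink((dataset_dir / output_name).resolve(), link)
    return link
```

The output file is stored once, so the two scopes cannot drift apart in content, only in their JSON entries.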

carlocolantuoni commented 2 years ago

shaun - i will upload new gene carts with 2 cols of gene ids: Ensembl IDs and gene symbols. what order do these need to be in (Ensembl ID then symbol, or reverse?) and do they need specific header names?

carlocolantuoni commented 2 years ago

i was going to upload gene carts to nemo devel but im getting "This site can’t be reached" - is that where i should upload them - and is nemo devel working?

adkinsrs commented 2 years ago

@carlocolantuoni I'll take a look at that today. Other than setting that one profile you use to public, I have not touched the codebase on nemo-devel.

adkinsrs commented 2 years ago

@carlocolantuoni It seemed like the VM was hosed (I couldn't SSH into it), so I stopped it and started it back up. Site seems to be working now.

I also updated the codebase on nemo-devel as well. The previous code base was commit 2587998.

In addition, since I reworked the directories where the projections.json files (and projection output CSVs) are stored, I wiped the existing projections.json files. So any previous projection will need to be run again.

carlocolantuoni commented 2 years ago

Ok, i will upload the new gene carts shortly, can u remind me of column order and column headers needed for the new 2 id format?

adkinsrs commented 2 years ago

Col 1 - Unique identifier
Col 2 - Gene symbol
Col 3+ - Numeric weights

Headers are required.
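
For reference, a sketch of a parser that enforces this layout on a tab- or comma-delimited file. It is not the code in save_new_genecart_form.cgi, and the function name and returned structure are hypothetical.

```python
import csv
import io


def parse_weighted_cart(text, delimiter="\t"):
    """Parse a weighted gene cart: identifier, gene symbol, then numeric weights.

    Returns (weight_names, genes). The first row is always treated as the header.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        raise ValueError("Empty file")

    header, body = rows[0], rows[1:]
    if len(header) < 3:
        raise ValueError("Need identifier, gene symbol, and at least one weight column")

    genes = []
    for line_number, row in enumerate(body, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"Line {line_number}: expected {len(header)} columns, got {len(row)}"
            )
        try:
            weights = [float(value) for value in row[2:]]
        except ValueError:
            raise ValueError(f"Line {line_number}: weights must be numeric") from None
        genes.append({"id": row[0], "gene_symbol": row[1], "weights": weights})
    return header[2:], genes
```

An old single-ID cart (gene column plus weights) fails the column-count or numeric check instead of being silently misread.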

carlocolantuoni commented 2 years ago

i just uploaded 3 new gene carts - all with titles in format "xxxxx.2idCols". doesnt look like any are working for projection, but the old carts are also now not working (they were partially working previously).

you might want to check this: when i "preview" the genecarts they retain a blank row with the header info of the 1st column (i.e. ensembl gene ID / unique ID) and has dropped the gene symbol. so i think we need to retain both ids and correct the import of the header row.

also, the files i uploaded they were all .tab, does look like .csv works - lets make both work or let users know what format they need to be in

adkinsrs commented 2 years ago

i just uploaded 3 new gene carts - all with titles in format "xxxxx.2idCols". doesnt look like any are working for projection, but the old carts are also now not working (they were partially working previously).

The old carts will not work because the codebase is now configured to read from the newer 2-column format. I really should delete the previous 1-column carts out of the database since that is being used to populate the list of carts to choose. As for why the new cart is not working, I will look into this later. Finally starting to get some stability with projecting all the datasets (that don't run out of mem/cpu resources) and I want to push that code up to nemo-devel.

adkinsrs commented 2 years ago

you might want to check this: when i "preview" the genecarts they retain a blank row with the header info of the 1st column (i.e. ensembl gene ID / unique ID) and has dropped the gene symbol. so i think we need to retain both ids and correct the import of the header row.

Did not realize you could preview genecarts... is this from just clicking a genecart in the manager? Seems I will have to make some code changes for that. Since a weighted genecart is being saved as both a .tab and a .h5ad, I imagine the preview is reading from the h5ad, where it looks at strictly the expression using the index in the gene dataframe as the expression index.

also, the files i uploaded they were all .tab, does look like .csv works - lets make both work or let users know what format they need to be in

I have code in place to read from Excel, csv, or tab

adkinsrs commented 2 years ago

Ok pushed a commit to nemo-devel to fix the genecart preview to show both "gene" columns. The reason why the first column's header was weird was that Pandas puts the "index" header a row down from the other column headers.

Also am running one of the genecarts (Huttp12) through the updated front page search using the CarloTEMP profile. Given how many datasets are in this profile, this should be a good litmus test for my projectR code changes. So far, this is looking great with either the "no common genes" error popping up, or a valid projection plot popping up for each dataset.

I discovered that every time the Flask-RESTful API is called, it imports all of the modules used on a global level. For rpy2, this meant importing the high-level interface "rpy2.robjects" package, which auto-initializes an R instance. So regardless of whether the projectR API route or a plotting API route was called, a new instance of R was initialized. So I believe I found some stability by only importing this particular package whenever the rfuncs.run_projectR_cmd function is called, which should dramatically reduce the number of R instances that are started. I also set up a lock on the entire function so that it is allowed to complete before another thread is spawned, which is going to trade off speed for stability. Keep in mind, once projectR is run on a dataset-genecart-extra args combination, we shouldn't need to run that again, so subsequent runs should be faster.
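
The pattern described above (deferred import plus a function-wide lock) looks roughly like the sketch below. This is not the actual rfuncs code: `run` is a hypothetical callable standing in for the projectR call, and `r_module` is a parameter only so the sketch can be tried without R installed.

```python
import importlib
import threading

# A single lock for the process: only one projection runs at a time.
_PROJECTR_LOCK = threading.Lock()


def run_projectr_cmd(run, r_module="rpy2.robjects"):
    """Import the R interface lazily, then call run(module) under the lock."""
    with _PROJECTR_LOCK:
        # Deferred import: the R interface is only loaded when a projection is
        # actually requested, not whenever the API module itself is imported.
        r_interface = importlib.import_module(r_module)
        return run(r_interface)
```

One caveat: a threading.Lock only serializes threads within one process, so if the server runs several worker processes, each can still run its own projection.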

Things I still need to do:

Hide the hoverable information bar when an error shows (like "no common genes found")
Show information on number of genes in genecart vs dataset vs intersecting genes (which is now saved in the projections.json file on a "by_genecart" and "by_dataset" scope)

carlocolantuoni commented 2 years ago

All sounds great shaun - good stuff. I will take a peek later to see if the CarloTEMP projection of Huttp12 worked

carlocolantuoni commented 2 years ago

looking at the Huttp12 projections - worked for many, didnt for many more - most common errors seem to be "Gene not found" and "Gene not found in dataset" - as a time stamp its now 11pm friday eve, dont know if some projections are still running (would guess not, looks like you set it going 8 hours ago) or if this is just not working for some datasets.

another issue came up in my head just now - if your requested projections are still running, and i request the same ones, would that repeat them? do we need to write to the .json file when we start projections rather than when they end so this is not a problem?

adkinsrs commented 2 years ago

Hi @carlocolantuoni

Using the Huttp12 projections and the CarloTEMP profile I cannot replicate the issue where you are getting "Gene not found" for any datasets. What I see is 20 datasets where there are no common genes between dataset and genecart, and 16 where there are plots. It may be possible that you need to reset your browser cache (CTRL+Shift+R to reload page).

another issue came up in my head just now - if your requested projections are still running, and i request the same ones, would that repeat them? do we need to write to the .json file when we start projections rather than when they end so this is not a problem?

I will have to think about this. Currently I write to the .json file when the projectR output file is created, so there are no issues with bypassing projectR, since the output file already exists for use in the plotting step. If we both concurrently request projections, we cannot really have the projections.json file written at the start, since the later of the concurrent processes will think a projection output already exists, skip the projectR run, and fail in plotting because the output file may not exist (assuming the earlier request has not finished yet when the later one's plot step starts). Also, if we write the projection information upon starting projectR and projectR fails, then there will have to be extra code to roll back that configuration change.

Personally, outside of workshops (for which we can pre-run projections), I think the chance of two users requesting a new dataset-projection combination concurrently will be rare. If it happens, there is no harm in letting projectR run for both users. In this case, two outputs will be made and stored, but only one will be retrieved for future requests... duplicating the output and its config entry is unfortunate, but it should be very rare and it won't break the system.
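
The ordering argued for above (write the output first, record it in the JSON afterwards, and record nothing on failure) can be sketched as follows. The names and the single-key config are hypothetical, and `run_projection` stands in for the real projectR call.

```python
import json
import uuid
from pathlib import Path


def project_and_record(dataset_dir, pattern_source, run_projection):
    """Reuse a finished projection if one is recorded, otherwise run and record it."""
    dataset_dir = Path(dataset_dir)
    json_path = dataset_dir / "projections.json"
    records = json.loads(json_path.read_text()) if json_path.exists() else {}

    for config in records.get(pattern_source, []):
        output = dataset_dir / f"{config['uuid']}.csv"
        if output.exists():
            return output  # a recorded entry always points at a finished file

    # If this raises, nothing has been written, so there is nothing to roll back.
    csv_text = run_projection()

    config = {"uuid": str(uuid.uuid4())}
    output = dataset_dir / f"{config['uuid']}.csv"
    dataset_dir.mkdir(parents=True, exist_ok=True)
    output.write_text(csv_text)

    # Record the entry last, only after the output file exists.
    records.setdefault(pattern_source, []).append(config)
    json_path.write_text(json.dumps(records, indent=2))
    return output
```

Two truly concurrent requests would each run projectR and each append an entry, which is the rare duplication accepted above.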

carlocolantuoni commented 2 years ago

looks like u were right - the reload is now giving "No common genes between the target dataset and the pattern file."

carlocolantuoni commented 2 years ago

we would still expect the mouse datasets to give this error, as the patterns are from human and we haven't implemented the gene cross-referencing yet. but all the human datasets should be working now that we have 2 different gene ids in the gene cart that is projected, right?


adkinsrs commented 2 years ago

Took a quick look at one of the datasets (MmBrainDevo_ALLagesDownSamp25k: scRNA-seq of the developing mouse brain (La Manno EtAl 2021)) where no common genes were found. The index of the anndata.var was actually the gene symbols instead of Ensembl IDs. So while we have a mapping for the weighted gene cart, the pre-projectR step maps the unique identifier from the genecart to anndata.var.index, which in this case does not match. So I need to write some code to do extra mapping checks in case the unique identifiers of the dataset and the weighted genecart do not match (I had this in place before I left for vacation but dropped it in favor of the mapping approach in the weighted genecart).
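
One way to picture the extra mapping check is a fallback: try the genecart's unique identifiers against the dataset index first, and only if nothing overlaps, try the gene symbols. The sketch below uses plain Python containers in place of anndata.var.index and the genecart; the function name and the `{ensembl_id: gene_symbol}` cart layout are assumptions for illustration.

```python
def match_cart_to_dataset(cart, var_index):
    """Match genecart genes to a dataset's gene index.

    cart:      dict of {ensembl_id: gene_symbol} (hypothetical layout)
    var_index: list of the dataset's index labels (IDs or symbols)

    Returns (key_used, {index_label: ensembl_id}); key_used is None
    when neither identifier type overlaps with the index.
    """
    index_set = set(var_index)

    # First choice: the unique identifiers match the index directly.
    by_id = {eid: eid for eid in cart if eid in index_set}
    if by_id:
        return "ensembl_id", by_id

    # Fallback: the dataset is indexed by gene symbol instead.
    by_symbol = {}
    for eid, symbol in cart.items():
        if symbol in index_set:
            # Keep the first ID seen if a symbol appears more than once.
            by_symbol.setdefault(symbol, eid)
    if by_symbol:
        return "gene_symbol", by_symbol

    return None, {}
```

The comparison here is exact and case-sensitive, so it does not address cross-species symbols that differ in case or name.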

carlocolantuoni commented 2 years ago

That's also a mouse dataset so it won't work until we get the species mapping in there. Can u jump on our normal zoom link now?


adkinsrs commented 2 years ago

6/29/2022 Notes

Investigate why some human datasets show no overlapping genes when using a human genecart. It's suspected that the dataset genes use the gene symbol as the index instead of the ensembl ID.
I have confirmed with @carlocolantuoni that all of these datasets are public, so it is OK to list the dataset IDs here
IDs
- 1097004c-451a-341b-c18f-cf28d53e02a3
- ca3cfc52-f3ad-84ef-0dd7-30681cd6b3c5
- 0f63c807-139a-98eb-2768-54776533dea3
- 22f84ea4-4d27-f725-ae23-1f838b483ad0
- 34a7806f-f16d-b39b-0907-b399ebbefe27
- 37398715-ff7a-db0a-b8ba-eae1d6eaa122
- f5645717-e60c-ccf3-09ae-5035b5bab2b4 (chimp so maybe human genes)
- e910cb98-590c-9b46-17cf-f14fc3d27f1a
- a7fb94a7-f408-25cf-c570-ef2d1113b240 (chimp with maybe human genes)
- 9d647703-7c94-f85f-ec77-1a1b3b4ab557 (macaque)
- f5a3bbfd-0ac4-6c27-867d-44f24bc48d8e
- 45ba1749-b7ab-2d0c-1250-05c041c7c5c8
- e882ac44-ce4c-751f-5b4a-fafb175ac704
- 6f8b121f-b0e3-02e0-ddb3-6a05b5b521b1
- 73ea4305-0eda-94d2-102a-d4da8c12b525
- 887096d8-6b0e-0ddc-6bf3-ed11b904f173

adkinsrs commented 2 years ago

Can confirm all of the above datasets use non-Ensembl IDs as their unique index. They all use gene symbols under a "genes" column name except for the last one, which uses "genes". Only one or two of the gene dataframes actually contain a column for Ensembl IDs, whereas most of the datasets only have the gene symbols duplicated across different columns. I think the gene dataframe in these datasets will need to be updated.

carlocolantuoni commented 2 years ago

Joshua and shaun - can we make an automatic fix for datasets that don't have ensembl IDs? For example, when a dataset without them is involved in a requested projection, can we run a script to look them up so we can run the projection? I could do this in R with biomaRt, guessing you can do something similar in python.


adkinsrs commented 2 years ago

I think it would be an easy fix for same-species mapping. It may not be so trivial for cross-species mapping because of one-to-many mappings. We'd have to figure out how to resolve those ambiguous cases. Maybe have one option to take the first Ensembl ID that mapped, another option to skip the gene, and another option to use all Ensembl IDs that mapped.

I believe this would also depend on using the annotation dataframes stored on disk that @jorvis was proposing.
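
The three options for ambiguous mappings could look roughly like this. It is a sketch only: the function name, the strategy labels ("first", "skip", "all"), and the `{gene: [candidate IDs]}` input layout are assumptions, and the gene and ID names in any usage are placeholders rather than real annotations.

```python
def resolve_one_to_many(gene_to_ids, strategy="first"):
    """Resolve genes that map to more than one Ensembl ID.

    gene_to_ids: dict of {source gene: list of candidate Ensembl IDs}
    strategy:    "first" keeps the first candidate, "skip" drops
                 ambiguous genes, "all" keeps every candidate.
    """
    if strategy not in ("first", "skip", "all"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    resolved = {}
    for gene, ids in gene_to_ids.items():
        if not ids:
            continue  # nothing mapped for this gene
        if len(ids) == 1 or strategy == "all":
            resolved[gene] = list(ids)
        elif strategy == "first":
            resolved[gene] = [ids[0]]
        # strategy == "skip": leave ambiguous genes out entirely
    return resolved
```

Note that "first" depends on the order of the candidate list, so the result is only reproducible if the annotation source returns IDs in a stable order.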

jorvis commented 2 years ago

Also, rather than having this be a part of the execution when a user requests a projection we should instead do the following:

- Pass through all current datasets and add IDs for any datasets which don't have them.
- Ensure that the upload process involves dataset validation so that all have identifiers AND gene symbols at the beginning.
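
A pass over existing datasets needs some way to tell whether a gene table already carries Ensembl IDs. The sketch below is one possible heuristic, not the actual gEAR check: it assumes Ensembl gene IDs look like "ENS", an optional species code, "G", and 11 digits, and it takes the gene table as a plain index list plus a `{column name: values}` dict. The function names and the 0.9 threshold are invented for the example.

```python
import re

# Ensembl gene stable IDs, e.g. ENSG00000181449 or ENSMUSG00000074637,
# with an optional version suffix such as ".5".
ENSEMBL_GENE_RE = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def looks_like_ensembl(values, min_fraction=0.9):
    """True if at least `min_fraction` of non-empty values look like IDs."""
    values = [str(v) for v in values if v]
    if not values:
        return False
    hits = sum(1 for v in values if ENSEMBL_GENE_RE.match(v))
    return hits / len(values) >= min_fraction


def audit_gene_table(index, columns):
    """Report whether a dataset's gene table has usable Ensembl IDs."""
    index_ok = looks_like_ensembl(index)
    id_column = next(
        (name for name, vals in columns.items() if looks_like_ensembl(vals)),
        None,
    )
    return {
        "index_is_ensembl": index_ok,
        "ensembl_column": id_column,
        # Datasets flagged here would need IDs looked up and added.
        "needs_ids": not index_ok and id_column is None,
    }
```

The same audit could run at upload time to reject or flag datasets that lack identifiers before they reach a projection request.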

I've made a flow chart of this here: https://docs.google.com/presentation/d/14EHyXnY9GEOjSKr-hz4MbnL4VJDQgb_3rEiGKotQcl0/edit?usp=sharing

carlocolantuoni commented 2 years ago

totally agree - that flow looks perfect. the specific problem we are hitting with these datasets is due to my own omission of ensembl IDs in the more distant past, when i was using only gene symbols in the upload. but of course, as there was no check in the past, many other datasets could be similar. so going thru all current datasets and checking new ones would be perfect to make sure everything works

carlocolantuoni commented 2 years ago

as discussed on email: let's also implement a way to project simple, unweighted gene carts. this would be a simple add-on to the R script we currently use for projection.
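
One simple way to make an unweighted gene cart look like a weighted one is to give every gene in the list the same weight in a single column. This is an assumption about how the add-on might work, not something settled in this thread; the function name and the "unweighted" column label are placeholders.

```python
def unweighted_cart_to_weights(genes, label="unweighted"):
    """Turn a plain gene list into a one-column weight table.

    Returns {gene: {label: 1.0}}: genes as rows, one weight column,
    mirroring the genes-as-rows layout used for weighted carts.
    Duplicate genes are collapsed, keeping first-seen order.
    """
    weights = {}
    for gene in genes:
        weights.setdefault(gene, {label: 1.0})
    return weights
```

Genes absent from the cart are simply missing from the table here; whether they should instead appear with a weight of 0 depends on how the projection step aligns genes.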


carlocolantuoni commented 2 years ago

also as discussed on email:

we will expand where we can get gene carts from and what we can do with them - e.g. already getting them from the compare tool and PCA is great. can we draw gene carts from the volcano plot in the multi gene view? where else does gEAR/NeMO perform analyses that would be useful for this? can we associate a gene cart with a particular dataset (e.g. PCA of dataset X with dataset X)?

further, performing simple mathematical operations/transformations on gene carts and visualizing gene carts will be extremely important in understanding and using gene carts, e.g. plot one cart against another or log transform a weighted gene cart.
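
As a rough sketch of the two examples just mentioned, with a weighted gene cart modeled as a `{gene: weight}` dict: the log transform below is a signed `log1p`, chosen here only because weights may be zero or negative, and the pairing function returns the shared genes whose weights could be plotted against each other. Both function names are hypothetical.

```python
import math


def signed_log1p(weights):
    """Log-transform weights while keeping their sign.

    Uses sign(w) * log(1 + |w|), an assumption made so that zero and
    negative weights stay defined; a plain log would fail on them.
    """
    return {
        gene: math.copysign(math.log1p(abs(w)), w)
        for gene, w in weights.items()
    }


def paired_weights(cart_a, cart_b):
    """Return (gene, weight_a, weight_b) for genes present in both carts."""
    shared = sorted(set(cart_a) & set(cart_b))
    return [(gene, cart_a[gene], cart_b[gene]) for gene in shared]
```

The tuples from `paired_weights` can be handed to any plotting library as x/y coordinates for a cart-versus-cart scatter plot.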

is there a gene cart manager ticket i should move these points to?

adkinsrs commented 2 years ago

@carlocolantuoni I can make a separate ticket for each of these (will do next week).

I have started refactoring some code to accommodate the unweighted gene cart changes

carlocolantuoni commented 2 years ago

Great! Thanks shaun


IGS / gEAR

Implement ProjectR #243
