IGS / gEAR

The gEAR Portal was created as a data archive and viewer for gene expression data including microarrays, bulk RNA-Seq, single-cell RNA-Seq and more.
https://umgear.org
GNU Affero General Public License v3.0

Implement ProjectR #243

Closed adkinsrs closed 1 year ago

adkinsrs commented 2 years ago

This is a continuation of https://github.com/nemoarchive/analytics/issues/104 and https://github.com/nemoarchive/analytics/issues/111

Tagging carlocolantuoni to notify him of the new ticket, and I will close the existing tickets

[Image: Screen Shot 2022-03-10 at 10 28 56 AM]

Will make some subtasks for each of the diagrammed areas in the screenshot.

A basic demo currently exists at umgear.org/projection.html, but it only works for specific datasets. However, we have the display and gene list architecture in place to integrate this into the front page/gene results page.

jorvis commented 2 years ago

Here's the recording from our overview meeting on Friday:

https://zoom.us/rec/share/sTRLpBzpVTR_Oa_1wslLL2XwozGWqPFqViomY5tpbfWbxLWHJQCDCMMOWVMxQ8Q.8cf1SuBuoyQ336iN

Passcode: #v=Db34c

adkinsrs commented 2 years ago

ProjectR scratch notes

NeMO Archive

NeMO Analytics

Keep in mind this needs to ideally work on any gEAR portal

API

Gene aliases

? Projection curation page

Front page displays

Compare tool (lower priority)

Crons

adkinsrs commented 2 years ago

Talking points 3/23/22

Next Steps

adkinsrs commented 2 years ago

Talking points 3/30/22

Next Steps

adkinsrs commented 2 years ago

Talking points 5/4/22

carlocolantuoni commented 2 years ago

some detail about saving the projections: maybe we could have a log file in the same "projection" dir that lists the details of every projection ever done and where it is, so we know what's there and whether a newly requested one has to be run or just loaded.

adkinsrs commented 2 years ago

some detail about saving the projections: maybe we could have a log file in the same "projection" dir that lists the details of every projection ever done and where it is, so we know what's there and whether a newly requested one has to be run or just loaded.

May not be necessary to have a log of all the previous runs. Currently I just check to see if the expected output file exists on the filesystem. If it doesn't, projectR will run fully and the output will be saved in that location. If it does, we just use that file. With the dataset being in the directory path and the pattern source or gene cart share ID being in the filename, it's easy to determine if that combo has been run in projectR yet.
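A minimal sketch of the existence check described above, assuming a layout of `<root>/<dataset_id>/<pattern_source>.csv`. The function names, the `.csv` suffix, and the `run_projection` callable are hypothetical stand-ins, not the actual gEAR code.

```python
from pathlib import Path
from typing import Callable


def projection_path(root: Path, dataset_id: str, pattern_source: str) -> Path:
    """Expected output location: dataset in the directory, pattern source in the filename."""
    return root / dataset_id / f"{pattern_source}.csv"


def get_or_run_projection(
    root: Path,
    dataset_id: str,
    pattern_source: str,
    run_projection: Callable[[Path], None],
) -> Path:
    """Reuse the saved output if it exists; otherwise run the projection and save it there."""
    out = projection_path(root, dataset_id, pattern_source)
    if out.exists():
        return out  # this dataset/pattern combination was already projected
    out.parent.mkdir(parents=True, exist_ok=True)
    run_projection(out)  # hypothetical callable that writes the projectR output to `out`
    return out
```

With this layout, a second request for the same combination never reaches `run_projection`, but any extra run parameter has to be squeezed into the filename.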

carlocolantuoni commented 2 years ago

If there is enough info in the filename that works, but we may want to have a good bit of info on how the projection was run, so not sure if that will be enough


adkinsrs commented 2 years ago

If there is enough info in the filename that works, but we may want to have a good bit of info on how the projection was run, so not sure if that will be enough

What other information do we need besides the name of the weights file, the dataset to project on, and if it requires the PCA algorithm in projectR? We can add all that information in the filename

jorvis commented 2 years ago

I'd avoid using the filename unless we are really sure we can elucidate all the current/future parameters we'll want to store, especially when it's so easy to just drop a JSON/YAML/whatever file next to it.

adkinsrs commented 2 years ago

That's a fair point. I was trying to minimize the amount of checking in order to assess if this particular dataset-projection combination has been run. I'll change it to check for a JSON file and read/write to that, in addition to creating the projection output file (which will use a UUID filename).

jorvis commented 2 years ago

Hmm, it's an interesting use case for sure, and the os.path.exists() call is a lot less work than reading in a series of config files. I'll leave it to you to decide. It just gets tricky if parameters are ever added later.

adkinsrs commented 2 years ago

I was thinking of having one "projections.json" file per dataset, and this single JSON file would store a dict of lists of configurations, where the pattern source would be the key. Each configuration would include the UUID filename and any set parameters, such as whether the PCA algorithm was used in projectR. That way, we only need two os.path.exists() calls at most (projections.json and the UUID output file). The only potential downside is information retrieval speed if the JSON gets huge.

{
  <pattern_source>: [
    configuration_options dict * N configs
    ],
  <pattern_source 2>: [
    configuration_options dict * N configs
    ]
}
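
The structure above could be read and updated roughly as follows. This is only a sketch: the helper names and the `is_pca` parameter key are assumptions for illustration, and the real configuration keys may differ.

```python
import json
import uuid
from pathlib import Path
from typing import Optional


def load_projections(json_path: Path) -> dict:
    """Read projections.json, or return an empty index if it does not exist yet."""
    if not json_path.exists():
        return {}
    return json.loads(json_path.read_text())


def find_config(projections: dict, pattern_source: str, params: dict) -> Optional[dict]:
    """Return the stored configuration whose parameters match exactly, if any."""
    for config in projections.get(pattern_source, []):
        stored = {k: v for k, v in config.items() if k != "uuid"}
        if stored == params:
            return config
    return None


def add_config(json_path: Path, pattern_source: str, params: dict) -> dict:
    """Append a new configuration (with a fresh UUID filename stem) and save the index."""
    projections = load_projections(json_path)
    config = {"uuid": str(uuid.uuid4()), **params}
    projections.setdefault(pattern_source, []).append(config)
    json_path.write_text(json.dumps(projections, indent=2))
    return config
```

A lookup would then be `find_config(load_projections(path), "<pattern_source>", {"is_pca": False})`, followed by a check that the UUID-named output file is present.
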
jorvis commented 2 years ago

That seems safe. If doing projections on any individual datasets gets so popular that we have to worry about the size of that projections.json file we'll probably have to be refactoring anyway.

adkinsrs commented 2 years ago

Talking points 5/11/22

See image for things to improve for the coming week.

[Image: Screen Shot 2022-05-11 at 2 03 00 PM]

Other talking things

Future steps

adkinsrs commented 2 years ago

Regarding the "clear old search results" comment in the image, it is worth noting that they are cleared much, much faster on the servers than on my Docker instance. But there is a brief moment, particularly if you have to run projectR, where having the old results displayed can generate confusion

adkinsrs commented 2 years ago

May 25, 2022 Notes

Next meeting will be on June 15 due to various unavailabilities between @carlocolantuoni and myself.

adkinsrs commented 2 years ago

@jorvis and I had a brief Slack chat, and feel it's best to require any uploaded weighted gene cart to have both a column of identifiers and a column of gene symbols. Any weighted gene carts without both columns present will not be accepted. This would also help with cross-species mapping, since it can be based off of the identifier. I think the identifier should be the first column, the gene symbol should be the second column, and any subsequent columns can be the weights. Any previously uploaded gene carts can be modified with existing scripts.

adkinsrs commented 2 years ago

Note to myself - I will need to modify the "save weighted gene cart" processes in the compare tool and analysis workbench to account for the identifier/gene_sym column pair.

carlocolantuoni commented 2 years ago

Sounds good


adkinsrs commented 2 years ago

June 15, 2022 Notes

Action Items

  • @carlocolantuoni reports that things previously working in nemo-devel do not work now. Maybe I updated some code by mistake? Anyway, I am going to roll out my most recent changes (since projections work on my Docker instance), and see if the nemo-devel ones are still broken.
    • @carlocolantuoni, can you generate some 2-column weighted gene carts (unique IDs and gene symbols) for me to test with? The current ones on nemo-devel are just a single column and weights. @jorvis mentioned previously he has scripts that can retrieve the Ensembl ID or the gene symbols.
  • Would like to add @jorvis to next Wednesday's 10:30a (EDT) meeting to discuss cross-species ortholog mapping if he can attend.

Other

  • There was discussion on having a tool in the gene cart manager page to map the identifiers or the gene symbols, provided the other + the organism is known. This would give flexibility to users who may not have the ID-to-gene symbol mapping, or do not know how to acquire it. I am ok with the idea, but am a bit wary that what we generate may differ from what the user expects (perhaps we put up a warning message).

carlocolantuoni commented 2 years ago

yes will make some 2 col gene carts to upload

adkinsrs commented 2 years ago

Another suggestion by @carlocolantuoni

We currently have projections.json files in each dataset directory. We can also write reciprocal projection.json files on a per-genecart basis, where the dataset ID would be the key. Could also symlink the projection output file in that directory too.
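One way to picture the reciprocal files is as an inversion of the per-dataset index. The sketch below works on in-memory dicts only; the function name and the shape of the combined input (`dataset_id -> pattern_source -> configs`) are assumptions, and the symlinking step is left out.

```python
from typing import Dict, List


def invert_projection_index(
    per_dataset: Dict[str, Dict[str, List[dict]]]
) -> Dict[str, Dict[str, List[dict]]]:
    """Turn dataset_id -> pattern_source -> configs into pattern_source -> dataset_id -> configs."""
    per_cart: Dict[str, Dict[str, List[dict]]] = {}
    for dataset_id, by_source in per_dataset.items():
        for pattern_source, configs in by_source.items():
            # Copy the config list so the two views do not share mutable state.
            per_cart.setdefault(pattern_source, {}).setdefault(dataset_id, []).extend(configs)
    return per_cart
```

Each top-level key of the result would correspond to one per-genecart file, answering "which datasets has this gene cart been projected onto?" without scanning every dataset directory.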

carlocolantuoni commented 2 years ago

shaun - i will upload new gene carts with 2 cols of gene ids: Ensembl IDs and gene symbols. what order do these need to be in (Ensembl gene ID then symbol, or reverse?) and do they need specific header names?

carlocolantuoni commented 2 years ago

i was going to upload gene carts to nemo devel but im getting "This site can’t be reached" - is that where i should upload them - and is nemo devel working?

adkinsrs commented 2 years ago

@carlocolantuoni I'll take a look at that today. Other than setting that one profile you use to public, I have not touched the codebase on nemo-devel.

adkinsrs commented 2 years ago

@carlocolantuoni It seemed like the VM was hosed (I couldn't SSH into it), so I stopped it and started it back up. Site seems to be working now.

I also updated the codebase on nemo-devel. The previous codebase was at commit 2587998 (https://github.com/IGS/gEAR/commit/2587998b6e9ff7209396d685102dbd30657312ed).

In addition, since I reworked the directories where the projection.json files (and projection output CSVs) are stored, I wiped the existing projection.json files. So any previous projection will need to be run again.

carlocolantuoni commented 2 years ago

Ok, i will upload the new gene carts shortly, can u remind me of column order and column headers needed for the new 2 id format?


adkinsrs commented 2 years ago

Col 1 - Unique identifier
Col 2 - Gene symbol
Col 3+ - Numeric weights

Headers are required.
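
A rough sketch of how an upload check for this layout might look, using only the standard library. The function name and messages are hypothetical, and the "header row is missing" test is a heuristic assumption (a first row whose weight cells are all numeric is treated as data), not the check gEAR actually performs.

```python
import csv
import io
from typing import List


def _is_number(value: str) -> bool:
    try:
        float(value)
    except ValueError:
        return False
    return True


def validate_weighted_cart(text: str, delimiter: str = "\t") -> List[str]:
    """Return a list of problems; an empty list means the cart passes."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    if len(header) < 3:
        return ["expected an identifier column, a gene symbol column, and at least one weight column"]
    problems = []
    # Heuristic: real headers for weight columns should not all parse as numbers.
    if all(_is_number(cell) for cell in header[2:]):
        problems.append("first row looks like data; a header row is required")
    for row_no, row in enumerate(body, start=1):
        if len(row) != len(header):
            problems.append(f"data row {row_no}: expected {len(header)} fields, found {len(row)}")
        elif not row[0].strip() or not row[1].strip():
            problems.append(f"data row {row_no}: identifier and gene symbol must both be present")
        elif not all(_is_number(cell) for cell in row[2:]):
            problems.append(f"data row {row_no}: weights must be numeric")
    return problems
```

Passing `delimiter=","` covers comma-separated uploads; Excel files would need a separate reader.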

carlocolantuoni commented 2 years ago

i just uploaded 3 new gene carts - all with titles in format "xxxxx.2idCols". doesnt look like any are working for projection, but the old carts are also now not working (they were partially working previously).

you might want to check this: when i "preview" the genecarts they retain a blank row with the header info of the 1st column (i.e. ensembl gene ID / unique ID) and has dropped the gene symbol. so i think we need to retain both ids and correct the import of the header row.

also, the files i uploaded they were all .tab, does look like .csv works - lets make both work or let users know what format they need to be in

adkinsrs commented 2 years ago

i just uploaded 3 new gene carts - all with titles in format "xxxxx.2idCols". doesnt look like any are working for projection, but the old carts are also now not working (they were partially working previously).

The old carts will not work because the codebase is now configured to read from the newer 2-column format. I really should delete the previous 1-column carts out of the database since that is being used to populate the list of carts to choose. As for why the new cart is not working, I will look into this later. Finally starting to get some stability with projecting all the datasets (that don't run out of mem/cpu resources) and I want to push that code up to nemo-devel.

adkinsrs commented 2 years ago

you might want to check this: when i "preview" the genecarts they retain a blank row with the header info of the 1st column (i.e. ensembl gene ID / unique ID) and has dropped the gene symbol. so i think we need to retain both ids and correct the import of the header row.

Did not realize you could preview genecarts... is this from just clicking a genecart in the manager? Seems I will have to make some code changes for that. Since a weighted genecart is being saved as both a .tab and a .h5ad, I imagine the preview is reading from the h5ad, where it looks at strictly the expression using the index in the gene dataframe as the expression index.

also, the files i uploaded they were all .tab, does look like .csv works - lets make both work or let users know what format they need to be in

I have code in place to read from Excel, csv, or tab

adkinsrs commented 2 years ago

Ok pushed a commit to nemo-devel to fix the genecart preview to show both "gene" columns. The reason why the first column's header was weird was that Pandas puts the "index" header a row down from the other column headers.

Also am running one of the genecarts (Huttp12) through the updated front page search using the CarloTEMP profile. Given how many datasets are in this profile, this should be a good litmus test of my projectR code changes. So far, this is looking great, with either the "no common genes" error popping up or a valid projection plot popping up for each dataset.

I discovered that every time the Flask-RESTful API is called, it imports all of the modules used on a global level. For rpy2, this meant importing the high-level interface "rpy2.robjects" package, which auto-initializes an R instance. So regardless of whether the projectR API route or a plotting API route was called, a new instance of R was initialized. So I believe I found some stability by only importing this particular package whenever the rfuncs.run_projectR_cmd function is called, which should dramatically reduce the number of R instances that are started. I also set up a lock on the entire function so that it is allowed to complete before another thread is spawned, which is going to trade off speed for stability. Keep in mind, once projectR is run on a dataset-genecart-extra args combination, we shouldn't need to run that again, so subsequent runs should be faster.

Things I still need to do:

carlocolantuoni commented 2 years ago

All sounds great shaun - good stuff. I will take a peek later to see if the CarloTEMP projection of Huttp12 worked


carlocolantuoni commented 2 years ago

looking at the Huttp12 projections - worked for many, didnt for many more others - most common errors seem to be "Gene not found" and "Gene not found in dataset" - as a time stamp its now 11pm friday eve, dont know if some projections are still running (would guess not, looks like you sent it going 8 hours ago) or if this is just not working for some datasets.

another issue came up in my head just now - if your requested projections are still running, and i request the same ones, would that repeat them? do we need to write to the .json file when we start and not end projections so this is not a problem?

adkinsrs commented 2 years ago

Hi @carlocolantuoni

Using the Huttp12 projections and the CarloTEMP profile I cannot replicate the issue where you are getting "Gene not found" for any datasets. What I see is 20 datasets where there are no common genes between dataset and genecart, and 16 where there are plots. It may be possible that you need to reset your browser cache (CTRL+Shift+R to reload page).

another issue came up in my head just now - if your requested projections are still running, and i request the same ones, would that repeat them? do we need to write to the .json file when we start and not end projections so this is not a problem?

I will have to think about this. Currently I write to the .json file when the projectR output file is created, so there are no issues with bypassing projectR, since the output file already exists for use in the plotting step. If we both concurrently request projections, we cannot really have the projection.json file written at the start, since the later of the concurrent processes will think a projection output already exists, skip the projectR run, and fail in plotting because the output file may not exist (assuming the earlier request has not finished yet when the later one's plot step starts). Also, if we write the projection information upon starting projectR and projectR fails, then there will have to be extra code to roll back that configuration change.

Personally, outside of workshops (for which we can pre-run projections), I think the chance of two users requesting a new dataset-projection combination concurrently will be rare. If it happens, there is no harm in letting projectR run for both users. In this case, two outputs will be made and stored, but only one will be retrieved for future requests. Duplicating the output and its config entry is unfortunate, but it should be very rare and it won't break the system.
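
The ordering argued for above (output first, index entry second) might look like the sketch below. The function name, the `.csv` suffix, and the `run_projection` callable are assumptions; writing the index through a temporary file and `os.replace` is an extra precaution added here, not something described in the thread.

```python
import json
import os
import tempfile
import uuid
from pathlib import Path
from typing import Callable


def run_and_record(dataset_dir: Path, pattern_source: str,
                   run_projection: Callable[[Path], None]) -> dict:
    """Run the projection, and only record it in projections.json once the output exists."""
    dataset_dir.mkdir(parents=True, exist_ok=True)
    config = {"uuid": str(uuid.uuid4())}
    output = dataset_dir / f"{config['uuid']}.csv"
    run_projection(output)  # if this raises, nothing is recorded and no rollback is needed

    json_path = dataset_dir / "projections.json"
    projections = json.loads(json_path.read_text()) if json_path.exists() else {}
    projections.setdefault(pattern_source, []).append(config)

    # Write to a temp file and rename, so readers never see a half-written index.
    fd, tmp_name = tempfile.mkstemp(dir=dataset_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(projections, fh, indent=2)
    os.replace(tmp_name, json_path)
    return config
```

Two concurrent requests each get their own UUID output, which matches the "duplicate but harmless" case. The read-modify-write of the index itself is not locked in this sketch, so one of two simultaneous entries could still be overwritten.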

carlocolantuoni commented 2 years ago

looks like u were right - the reload is now giving "No common genes between the target dataset and the pattern file."


carlocolantuoni commented 2 years ago

the datasets from mouse we would still expect to give this error as the patterns are from human and we havent implemented the gene cross referencing yet. but all the human datasets should be working now that we have 2 different gene ids in the gene cart that is projected, right?


adkinsrs commented 2 years ago

Took a quick check on one of the datasets (MmBrainDevo_ALLagesDownSamp25k: scRNA-seq of the developing mouse brain (La Manno EtAl 2021)) where no common genes were found. The index of the anndata.var was actually the gene symbols instead of Ensembl IDs. So while we have a mapping for the weighted gene cart, the pre-projectR run was based on mapping the unique identifier from the genecart to anndata.var.index, which in this case does not match. So I need to write some code to do extra mapping checks in case the unique identifiers of the dataset and the weighted genecart do not match (which I did have in place before I left for vacation, but dropped in favor of the mapping approach in the weighted genecart).
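
One possible form of that extra check, sketched with plain Python collections rather than anndata: compare the dataset index against both cart columns and join on whichever overlaps more. The function name and the tie-breaking rule (prefer identifiers) are assumptions.

```python
from typing import Iterable, Optional, Set, Tuple


def pick_join_column(
    dataset_index: Iterable[str],
    cart_ids: Iterable[str],
    cart_symbols: Iterable[str],
) -> Tuple[Optional[str], Set[str]]:
    """Decide whether to match the dataset index on cart identifiers or on gene symbols."""
    index = set(dataset_index)
    id_hits = index & set(cart_ids)
    symbol_hits = index & set(cart_symbols)
    if not id_hits and not symbol_hits:
        return None, set()  # the "no common genes" case
    if len(id_hits) >= len(symbol_hits):
        return "identifier", id_hits
    return "gene_symbol", symbol_hits
```

For a dataset indexed by gene symbols, this returns `"gene_symbol"` together with the shared symbols, instead of failing on the identifier comparison alone.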

carlocolantuoni commented 2 years ago

Thats also a mouse dataset so it wont work until we get the species mapping in there. Can u jump on our normal zoom link now?


adkinsrs commented 2 years ago

6/29/2022 Notes

adkinsrs commented 2 years ago

Can confirm all of these above datasets use non-ensembl IDs as their unique index. They all use gene symbols under a "genes" column name except for the last one, which uses "genes". Only one or two of the gene dataframes actually contains a column for EnsemblIDs, whereas most of the datasets only have the gene symbols duplicated across different columns. I think the gene dataframe in these datasets will need to be updated.

carlocolantuoni commented 2 years ago

Joshua and shaun - can we make an automatic fix for datasets that dont have ensembl IDs? For example, when a dataset without them is involved in a requested projection, can we run a script to look them up so we can run the projection? I could do this in R with biomaRt, guessing you can do something similar in python.


adkinsrs commented 2 years ago

I think it would be an easy fix for same-species mapping. It may not be so trivial when it comes to cross-species mapping due to one-to-many mappings. We'd have to figure out how to resolve those ambiguous cases. Maybe have one option to take the first Ensembl ID that mapped, another option to skip the gene, and another option to use all Ensembl IDs that mapped.

I believe this would also depend on using the annotation dataframes stored on disk that @jorvis was proposing.
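
The three options for ambiguous mappings could be expressed as a single resolver. This is a sketch under assumptions: the function name, the strategy labels, and the input shape (source ID mapped to a list of candidate target IDs) are all illustrative.

```python
from typing import Dict, List


def resolve_orthologs(mapping: Dict[str, List[str]], strategy: str = "first") -> Dict[str, List[str]]:
    """Resolve one-to-many ID mappings by taking the first hit, skipping the gene, or keeping all hits."""
    if strategy not in {"first", "skip", "all"}:
        raise ValueError(f"unknown strategy: {strategy}")
    resolved: Dict[str, List[str]] = {}
    for source_id, targets in mapping.items():
        if not targets:
            continue  # nothing mapped for this gene
        if len(targets) == 1 or strategy == "all":
            resolved[source_id] = list(targets)
        elif strategy == "first":
            resolved[source_id] = [targets[0]]
        # strategy == "skip": ambiguous genes are dropped entirely
    return resolved
```

Whichever strategy is used would probably need to be stored alongside the other projection parameters, since it changes which genes enter the projection.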

jorvis commented 2 years ago

Also, rather than having this be a part of the execution when a user requests a projection we should instead do the following:

  1. Pass through all current datasets and add IDs for any datasets which don't have them.
  2. Ensure that the upload process involves dataset validation so that all have identifiers AND gene symbols at the beginning.

I've made a flow chart of this here: https://docs.google.com/presentation/d/14EHyXnY9GEOjSKr-hz4MbnL4VJDQgb_3rEiGKotQcl0/edit?usp=sharing

carlocolantuoni commented 2 years ago

totally agree - that flow looks perfect. the specific problem we are hitting with these datasets is due to my own omission of ensembl IDs in the more distant past when i was using only gene symbols in the upload. but of course as there was not a check in the past, many other datasets could be similar. so going thru all current datasets and checking new ones would be perfect to make sure everything works

carlocolantuoni commented 2 years ago

as discussed on email: lets also implement a way to project simple, unweighted gene carts. this would be a simple add-on to the R script we currently use for projection.


carlocolantuoni commented 2 years ago

also as discussed on email:

we will expand where we can get gene carts from and what we can do with them - e.g. already getting them from the compare tool and PCA is great. can we draw gene carts from the volcano plot in the multi gene view? where else does gEAR/NeMO perform analyses that would be useful for this? can we associate a gene cart with a particular dataset (e.g. PCA of dataset X with dataset X)?

further, performing simple mathematical operations/transformation on gene carts and visualizing gene carts will be extremely important in understanding and using gene carts, e.g. plot one against another or log transform a weighted gene cart.

is there a gene cart manager ticket i should move these points to?

adkinsrs commented 2 years ago

@carlocolantuoni I can make a separate ticket for each of these (will do next week).

I have started refactoring some code to accommodate the unweighted gene cart changes.

carlocolantuoni commented 2 years ago

Great! Thanks shaun
