IGS / gEAR

The gEAR Portal was created as a data archive and viewer for gene expression data including microarrays, bulk RNA-Seq, single-cell RNA-Seq and more.
https://umgear.org
GNU Affero General Public License v3.0

add new projection methods #594

Open carlocolantuoni opened 9 months ago

carlocolantuoni commented 9 months ago

these will include call outs to R just as @adkinsrs has already implemented:

1] new nmf method called "fixed gene weights in NMF re-run"; when we add this method, we should rename the old NMF projection method "least squares optimization for NMF"

2] 2siLCA

3] structured (2sLCA, 2siLCA, & jointNMF) - selection of multiple gene carts, careful selection of datasets

adkinsrs commented 9 months ago

Is there any documentation on these new projection methods? Are they implemented in projectR (currently we only use the R package to project with least-squares NMF)?

carlocolantuoni commented 9 months ago

they will all be calls to R - we are collecting info on exactly how to do these - ready soon


carlocolantuoni commented 7 months ago

in addition to the new projection methods, we need to structure the "chunking" done in all projection methods. randomly selecting samples for the chunks would work here.

carlocolantuoni commented 7 months ago

the new projection method will be in our SJD package, so it might be worth installing that now in preparation in case there are dependency issues to sort out - you can install SJD with this:

    library(devtools)
    install_github("CHuanSite/SJD")
    library(SJD)

after that we will be using the "projectNMF" function to do the projection itself, the code will look something like this:

    projectionX = projectNMF(proj_dataset = ExpressionMatrix, proj_group = TRUE, list_component = WeightedGeneCart)$proj_score_list

carlocolantuoni commented 7 months ago

following the new projection methods we will need to run a few lines of code to "balance" the chunks of samples that were run separately - will also send that code asap

adkinsrs commented 7 months ago

Not sure if you are aware but SJD cannot be installed on R v4.3 (which is what I had locally)

1: packages ‘Biobase’, ‘biomaRt’ are not available for this version of R

However, the nemo-prod server uses R v4.1, and after attempting to install SJD there I got info that "biomaRt" is not available there.

However, once I installed biomaRt using the BiocManager, I was able to get SJD installed afterwards :-)

carlocolantuoni commented 7 months ago

Aaaah dependencies! Glad u got it to finally work.


adkinsrs commented 7 months ago

> following the new projection methods we will need to run a few lines of code to "balance" the chunks of samples that were run separately - will also send that code asap

@carlocolantuoni, is it fair to shuffle the dataframe for all algorithms, or does it need to just be the NMF only? Was wondering since doing a shuffle for all would save some lines of code.

carlocolantuoni commented 7 months ago

the shuffle would be fine for all methods 👍
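
A minimal Python sketch of the shuffle-then-chunk idea discussed above, assuming the expression data is a pandas DataFrame with genes in rows and samples in columns. The function name `shuffled_chunks`, the `chunk_size` parameter, and the fixed seed are illustrative assumptions, not gEAR's actual code.

```python
import numpy as np
import pandas as pd


def shuffled_chunks(exprs: pd.DataFrame, chunk_size: int, seed: int = 0):
    """Shuffle sample columns once, then split them into consecutive chunks.

    Hypothetical helper for illustration only.
    """
    # frac=1 with axis=1 returns every column exactly once, in random order.
    shuffled = exprs.sample(frac=1, axis=1, random_state=seed)
    n_chunks = max(1, int(np.ceil(shuffled.shape[1] / chunk_size)))
    # array_split tolerates a column count that is not a multiple of n_chunks.
    col_groups = np.array_split(np.arange(shuffled.shape[1]), n_chunks)
    return [shuffled.iloc[:, idx] for idx in col_groups]


# Toy matrix: 3 genes (rows) x 7 samples (columns).
exprs = pd.DataFrame(
    np.arange(21).reshape(3, 7),
    index=["g1", "g2", "g3"],
    columns=[f"s{i}" for i in range(7)],
)
chunks = shuffled_chunks(exprs, chunk_size=3)
```

Because the shuffle happens once before splitting, the same code path can serve every projection algorithm, and each chunk is a random subset of the samples rather than a block of neighbouring ones.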


adkinsrs commented 7 months ago

> these will include call outs to R just as @adkinsrs has already implemented:
>
> 1] new nmf method called "fixed gene weights in NMF re-run"; when we add this method, we should rename the old NMF projection method "least squares optimization for NMF"
>
> 2] 2siLCA
>
> 3] structured (2sLCA, 2siLCA, & jointNMF) - selection of multiple gene carts, careful selection of datasets

So I am currently working on 1), which I will abbreviate as fixedNMF in our projection JSON files (so I don't need to rename the original NMF). Are there formulas for the methods in 2) and 3)?

carlocolantuoni commented 7 months ago

that's right, and naming sounds good. for #2 we are working on it and #3 will be a bit more complicated - we can discuss wed


adkinsrs commented 7 months ago

@carlocolantuoni I implemented the "fixed nmf" algorithm (and was scared that it worked the first time). But is it common for the "fixed NMF" to give identical output to the least-squares NMF algorithm?

carlocolantuoni commented 7 months ago

not sure how that's possible. how did u implement the "fixed nmf"? using what function and arguments? it shouldn't be working yet.

the output of the 2 methods will likely be close but not identical
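
A rough NumPy illustration of why the two outputs could be close without being identical. This is a sketch under stated assumptions, not SJD's or projectR's code: it assumes "least squares" means an unconstrained fit of the data to the loadings, and that "fixed gene weights in NMF re-run" means running NMF-style multiplicative updates for the sample scores while the gene loadings `W` stay fixed. Both function names are hypothetical.

```python
import numpy as np


def project_least_squares(X, W):
    """Unconstrained least squares: H minimizing ||X - W @ H||_F."""
    H, *_ = np.linalg.lstsq(W, X, rcond=None)
    return H


def project_fixed_weights(X, W, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative NMF updates for H with W held fixed, so H stays >= 0.

    Assumed interpretation of a "fixed gene weights" re-run; illustration only.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + 0.1  # strictly positive start
    WtX = W.T @ X
    WtW = W.T @ W
    for _ in range(n_iter):
        H *= WtX / (WtW @ H + eps)
    return H


rng = np.random.default_rng(1)
W = rng.random((50, 3))                       # genes x patterns (loadings)
H_true = rng.random((3, 8))                   # patterns x samples
X = W @ H_true + 0.01 * rng.random((50, 8))   # genes x samples, non-negative

H_ls = project_least_squares(X, W)
H_fixed = project_fixed_weights(X, W)
```

The unconstrained solution can never have a larger residual than the non-negative one, and the non-negative scores depend on the iteration count and starting point, so the two matrices can agree closely while still differing.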

adkinsrs commented 7 months ago

Here is the code I have

    import sys
    from rpy2.robjects.packages import importr

    # Run the projectR command to get the projectionPatterns matrix.
    # (Excerpt: the except clause that closes this try block is not shown.)
    try:
        if algorithm == "nmf":
            projectR = importr('projectR')
            projection_patterns_r_matrix = projectR.projectR(data=target_r_matrix, featureLoadings=loading_r_matrix, full=False)
        elif algorithm == "fixednmf":
            sjd = importr('SJD')
            # R call suggested earlier in this thread:
            # projectionX=projectNMF(proj_dataset=ExpressionMatrix, proj_group=TRUE, list_component=WeightedGeneCart)$proj_score_list
            projection_patterns_r_matrix = sjd.projectNMF(proj_dataset=target_r_matrix, proj_group=True, list_component=loading_r_matrix)
            print(projection_patterns_r_matrix, file=sys.stderr)
        else:
            raise ValueError(f"Algorithm {algorithm} is not supported")

adkinsrs commented 7 months ago

> not sure how that's possible. how did u implement the "fixed nmf"? using what function and arguments? it shouldn't be working yet.
>
> the output of the 2 methods will likely be close but not identical

I realized what happened. I had not uploaded my code changes to the Google Cloud Run service, but I still had my docker image configured to send the chunks there instead of running locally (where the updated code resided).

carlocolantuoni commented 7 months ago

ok - lets hold off on this til i have 1] the final version of the projectNMF function built into the SJD package, and 2] the extra code specifically/only for this fixedNMF method that will balance/calibrate the sample chunks that are run separately.


carlocolantuoni commented 6 months ago

hey @adkinsrs - once u have SJD installed this is the R code you will need - oddly structured, but should work:

    # new fixed NMF projection

    library(SJD)

    # "exprs" is the expression data matrix with genes in rows and samples/cells in columns
    # "cart" is the gene cart, a matrix of loadings with genes in rows and patterns in columns

    cartLIST = list(genesig = cart)  # just because projectNMF() below needs this to be a list

    xproj = projectNMF(proj_dataset = exprs, proj_group = TRUE, list_component = cartLIST)$proj_score_list$genesig

    # "xproj" will be the matrix of projection values/embeddings to display, with patterns in rows and samples/cells in columns

let me know if this is clear and if it works

carlocolantuoni commented 6 months ago

don't know why the code came thru with the crazy formatting - let me know if i need to send again - here it is attached as text: fixNMFprojection_forShanun.txt

adkinsrs commented 6 months ago

> don't know why the code came thru with the crazy formatting

You used "#" at the beginning of lines, which marks a "header" or "subtitle" in GitHub. Next time add "```" to the beginning and end of your code, and it will be treated as a code block. For now, I'm going to edit your comment to correct this.

carlocolantuoni commented 6 months ago

👍🏽


carlocolantuoni commented 6 months ago

hey @adkinsrs - we need to add a few lines of code to do a column normalization on the WHOLE matrix BEFORE sending it out in chunks for projection - will that be possible? maybe call a high memory node 1st to do that and then several small nodes for the projection in chunks?

    cs = colSums(exprs)
    cs = cs/min(cs)
    exprs = sweep(exprs, 2, cs, "/")

adkinsrs commented 6 months ago

Would this apply to all projection methods, or just this fixed-gene NMF function?

I assume the first two lines are taking the sums of all expressions per sample, and then dividing each sum by the smallest sum recorded to get a normalized value per sample. After this, for each gene you are dividing each expression value by its normalized sample value to get a normalized expression for the gene+sample.

Let me know if that is not right. I don't think it's going to be necessary to use another hi-mem node to do this. Python can easily use numpy functions to do vectorized operations across a data series, which can save memory.
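
A small NumPy sketch of that point, assuming genes in rows and samples in columns. The only quantity that needs the whole matrix is the vector of column sums (and its minimum); once those factors exist, each chunk can be scaled independently. The function names and `chunk_size` are hypothetical, not gEAR's code.

```python
import numpy as np


def column_scale_factors(exprs: np.ndarray) -> np.ndarray:
    """Per-sample factors: column sums divided by the smallest column sum."""
    cs = exprs.sum(axis=0)
    return cs / cs.min()


def normalize_in_chunks(exprs: np.ndarray, chunk_size: int):
    """Yield normalized column chunks using factors from the whole matrix."""
    factors = column_scale_factors(exprs)  # one pass, one small vector
    for start in range(0, exprs.shape[1], chunk_size):
        stop = start + chunk_size
        # Broadcasting divides each column of the chunk by its own factor.
        yield exprs[:, start:stop] / factors[start:stop]


exprs = np.array([[1.0, 2.0, 4.0],
                  [1.0, 2.0, 4.0]])  # column sums are 2, 4, 8
chunks = list(normalize_in_chunks(exprs, chunk_size=2))
whole = exprs / column_scale_factors(exprs)
```

Scaling chunk by chunk with the global factors gives the same values as normalizing the full matrix at once, which is why the minimum has to be computed before the data is split.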

carlocolantuoni commented 6 months ago

yes that's right - and this is just for the fixedNMF procedure. so are you going to try to do this in python then? "sweep" in this case divides every number in a column by the (normalized) column sum. help in R:

    sweep                   package:base                   R Documentation

    Sweep out Array Summaries

    Description:

         Return an array obtained from an input array by sweeping out a
         summary statistic.

    Usage:

         sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)

    Arguments:

           x: an array, including a matrix.

      MARGIN: a vector of indices giving the extent(s) of ‘x’ which
              correspond to ‘STATS’.  Where ‘x’ has named dimnames, it can
              be a character vector selecting dimension names.

       STATS: the summary statistic which is to be swept out.

         FUN: the function to be used to carry out the sweep.

    check.margin: logical.  If ‘TRUE’ (the default), warn if the length or
              dimensions of ‘STATS’ do not match the specified dimensions
              of ‘x’.  Set to ‘FALSE’ for a small speed gain when you know
              that dimensions match.

         ...: optional arguments to ‘FUN’.

    Details:

         ‘FUN’ is found by a call to ‘match.fun’.  As in the default,
         binary operators can be supplied if quoted or backquoted.

         ‘FUN’ should be a function of two arguments: it will be called
         with arguments ‘x’ and an array of the same dimensions generated
         from ‘STATS’ by ‘aperm’.

         The consistency check among ‘STATS’, ‘MARGIN’ and ‘x’ is stricter
         if ‘STATS’ is an array than if it is a vector.  In the vector
         case, some kinds of recycling are allowed without a warning.  Use
         ‘sweep(x, MARGIN, as.array(STATS))’ if ‘STATS’ is a vector and you
         want to be warned if any recycling occurs.

    Value:

         An array with the same shape as ‘x’, but with the summary
         statistics swept out.

adkinsrs commented 6 months ago

> so are you going to try to do this in python then?

Yes, the idea is to do the equivalent command using the Python pandas package, which is very much doable. I'm trying to see if scikit-learn has a normalize function for this particular normalization method since that would be even more efficient, but it may be overkill whereas writing pandas code should be efficient enough.
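
A possible pandas translation of the three R lines, shown as a sketch on a toy DataFrame (genes in rows, samples in columns). The variable names and values are illustrative only.

```python
import pandas as pd

# Toy expression matrix: 2 genes x 3 samples.
exprs = pd.DataFrame(
    {"s1": [1.0, 3.0], "s2": [2.0, 6.0], "s3": [10.0, 30.0]},
    index=["geneA", "geneB"],
)

cs = exprs.sum(axis=0)                      # R: cs = colSums(exprs)
cs = cs / cs.min()                          # R: cs = cs/min(cs)
normalized = exprs.div(cs, axis="columns")  # R: sweep(exprs, 2, cs, "/")
```

After this step every sample column sums to the smallest original column sum. For non-negative data, scikit-learn's `sklearn.preprocessing.normalize(X, norm="l1", axis=0)` should give the same result up to that constant factor, so plain pandas is likely sufficient here.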

carlocolantuoni commented 6 months ago

👍
