Can I integrate bulk omics data with GLUE?

konsta-kukkonen commented 1 year ago

Hello and thank you for very interesting software!

The GLUE framework seems to be designed with single-cell omics data in mind. Is it possible to integrate bulk ATAC and RNA sequencing data with GLUE? I'm not very familiar with machine learning, but I understand that the training requires a lot of data, and you mention in the paper that below 2000 cells the alignment error starts to increase. Is it a lost cause to try to use GLUE with this type of data? I have multiple cell lines, treatments, and replicates from each condition, but not nearly enough samples to resemble anything like a single-cell experiment.

Apologies for my laziness, I haven't tried pre-processing the data into the form that GLUE takes as input. I haven't used Python much and wanted to first get your general opinion.

Best, -Konsta

Jeff1995 commented 1 year ago

Hi Konsta! Thanks for your interest in GLUE. You are right, GLUE does not work well with small sample sizes (i.e., <2,000) because it is an over-parameterized neural network model and definitely needs a reasonable number of samples to train. Nevertheless, there are certain things you can try to make it more small-sample-friendly:

Decrease the number of features (e.g., genes, peaks) used as input. Bulk data probably don't have that much subtle variation to distinguish when it comes to samples, so maybe a small set of well-selected features would be sufficient.
Decrease the input preprocessing dimensionality. The rationale is the same. Bulk data typically don't need 50 PCs to capture the majority of data variation. Something like 5 PCs would be more appropriate?
Scale down the model architecture by reducing the number of hidden layers (e.g., to 1) and the hidden layer dimension (e.g., to 32).

These might increase the chance of getting a reasonably trained model. Let me know if there are further issues, or if it actually works (I'm also interested 😀)
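As a rough illustration of the first suggestion, here is a minimal sketch of picking a small set of the most variable features from a bulk matrix. It uses only the Python standard library; the function `top_variable_features`, the gene names, and the toy values are all hypothetical and made up for illustration. With real AnnData objects you would more likely rely on an existing routine such as scanpy's `highly_variable_genes`, so treat this only as a picture of the idea.

```python
from statistics import pvariance


def top_variable_features(matrix, feature_names, n_top):
    """Return the names of the n_top features with the highest variance.

    matrix is a list of samples, each a list of feature values
    (hypothetical toy layout, not the AnnData format GLUE uses).
    """
    # Transpose so that each entry holds one feature across all samples.
    columns = list(zip(*matrix))
    variances = {
        name: pvariance(column) for name, column in zip(feature_names, columns)
    }
    # Rank features from most to least variable and keep the first n_top.
    ranked = sorted(feature_names, key=lambda name: variances[name], reverse=True)
    return ranked[:n_top]


# Toy bulk matrix: 4 samples x 3 genes (made-up numbers).
expr = [
    [5.0, 1.0, 9.0],
    [5.1, 4.0, 2.0],
    [4.9, 1.2, 9.5],
    [5.0, 3.8, 1.5],
]
genes = ["GENE_A", "GENE_B", "GENE_C"]

selected = top_variable_features(expr, genes, n_top=2)
print(selected)
```

Here `GENE_A` barely changes between samples, so it is the one dropped. The same reasoning applies to the number of PCs: with only a handful of samples, a few components are likely to capture most of the variation.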

konsta-kukkonen commented 1 year ago

Thank you for the prompt response, Zhi-Jie. I will try your suggestions and report whether it works! :)

konsta-kukkonen commented 1 year ago

I'm slowly moving on with the analysis. I'm wondering how the downscaled parameters can be passed to the scglue.models.fit_SCGLUE() model fitting function.

I tried to define a new model by:

my_mod=scglue.models.scglue.SCGLUEModel(adatas={"rna":rna, "atac":atac}, vertices=guidance.nodes, latent_dim=5, h_depth=1, h_dim=32, dropout=0.2, shared_batches=False, random_seed=0)
my_mod.compile()

which worked. But when I pass it as the model parameter of the fit_SCGLUE function, it raises an error:

Traceback (most recent call last):
  File "./Model_training.py", line 154, in <module>
    glue = scglue.models.fit_SCGLUE(
  File "path/to/scglue/models/__init__.py", line 204, in fit_SCGLUE
    pretrain = model(adatas, sorted(graph.nodes), **pretrain_init_kws)
TypeError: 'SCGLUEModel' object is not callable

I understand that the correct type of the "model" object would be "type", as described on the Read the Docs documentation page, and the object I created is of "scglue.models.scglue.SCGLUEModel" type. How do I make a model object of the correct type with my selected parameters?

It's possible I'm doing something that is obviously wrong, but as I said, I'm not very experienced with Python, and it has been a learning process even to get to this point. 😅

Thanks, -Konsta

Jeff1995 commented 1 year ago

Hi Konsta! As you have found out, the model argument in the fit_SCGLUE function only accepts a model type, rather than an already constructed model object.

To tinker with the model structure, you can specify the model construction arguments using the init_kws argument, which is passed on to the construction of model objects inside the fit_SCGLUE function.

For the above example, you may use something like this:

my_mod = scglue.models.fit_SCGLUE(
    {"rna": rna, "atac": atac},
    guidance,
    # Model construction arguments, forwarded to the model class
    init_kws=dict(
        latent_dim=5,
        h_depth=1,
        h_dim=32,
        dropout=0.2,
        shared_batches=False,
        random_seed=0,
    ),
)
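To show why the model argument has to be a class and how init_kws reaches the constructor, here is a toy stand-in written in plain Python. `ToyModel` and `fit_toy` are hypothetical names invented for this sketch, and the default values are placeholders. This is not the actual scglue implementation; it only mirrors the call pattern visible in the traceback above.

```python
class ToyModel:
    """Hypothetical stand-in for a model class (not the real SCGLUEModel)."""

    def __init__(self, adatas, vertices, latent_dim=50, h_depth=2, h_dim=256):
        self.adatas = adatas
        self.vertices = vertices
        self.latent_dim = latent_dim
        self.h_depth = h_depth
        self.h_dim = h_dim


def fit_toy(adatas, vertices, model=ToyModel, init_kws=None):
    """Hypothetical fit function that constructs the model internally."""
    init_kws = init_kws or {}
    # `model` is called like a constructor here, so it must be a class.
    # The entries of init_kws become keyword arguments of that constructor.
    return model(adatas, sorted(vertices), **init_kws)


# Passing the class plus construction arguments works.
small = fit_toy(
    {"rna": "rna-data", "atac": "atac-data"},
    ["g2", "g1"],
    init_kws=dict(latent_dim=5, h_depth=1, h_dim=32),
)

# Passing an already constructed object fails, because an instance
# cannot be called like a class.
try:
    fit_toy({}, [], model=small)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(small.latent_dim, small.h_depth, small.h_dim)
print(error_message)
```

The second call fails with the same kind of TypeError as in the traceback: the fit function tries to call the object it receives, which only works for a class.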

konsta-kukkonen commented 1 year ago

Oh, so those parameters should be passed to init_kws, not to model. Got it! Thank you
