emdann / milopy

Python implementation of Milo for differential abundance testing on KNN graph
MIT License
60 stars 7 forks source link

Refactoring with MuData #26

Open emdann opened 2 years ago

emdann commented 2 years ago

The original cell-level AnnData object and the sample x nhoods AnnData are stored in a common MuData object with .mod cells and samples. Samples are in mdata["samples"].obs and nhoods are in mdata["samples"].var.

Notes

To do

Zethson commented 2 years ago

@emdann

@xinyuejohn started with the integration of milopy into pertpy here: https://github.com/theislab/pertpy/pull/165 He could also tackle my feedback that I posted here if it's fine with you? Then we don't need to duplicate the work and John and me can figure this out directly.

What do you think?

emdann commented 2 years ago

Hi @Zethson Apologies for the late react on this and thanks a lot for the feedback. I am a bit slumped with some urgent work at the moment so I am more than happy for @xinyuejohn to take over and implement these changes during integration!

xinyuejohn commented 2 years ago

Great!

A few general comments:

  1. You should consider adding all doc related files to the gitignore. They pollute pull requests. A CI job should build the docs and with for example RTD integration you can also preview the docs before merging. This does not matter for our pertpy plans, but wanted to write it nevertheless.
  2. A couple of tools of pertpy have a "load" function that prepares the input for the subsequent tool. I could see us having a load function that returns a MuData object (by creating or appending) that is ready for all downstream Milopy steps. ->
milopy = pt.tl.Milopy()
mdata = milopy.load(adata)  # adds compositional adata object and returns mdata object
# alternative
mdata = milopy.load(mdata_old, rna="rna")  # where rna refers to the existing RNA modality in the AnnData object
  1. I didn't add this comment everywhere, but we should ensure that the naming is consistent with unwritten muon/MuData usage rules -> "rna" and "compositional" ("samples" might be too unspecified). Likely should even name it "milo_compositional", because tascCODA might add its own "compositional"

As for your comment 2, the current workflow works like this:

  1. Use milo.make_nhoods(adata, ...) to get neighbourhoods, which saved in adata itself
  2. Use milo_mdata = milo.count_nhoods(adata, sample_col="sample") to get a MuData object.

Therefore, if I need to add a mdata = milo.load() function, it should substitute the original milo.count_nhoods function. What do you think? Do you think it follows the best practice? @Zethson

Zethson commented 2 years ago

Therefore, if I need to add a mdata = milo.load() function, it should substitute the original milo.count_nhoods function. What do you think? Do you think it follows the best practice? @Zethson

What I had in mind is that the load function only generates the actual object that we will use for downstream tasks and not that it does any computation already. It basically only sets up the MuData object.

Does it make sense?

xinyuejohn commented 2 years ago

Therefore, if I need to add a mdata = milo.load() function, it should substitute the original milo.count_nhoods function. What do you think? Do you think it follows the best practice? @Zethson

What I had in mind is that the load function only generates the actual object that we will use for downstream tasks and not that it does any computation already. It basically only sets up the MuData object.

Does it make sense?

Makes sense to me and it makes the whole pipeline more clear. In this way, the MuData object the load function generates only includes one 'rna' modality. And all the downstream functions take this mudata as the input.