Integrate RMS data and add param for choosing projects to integrate

allyhawkins commented 2 years ago

Closes #180 ⚠️ Stacked on #178

Here I made the adjustments needed to integrate the RMS project after adding cell type annotations to the SCE objects, as noted in #178. First, I added the RMS libraries that we want to use for integration to scpca-processed-libraries.tsv. For now we are starting with a subset of libraries from the project rather than integrating the entire dataset. Because we are considering using this dataset for advanced single-cell (and therefore potentially doing DE between subdiagnosis types), I included libraries from both ARMS and ERMS to start. Depending on how this looks, I might pare this down to three libraries from each group, or create separate integrated datasets for just ARMS and just ERMS if integration looks suspicious when using all subtypes.

I also made a few changes to help us as we start testing things with ScPCA libraries, and added the ability to specify which projects to run through the workflow rather than running all projects by default. To do this I added a new param to the config file, groups_to_integrate, which is set to "All" by default. If it's set to "All", then all the projects present in the pep file are considered; otherwise a list of project groups can be provided and only those projects are processed. This should allow us to run the workflow with the same metadata file while specifying only the RMS project or only the Gawad project, similar to how we run scpca-nf by project.
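As a rough illustration of the selection logic described above (not the code in this PR), the behavior of groups_to_integrate could look like the following. The function name `select_groups` and its `available_groups` argument are hypothetical; only the param name and the "All" default come from the description.

```python
def select_groups(available_groups, groups_to_integrate="All"):
    """Return the project groups to run through the workflow.

    available_groups: project names found in the metadata (hypothetical input).
    groups_to_integrate: "All", a single project name, or a list of names.
    """
    if groups_to_integrate == "All":
        # Default: process every project present in the metadata
        return list(available_groups)

    # Accept a single project name as well as a list of names
    if isinstance(groups_to_integrate, str):
        requested = [groups_to_integrate]
    else:
        requested = list(groups_to_integrate)

    # Fail early if a requested project is not in the metadata
    missing = [g for g in requested if g not in available_groups]
    if missing:
        raise ValueError(f"Groups not found in metadata: {missing}")
    return requested
```

Failing early on an unknown group is a design choice in this sketch; silently skipping unknown groups would be the alternative.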

To run the updated snakefile with the RMS data only I ran the following:

snakemake -c1 --use-conda --configfile config/config-scpca.yaml --config groups_to_integrate="SCPCP000005" sce_dir="results/scpca/celltype_sce"

If you wanted to do multiple groups you would change to use groups_to_integrate='["SCPCP000001","SCPCP000005"]'.
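The two command-line forms above (a bare project name and a quoted list) suggest the value may need normalizing before use. The sketch below assumes the value can arrive either as a plain string or as an already-parsed list; how Snakemake itself parses --config values is not shown in this thread, and `normalize_groups` is a hypothetical helper.

```python
import json


def normalize_groups(value):
    """Coerce a groups_to_integrate value into a list of project names."""
    # Already a list (for example, parsed from the config file)
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]

    text = str(value).strip()
    # A string such as '["SCPCP000001","SCPCP000005"]' is treated as a JSON list
    if text.startswith("["):
        return [str(v) for v in json.loads(text)]

    # Otherwise treat the value as a single project name (or "All")
    return [text]
```

Note that the quotes around each project name inside the list are required for the JSON parse in this sketch to succeed.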

I am leaving this in a draft state because it runs through the workflow but then fails because of the celltype issue with ASW noted in #182. Since that is almost done, I'll wait until it's merged and then re-run with those updates.

Another thing to note is that the input sce directory is different for the RMS data because we have a celltype_sce directory that has the SCE objects with the celltype column. One thought I had was to add the input SCE directory to the metadata file so that if there are cases where projects have different input directories that can be stored in the metadata file rather than specifying that at the command line. I didn't make that change yet, but if other people are in agreement I can make that change.

allyhawkins commented 2 years ago

I went ahead and added max_celltypes to the config file so that it can be altered at the command line when running this. I did this mainly because the top cell types identified in the RMS data were all tumor subtypes, and I wanted to see what some of the other cell types looked like. When I ran it I expanded to the top 10 cell types.

I also had to adjust how the batches are labeled, and I created a batch_colors variable based on the number of batches being used. This is because there were 10 batches in this dataset, and without colors being provided to the plots it was using a color palette that only had 9 colors. We might also need to add some checks to make sure that the number of batches being plotted isn't greater than the number of colors available in the palette being used.
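A check of the kind suggested above could be as small as the following. This is a Python sketch of the idea only; the report template itself is not shown in this thread, and `check_palette_size` is a hypothetical name.

```python
def check_palette_size(batches, palette):
    """Raise an error if there are more distinct batches than palette colors.

    batches: iterable of batch labels, one per cell or library.
    palette: sequence of color values available for plotting.
    Returns the number of distinct batches when the palette is large enough.
    """
    n_batches = len(set(batches))
    if n_batches > len(palette):
        raise ValueError(
            f"{n_batches} batches but only {len(palette)} colors in the palette"
        )
    return n_batches
```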

Also including the current report for the RMS data that was created and marking this ready for review. SCPCP000005_integration_report.html.zip

sjspielman commented 2 years ago

Hey @allyhawkins, getting to this today! A couple things first..

I think it's worth quickly merging main into these two branches (target rms-celltypes and the present branch rms-integrate) since a couple things are going to be influenced, and there's potential for conflicts. That said, rms-celltypes is approved (caveat: only by me though!) so it could be merged into main and the target updated here.
For the batch_colors, I'm assuming the ASW plot is what broke, right? It defaults to Okabe_Ito if colors aren't provided and that palette is capped at n=8.

allyhawkins commented 2 years ago

> I think it's worth quickly merging main into these two branches (target rms-celltypes and the present branch rms-integrate) since a couple things are going to be influenced, and there's potential for conflicts. That said, rms-celltypes is approved (caveat: only by me though!) so it could be merged into main and the target updated here.

I did merge main into rms-celltypes and then merged that branch into this one since they are stacked, so it should be up to date since #182 has been added.

> For the batch_colors, I'm assuming the ASW plot is what broke, right? It defaults to Okabe_Ito if colors aren't provided and that palette is capped at n=8.

Yes, the ASW plot is what broke because of that default. I think I want to keep that default in the function when no colors are provided, but in the template I went ahead and set the batch colors using a larger palette, the rainbow palette. We could get fancier with that, and for datasets with < 8 libraries we could stick with Okabe_Ito. I think the key will just be making sure we address the colors in the template based on how many batches are present in the datasets being integrated.
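The "fancier" option mentioned here (Okabe-Ito for small datasets, a larger palette otherwise) might be sketched as below. This is an illustration in Python rather than the template's own code; it assumes the common 8-color Okabe-Ito set and substitutes evenly spaced HSV hues for the rainbow palette, and the `batch_colors` function name is hypothetical.

```python
import colorsys

# The 8-color Okabe-Ito set (assumed here; some implementations add a gray)
OKABE_ITO = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]


def batch_colors(batch_ids):
    """Map each distinct batch to a color, switching palettes by batch count."""
    unique = sorted(set(batch_ids))
    n = len(unique)
    if n <= len(OKABE_ITO):
        # Few enough batches: keep the colorblind-friendly palette
        colors = OKABE_ITO[:n]
    else:
        # Too many batches: fall back to n evenly spaced hues
        colors = []
        for i in range(n):
            r, g, b = colorsys.hsv_to_rgb(i / n, 1.0, 1.0)
            colors.append(
                "#{:02X}{:02X}{:02X}".format(
                    round(r * 255), round(g * 255), round(b * 255)
                )
            )
    return dict(zip(unique, colors))
```

With 10 batches, as in the RMS subset, this takes the fallback branch; evenly spaced hues are not colorblind-safe, which is the trade-off for supporting more batches.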

sjspielman commented 2 years ago

> I did merge main into rms-celltypes and then merged that branch into this one since they are stacked, so it should be up to date since https://github.com/AlexsLemonade/sc-data-integration/pull/182 has been added.

👍 sorry i missed that!! thanks

allyhawkins commented 2 years ago

I think you may be seeing an error because you are trying to use the same input SCE directory, results/scpca/celltype_sce, for both projects. SCPCP000005 is the only project present in that directory, while SCPCP000001 doesn't have any SCEs in that directory (or shouldn't).

Edit: I'm actually reading the error message now and seeing that there's an issue with reading in the argument, so I'm looking into that. The command you are using should also fail for the reason I stated above, though.

allyhawkins commented 2 years ago

Looks like the error was just a missing pair of quotes! Should be good to retry @sjspielman

allyhawkins commented 2 years ago

> I've been going back and forth on this myself, and I've decided I prefer the metadata file approach because this very concern would have gotten me earlier when I tried to run multiple samples, were it not for the missing quotes. If I'm understanding correctly, this means sce_dir will no longer be in the config nor will it be an option in 02-prepare-merged-sce.R since it will be a column in the metadata file, and just pulled from there. I'd be explicit with naming that column as well so that it's super distinguishable from the folder_structure column.

I went ahead and made that change, explicitly adding an integration_input_dir column to all of the project metadata files that are used as input to the workflow. I then removed sce_dir as a param in the config file and as an option in the merging script, and instead grab the directories from the metadata file prior to searching for the SCE files.
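To illustrate the idea of pulling input directories from the metadata instead of the command line, here is a small Python sketch. Only the integration_input_dir column name and the results/scpca/celltype_sce path come from this thread; the other column names, the library IDs, and the SCPCP000001 path are placeholders, and the real merging script is written in R.

```python
import csv
import io

# Placeholder metadata; library IDs and the SCPCP000001 directory are made up
EXAMPLE_METADATA = (
    "library_id\tproject_name\tintegration_input_dir\n"
    "library_A\tSCPCP000005\tresults/scpca/celltype_sce\n"
    "library_B\tSCPCP000005\tresults/scpca/celltype_sce\n"
    "library_C\tSCPCP000001\tpath/to/other_sce\n"
)


def input_dirs_by_project(metadata_tsv,
                          project_col="project_name",
                          dir_col="integration_input_dir"):
    """Collect the set of input directories listed for each project."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    dirs = {}
    for row in reader:
        dirs.setdefault(row[project_col], set()).add(row[dir_col])
    return dirs
```

Because each project carries its own directory, running two projects together no longer requires them to share a single sce_dir.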

allyhawkins commented 2 years ago

> Other scripts that I think need updating now too, but please check me!! We pass in a lot of directories that I think are no longer needed because they are now in the metadata file:

I think I want to keep these in for now because they still work the way they are. We can file an issue and make an overall change to those scripts separately.

allyhawkins commented 2 years ago

I made some changes in the main README to reflect the metadata files that we have now (not just the HCA files), and I also addressed your comment. I chose to file an issue about changing the other scripts because this PR is getting long and that change isn't required for things to work; it would just be nice to have to keep things in sync.

AlexsLemonade / sc-data-integration

Integrate RMS data and add param for choosing projects to integrate #183