AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal
Other
5 stars 14 forks source link

Celltype annotation for non-ETP T-ALL (SCPCP000003) #767

Closed UTSouthwesternDSSR closed 2 days ago

UTSouthwesternDSSR commented 1 week ago

Purpose/implementation Section

In this PR section, we will initialize an analysis module skeleton for non-ETP T-ALL dataset in SCPCP000003 and to provide scripts for processing the provided sce objects for review.

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

This PR files scripts for processing the provided sce objects using self-assembling manifold (SAM) algorithm as a soft feature selection strategy for better separation of more homogeneous populations. The saved rds objects will be used in the following analysis in my next PR.

Briefly describe the general approach you took to achieve this goal.

A raw count is provided to SAM algorithm with default parameters, where it will preprocess the data by log-normalizing the counts to the median total read count per cell and retaining the genes that expressed in less than 96% of total cells. Then, it will perform feature selection and dimensionality reduction, using 3000 genes selected based on the SAM weights prior to PCA, and with 150 principal components and 20 nearest neighbors to generate UMAP embedding. Leiden clustering is performed to provide clusterIDs. This workflow is applied to all 11 samples in the dataset.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Cell type annotation and identification of tumor cells will be filed soon in other PRs.

Results

The scripts/00-01_processing_rds.R will generate rds files in scratch/ and umap plots showing leiden clustering results in plot/00-01_processing_rds/.

What is the name of your results bucket on S3?

s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots/00-01_processing_rds

What types of results does your code produce (e.g., table, figure)?

Intermediate rds files are saved in scratch/ for further analysis, and the umap plots showing leiden clustering are saved in plot/00-01_processing_rds/.

What is your summary of the results?

The intermediate rds files are ready for cell type annotation and subsequently for tumor cells identification.

Provide directions for reviewers

In this section, tell reviewers what kind of feedback you are looking for. This information will help guide their review.

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

For the first PR, we would like to check if our way to setup the module skeleton, computing environment, and documentation has met the need.

Is there anything that you want to discuss further?

I failed to push the environment.yml file to the github, since it gives the warning/error of 3 leaks found in the pre-commit stage, as shown in the image below. These are the lines in environment.yml causing error:

line 99: - keyring=25.3.0=pyha804496_0
line 249: - r-rstudioapi=0.16.0=r44hc72bb7e_1

I generated the environment.yml via conda env export -n openscpca > environment.yml. I am wondering if you have any suggestion in solving this?

Screenshot 2024-09-19 at 12 18 18 PM

Thank you for reviewing and much appreciation for any suggestion provided!

Author checklists

Check all those that apply. Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

UTSouthwesternDSSR commented 1 week ago

Hi,

I am sorry that I accidentally closed the pull request. Regarding the SeuratDisk package, I managed to avoid it by providing the matrix file to SAM and using schard to read in h5ad file:

sam = samalg$SAM(counts = c(r_to_py(t(seu@assays[["RNA"]]@counts)), r_to_py(as.array(rownames(seu))), 
                             r_to_py(as.array(colnames(seu)))))
  sam$preprocess_data()
  sam$run()
  sam$clustering(method = "leiden") #leiden clustering is the default algorithm in SAM
  sam$save_anndata(paste0("sam.",sampleID[i],".h5ad"))
  final.obj <- schard::h5ad2seurat(paste0("sam.",sampleID[i],".h5ad"))

In general, SAM outputs the gene weights, nearest neighbor matrix, and a 2D projection. I need the 2D projection for the better separation of cell clusters. I think it will be better to just keep all the output provided by SAM and just remove them later if I don't need it. As of now, I saved the SAM output in anndata format, and then read them back in and save them as final seurat rds file. I deleted the anndata format due to disk storage issue. But I would need more storage for the other 30 ETP T-ALL samples.

jashapiro commented 1 week ago

No worries about the accidental closing!

I deleted the anndata format due to disk storage issue. But I would need more storage for the other 30 ETP T-ALL samples.

If you need more storage on an AWS Lightsail system, you will probably want to follow the directions here to attach additional storage: https://openscpca.readthedocs.io/en/latest/aws/lsfr/working-with-volumes/ One limitation is that you can only attach 128GB per volume, but you can attach multiple volumes if needed.

Another approach that I might recommend is to refactor your scripts a bit to work on a single sample at a time: This will allow you to only keep on disk the files for a single sample, and will also help us out in the future when we adapt the module for our workflow system.

jashapiro commented 1 week ago

In general, SAM outputs the gene weights, nearest neighbor matrix, and a 2D projection. I need the 2D projection for the better separation of cell clusters. I think it will be better to just keep all the output provided by SAM and just remove them later if I don't need it. As of now, I saved the SAM output in anndata format, and then read them back in and save them as final seurat rds file.

I guess I want to confirm that the SAM output still includes the original data and possibly previous normalizations and projections. For context, I am usually quite skeptical of 2D projections as a metric of cell separation; I would personally want to confirm the cluster separation based on a more direct measure of distance. But I know this is a thorny problem! I want to hold off a bit for now on this kind of evaluation, but I want us to be able to look at it later, which is the main reason that I am asking about what the output is. Either way, this should not be too much of an issue either way, as we can always go back to the original input data as needed.

UTSouthwesternDSSR commented 1 week ago

Actually SAM output doesn't retain the original data. I only provide the raw count, geneID, and cellID. Their log-normalization is different from the previous normalization provided, and I save the SAM log-normalization values and their projections in the rds file.

jashapiro commented 5 days ago

Actually, I have another concern regarding SAM and ScType. Each time I run SAM, the clustering output changes, although the seed is set as 0 (by default) in sam$run(). The log is shown as image below. Screenshot 2024-09-24 at 5 12 45 PM

Since ScType annotates by cluster, so the cell annotation may change if they are found in another cluster, with different annotation. This is actually quite likely to happen, especially for those cells that don't have clear markers expressed. Eg. Depending on the clustering results, they may be annotated as Unknown, if the score of a cluster is less than a quarter of #cells in that cluster.

This seems like it may be a pretty significant concern! I was able to find this issue from the SAM github repository, which suggests that the problem lies in the distance calculations, which can not be seeded with the default setting. Changing to sam.run(distance='correlation', seed=0) may make the calculations consistent.

I think this limitation is actually quite out of date, and the hsnw library now supports seeding, but fixing that would require changes to samalg, and the package unfortunately seems not to be regularly updated. I might consider forking the package and testing whether the implementation can be updated to support the newer versions of the library, but that would require a fair amount of effort!

jashapiro commented 5 days ago

This seems like it may be a pretty significant concern! I was able to find this issue from the SAM github repository, which suggests that the problem lies in the distance calculations, which can not be seeded with the default setting. Changing to sam.run(distance='correlation', seed=0) may make the calculations consistent.

One other thought I had was whether you need the UMAP visualizations and clusters that are calculated by SAM? Can you recalculate those with Seurat? The reason I ask is that if the cell assignments in scType are sensitive to the clustering, you may well want to explore how different clustering parameters affect the quality of the assignments.

If you want to go down that route, I would probably start by breaking up the run_sam function to make it a bit more modular. One function could take the SCE object, pass it to SAM, and write the h5ad file, then a separate function would read the h5ad file and perform downstream steps like clustering. This will make it easier to modify those steps for exploration, assuming that the data required to pass to the clustering algorithm is properly read in from the h5ad file.

UTSouthwesternDSSR commented 3 days ago

Sure, I will make changes to the .csv file as suggested. ScType will provide label by returning the cellName column. As for the shortName column, it was used in another part of ScType, which I am not using. Just to check with you, which label would you all prefer for the cell type annotation, or it doesn't really matter?

jashapiro commented 3 days ago

ScType will provide label by returning the cellName column. As for the shortName column, it was used in another part of ScType, which I am not using. Just to check with you, which label would you all prefer for the cell type annotation, or it doesn't really matter?

I don't think it really matters, as long as the names are clear. At some point we will almost certainly have to do some harmonization across different references, but that will be well down the line (and not part of this module).

UTSouthwesternDSSR commented 3 days ago

Yes, I have completed all the changes that I have intended for this part (including editing the .csv file), and I am ready for a review. I will have to re-request review for it to be merged into the main branch, right?

Actually I have started working on the second part too. I have some technical questions, it seems like the Rstudio will crash when the memory reaches about 23GB, and the whole virtual machine freezes when I try to run CopyKat. I plan to download the rds file from S3 bucket to my local server for analysis with ./sync-results.py cell-type-nonETP-ALL-03 --bucket researcher-650251722463-us-east-2 --download --profile my-dev-profile, but it doesn't work. It is looking for git directory. I am wondering if I should clone the repository folder, or I will have to wait until my pull request get merged into the main branch?