Questions about your article Nat Commun 12, 3039 (2021) on archetype analysis

sjasws commented 1 year ago

Dear sir or madam, I am constructing an archetype model on my scRNA-seq data in the same way you did in "Evolution of core archetypal phenotypes in progressive high grade serous ovarian cancer. Nat Commun 12, 3039 (2021)." However, I ran into several problems when choosing the number of archetypes. I tried to reproduce the figures in your article but failed. Could you give me some guidance? Thanks!

Could you please tell me how many principal components (PCs) you used to construct the archetype model that produced figure S4A? I found that, in order to calculate the t-ratio, at least 7 PCs must be input to get a result for 8 archetypes, but with that input I cannot get the p-value < 0.01 shown in figure S4B.
I am trying to use the fit_pch() function of the ParetoTI package in R to construct the model. My input data are the results of a PC analysis: a matrix of several PCs for thousands of cells, with PCs in rows and cells in columns. I ran the following code: fit_pch(data, noc = as.integer(4), delta = 0). The output contains XS, S, C (0 for all C) ... t-ratio, with var_vert and total_var being NA. Thus, I cannot reproduce figure S4B in Nat Commun 12, 3039 (2021). Could you please point out what I am doing wrong? If possible, could you share the code that produces figure S4B?

I am not a native English speaker, so please point out any problems with my description and I will try to describe it in more detail. I look forward to your reply and appreciate any help you can give!

aritronath commented 1 year ago

Try using the transpose of the input PC matrix, t(data), with 5 to 10 PCs:

    arc_ks_t = k_fit_pch(data = t(integrated.10pcs), ks = 3:8, check_installed = TRUE,
                         bootstrap = FALSE, bootstrap_N = 10, sample_prop = 0.65,
                         bootstrap_type = "s", seed = 345, simplex = FALSE,
                         var_in_dims = FALSE, normalise_var = TRUE)

    plot_arc_var(arc_ks_t, type = "varexpl", point_size = 2, line_size = 1.5) +
      ylim(0, 1) + theme_classic(base_size = 8)
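Since the advice above hinges on matrix orientation, here is a minimal base R sketch for checking it before fitting. It assumes, as the t(integrated.10pcs) call suggests, that the embedding is stored as cells in rows and PCs in columns and that the fitting function wants PCs in rows and cells in columns; please confirm this against the ParetoTI documentation. The matrix `emb` and its dimensions are made up for illustration.

```r
# Hypothetical embedding: 1000 cells (rows) x 10 PCs (columns),
# the layout returned by e.g. prcomp()$x. The values are random placeholders.
set.seed(345)
emb <- matrix(rnorm(1000 * 10), nrow = 1000,
              dimnames = list(paste0("cell", 1:1000), paste0("PC", 1:10)))

# Keep the first 5 PCs and transpose so that PCs are rows and cells are columns
# (assumed to be the orientation the archetype fit expects).
n_pcs <- 5
pcs_by_cells <- t(emb[, seq_len(n_pcs)])

# Sanity checks: fail early if the orientation is not what we intended.
stopifnot(nrow(pcs_by_cells) == n_pcs, ncol(pcs_by_cells) == nrow(emb))
dim(pcs_by_cells)
```

A quick dim() check like this makes it obvious whether thousands of cells are being treated as dimensions by mistake.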

The p-value plot is generated by plotting the output of the randomise_fit_pch function. However, I believe the ParetoTI R package is not being developed further. For similar issues that I encountered, I recommend the MATLAB ParTI package to confirm the number of archetypes and get p-values with raw data (not PCs).
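To illustrate what a randomisation-based p-value means in general, here is a small base R sketch. It is a generic illustration, not the internal code of ParetoTI or ParTI: the function `empirical_p`, the +1 correction, the assumption that larger statistics indicate a better fit, and all numbers are made up for the example.

```r
# Generic empirical p-value from a randomisation null.
# Assumption: larger values of the statistic indicate a better polytope fit.
empirical_p <- function(observed, null_values) {
  # Fraction of null statistics at least as large as the observed one,
  # with +1 in numerator and denominator so p is never exactly 0.
  (1 + sum(null_values >= observed)) / (1 + length(null_values))
}

set.seed(42)
null_stat <- runif(1000, min = 0.2, max = 0.6)  # made-up statistics from shuffled data
obs_stat <- 0.9                                  # made-up statistic from the real data

# With 1000 randomisations and no null value exceeding the observation:
p_value <- empirical_p(obs_stat, null_stat)

# With only 10 randomisations the smallest attainable p-value is 1/11,
# so p < 0.01 cannot be reached however good the fit is.
smallest_p_10 <- empirical_p(obs_stat, null_stat[1:10])

c(p_1000 = p_value, p_10 = smallest_p_10)
```

Under this convention the number of randomisations sets a floor on the p-value, so it is worth checking how many randomisations were run before comparing against a threshold such as 0.01.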

sjasws commented 1 year ago

Thank you for your reply! I did run k_fit_pch(), plot_arc_var() and randomise_fit_pch() after reading your available code, and got results.

However, I found that inputting different numbers of PCs gives different p-values for the t-ratio and var_expl after running randomise_fit_pch(). I believe the key is choosing an appropriate number of PCs to input, but how should I decide this? Should it be based on plot_arc_var(arc_ks_t, type = "varexpl", point_size = 2, line_size = 1.5) + ylim(0, 1) + theme_classic(base_size = 8)? Isn't this function used to determine the number of archetypes?
I also used the MATLAB ParTI package to confirm the number of archetypes and get p-values with raw data integrated and normalized by RPCA (you used CCA-normalized count data). But I found that the integrated, normalized count data contain only the 2000 highly variable genes, so they do not seem to cover the complete feature set of the raw data. How did you solve this? Thank you again for your help!

aritronath commented 1 year ago

These are important questions. I would suggest using a number of PCs that is most appropriate for the data (e.g. based on an elbow plot). The highly variable genes used in RPCA likely contribute the most to your PCs so should be appropriate for use.

sjasws commented 1 year ago

My question now is how to generate an elbow plot in the ParetoTI R package to guide the selection of an appropriate number of PCs. There does not seem to be a function for this in the package, so it is hard to decide. I noticed that you used the first five PCs in your study, and I think that would be fine. But when I try to reproduce figure S4A and B, R reports a warning that the t-ratio cannot be calculated when running k_fit_pch() and fit_pch() for more than 7 archetypes (less than or equal to 6 is feasible), and the same happens when running randomise_fit_pch() to get the p-value. This is the root cause of my frustration. I really can't find an appropriate method or basis for deciding the number of PCs to input. Thank you!

aritronath commented 1 year ago

This can be done using base R functions to create scree/elbow plots or using other packages. Seurat has an ElbowPlot function for this purpose.
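A minimal sketch of the base R route, assuming a cells-by-genes expression matrix; the simulated matrix `expr` below is a made-up stand-in, not data from the paper:

```r
# Simulated stand-in for a cells x genes matrix with 3 strong latent factors.
set.seed(1)
n_cells <- 300
n_genes <- 50
latent   <- matrix(rnorm(n_cells * 3), n_cells, 3)
loadings <- matrix(rnorm(3 * n_genes), 3, n_genes)
noise    <- matrix(rnorm(n_cells * n_genes, sd = 0.5), n_cells, n_genes)
expr     <- latent %*% loadings + noise

# PCA with centring and scaling of each gene.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scree/elbow plot for the first 20 PCs: look for the point where the curve flattens.
n_show <- 20
plot(seq_len(n_show), var_explained[seq_len(n_show)], type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained",
     main = "Scree plot")
```

In this simulated case the curve should flatten after the third PC, because only three latent factors were used to generate the data; with real data the elbow is often less sharp and remains a judgment call.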

sjasws commented 1 year ago

Using the ElbowPlot function in Seurat to determine the number of PCs works fine, thank you. By the way, the elbow plots generated by the MATLAB ParTI package suggested using 3 PCs to construct 4 archetypes. I followed that advice but ended up with a p-value of 0.16 for the t-ratio, which seems to indicate a poor fit to the data. I found that increasing the number of archetypes decreases the p-value, but this seems to violate the method for selecting the number of archetypes. How should I trade off between ESV and p-value?
