problem - Githubissues

Hananebh commented 1 year ago

Hello, I am working with infercnv version 1.14 and have a problem with the dendrogram: the final result is completely different from the one I obtained with infercnv version 1.9.

I would like to know what difference there is between the two versions and why I have a problem with the new dendrogram? Your help is highly appreciated! Thank you.

GeorgescuC commented 1 year ago

Hi @Hananebh ,

I think the dendrogram you are seeing in the new version is due to each cell being assigned its own subcluster by the Leiden clustering, but I would need to know what options you used in both runs to be confident.

The older version of infercnv used a different library for the Leiden clustering, one that required Python and is no longer maintained. The library used now works natively in R and adds more options. One thing that changed with the library is the default scoring function, which went from "modularity" to "CPM". That is theoretically an improvement, but the optimal 'resolution' parameter changed with it. A good start is to use 0.05 for resolution with "CPM" where you used 1 for resolution with "modularity" (the only mode available in 1.9), then tweak the number as needed. Alternatively, you can change the scoring function ("leiden_function" argument) back to the old "modularity" one. There might still be differences because of the different implementation, and because we now also run a PCA first ("leiden_method" argument; the old behavior was the "simple" option).
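As a rough sketch of the two options described above, the snippet below builds the Leiden-related arguments and passes them to infercnv::run. The helper leiden_options is hypothetical (it is not part of infercnv), and the cutoff and out_dir values are placeholders; only the argument names leiden_function and leiden_resolution and the values 0.05 and 1 come from this thread.

```r
# Hypothetical helper (not part of infercnv): returns the Leiden-related
# arguments discussed above for a given scoring function.
leiden_options <- function(scoring = c("CPM", "modularity")) {
  scoring <- match.arg(scoring)
  # 0.05 is the suggested starting point for "CPM"; 1 was used with "modularity".
  resolution <- if (scoring == "CPM") 0.05 else 1
  list(leiden_function = scoring, leiden_resolution = resolution)
}

opts <- leiden_options("CPM")
str(opts)

# Only attempt the run when infercnv is installed and an object already exists.
# `infercnv_obj` is assumed to have been created beforehand; cutoff and out_dir
# are placeholder values.
if (requireNamespace("infercnv", quietly = TRUE) && exists("infercnv_obj")) {
  run_args <- c(
    list(infercnv_obj, cutoff = 0.1, out_dir = "sampleoutput_cpm"),
    opts
  )
  infercnv_obj_cpm <- do.call(infercnv::run, run_args)
}
```

The resolution returned for "CPM" is only a starting point and will likely need tuning per dataset.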

Regards, Christophe.

Hananebh commented 1 year ago

Hi @GeorgescuC, thank you for your answer and for your useful explanations. The code I used for both versions is the same:

infercnv_obj = infercnv::run(
  infercnv_obj,
  cutoff=0.1,
  out_dir="sampleoutput",
  cluster_by_groups=TRUE,
  denoise=TRUE,
  HMM=TRUE,
  plot_steps=FALSE,
  num_threads=8
  )

[image: Capture d’écran 2023-01-14 à 13 39 52]

How can I correct the dendrogram, please? I want to keep the new result, but with a good dendrogram. Thanks, Regards

nigiord commented 1 year ago

Hi @Hananebh, I have the same issue with the latest version of inferCNV (1.14). I think you could try the following parameters to get something similar to how the clustering was previously done in 1.9:

infercnv::run(
  infercnv_obj,
  […]
  k_nn=30,  # 1.9.1 default param
  leiden_resolution = 1,  # 1.9.1 default param
  leiden_method = "simple",  # 1.9.1 default param
  leiden_function = "modularity",  # 1.9.1 default param
  […]
  )

The results are not exactly the same, but at least they are rather similar.

I initially tried reducing the resolution parameter with the new default PCA+CPM approach, as suggested by @GeorgescuC, but I needed really low values (down to 0.0005) to get at least some groups that are not singletons, and the original subgroup annotations were poorly clustered, so I don’t think that’s the way to go.

Cheers, Nils

GeorgescuC commented 1 year ago

Hi @Hananebh and @nigiord ,

The settings that @nigiord posted should indeed give the closest results to previous versions of infercnv, which used the Python implementation of the Leiden algorithm instead of igraph's. One more setting added in 1.14 can affect the subclustering, and you might want to change it: the masking of genes that have a z score over 0.8 (the default) in references, controlled by the z_score_filter option. This masking is done to ignore genes that show strong signal in references, as is common for MHC genes on chromosome 6, for example. Looking at your plot, however, it might mask more genes than expected, because the cluster of cells at the top of the references looks different from the rest and has a residual expression pattern very similar to your observations.

Regards, Christophe.

deevdevil88 commented 1 year ago

Hi @GeorgescuC, I have a question following on from this. I also faced a similar problem with the Leiden subclustering, which I fixed thanks to the settings @nigiord posted. Since you mentioned the z score filter option, I am wondering when you would decide to increase the z_score_filter value; or rather, so that it does not mask more genes than expected, is a z score of 2 a reasonable filter? My final plot after improving the Leiden clustering looks like this. Our initial thought was that we still have some "potential tumour cells" in the reference, but based on what you just said it might be the z_score_filter setting. I don't know how to distinguish one from the other. Your opinion is appreciated, thanks!

[image: Screenshot 2023-02-16 at 10 45 10]

Best, Devika

GeorgescuC commented 1 year ago

Hi @deevdevil88 ,

The z score filtering does not affect in any way the residual expression values shown in the figures. The z score only masks genes when calculating the nearest neighbors for cells (either directly on the residual expression, or on a PCA of the residual expression) at the start of the Leiden clustering, and nothing else. The downstream effect that can be visible is on the HMM predictions, as the HMM uses subclusters to combine information from clonal populations of cells, so inaccurate or overly fragmented subclusters would reduce HMM prediction accuracy.
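To make the point above concrete, here is a toy base-R sketch of masking genes only for the neighbor search. It does not reproduce infercnv's internals: the matrix values are random, the per-gene scores in gene_score are made up, and how infercnv actually computes its z scores is not shown. Only the 0.8 threshold comes from this thread.

```r
set.seed(1)
# Toy residual expression matrix: 6 cells x 5 genes (made-up values).
residuals <- matrix(
  rnorm(30), nrow = 6,
  dimnames = list(paste0("cell", 1:6), paste0("gene", 1:5))
)

# Hypothetical per-gene scores in the references; how infercnv computes
# them is not reproduced here.
gene_score <- c(gene1 = 0.2, gene2 = 1.4, gene3 = 0.5, gene4 = 0.9, gene5 = 0.1)
threshold <- 0.8  # default value mentioned in the thread

# Genes scoring over the threshold are dropped for the neighbor search only.
keep <- gene_score <= threshold
masked <- residuals[, keep, drop = FALSE]

# Cell-to-cell distances (the input to a nearest-neighbor graph) use only
# the unmasked genes.
cell_distances <- dist(masked)

# The matrix that would be plotted is untouched by the masking.
dim(residuals)
dim(masked)
```

In this sketch the masking changes which cells end up as neighbors, and therefore the subclusters, while the plotted residual values stay the same.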

In the figure you shared, there are 2 things I notice:

- Some extremely small groups of reference-annotated cells which, unless they only look small because of compression on the plot, could mean that mostly noise is used as one of the references.
- Within your 1st and 2nd groups of reference cells, there appear to be at least 3 and 2 different consistent patterns of expression, respectively, that are not properly "zeroed" because they are mixed together. One of those patterns, the top cluster of cells in your 2nd reference group, seems to at least partially overlap with the signal seen in most observation cells on chromosomes 10 and 19.

I cannot tell from this if the cells you used as reference are actually healthy or not as infercnv is an analysis relative to what is provided as references, but it is worth looking more into the accuracy of the clusterings you defined for your references.

Regards, Christophe.

deevdevil88 commented 1 year ago

Hi @GeorgescuC, thank you for replying. Indeed you are right: this is currently our pilot and we don't have "proper normal" cells sampled. For now the reference is all non-immune cells that aren't CA9 positive, but of course it seems that this isn't removing all tumour cells. So it seems I need to clean up the reference cells more; I will remove the small groups of reference-annotated cells for now and re-run. Thank you, these observations have been helpful. In the future we should have data from normal regions, which would solve the problem of the reference not being clean enough.

Best, Devika

GeorgescuC commented 1 year ago

Hi @deevdevil88 ,

One thing to keep in mind is that the normal baseline expressions are defined as the average expression in each reference. This means that having signal show up in your references does not necessarily mean that some of those cells are malignant/tumor. It simply means there is heterogeneity within the groups of cells you have defined as references. Conversely, if you define a homogeneous group of tumor cells as references (a clonal expansion, for example), they will not display any signal, since they are considered one of the baseline expression levels. Normal cells used as observations might then show signal that is the opposite of the event that happened in the tumor, as the difference is relative.

A very common example of signal showing up in healthy references is MHC genes on chromosome 6 for immune cells: usually about half the cells show a loss signal while the other half show a gain signal. This observation was also the basis for adding the z score masking option during subclustering.
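The arithmetic behind both cases can be illustrated with a toy base-R example for a single gene. All numbers are made up, and the real pipeline involves more steps than subtracting a mean; this only shows why a relative baseline produces these patterns.

```r
# Case 1: a heterogeneous but healthy reference group (e.g. an MHC gene).
# Half of the cells express the gene lower, half higher (toy log-scale values).
reference <- c(rep(-0.5, 4), rep(0.5, 4))
baseline <- mean(reference)            # one baseline for the whole group
ref_residual <- reference - baseline   # what the references would show

print(baseline)
print(ref_residual)  # half look like a loss, half like a gain

# Case 2: a clonal tumor population carrying a gain is used as the reference,
# and normal cells are used as observations.
tumor_reference <- rep(1, 6)
normal_observation <- rep(0, 6)
tumor_baseline <- mean(tumor_reference)

tumor_residual <- tumor_reference - tumor_baseline    # no signal at all
obs_residual <- normal_observation - tumor_baseline   # looks like a loss

print(tumor_residual)
print(obs_residual)
```

In the first case the references show both loss and gain even though no cell is abnormal. In the second, the tumor gain disappears and reappears as an apparent loss in the normal cells.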

Regards, Christophe.

broadinstitute / infercnv

problem #496