local vs global optimal path

wangmhan commented 1 year ago

Hi,

Thank you for the nice package!

I have some problems related to the pathfind function. Does it first find the global optimal path, then cut high dissimilarity out? 1) In my case, I changed my gene list of interested GO term slightly: 150 gene in the beginning, then add 60 more genes in. The heatmap of dissimilarity looks almost the same, but the path changed a lot!

result of 150 genes: gochick_heatmap gochick_align

result of 210 genes: gounion_heatmap gounion_align

2) the path it identified seems not the optimal ones. the gray path is the one identified by pathfind. but by eyes, the red line make more biological sense, also seems better if consider it locally. example1

Do you have any suggestions to get more stable and optimal result? Many thanks!

No2Ross commented 1 year ago

Hi there!

Thanks for showing an interest.

There is two ways that TrAGEDy cuts out high dissimilarity. The first is before the global optimal path is calculated (see the top half of supplementary figure 2 of the TrAGEDy paper) and the second (see the bottom half of supplementary figure 2 of the TrAGEDy paper) does it after the global path is identified.

In terms of adding the 60 genes, you're right the paths have changed, and this might be an area where TrAGEDy might not be suited. TrAGEDy assumes that their is a continious alignment path between your two datasets, but you can see your dissimilarity matrix is quite 'spiky' (i.e. areas of high and low dissimilarity occuring right next to each other).

These 'spiky' regions can sometimes be caused by your pseudotime values not being accurate. What packages and code are you using to generate them

Another thing i've noticed is that your plot dissimilarity scores don't start from 0. In the latest version of TrAGEDy on github, the plot dissimilarity score legend should always start from 0, otherwise very dissimilar regions can have a blue colour (e.g. the last plot you've posted) and make it seem like there is an alignment where there is none.

Hope this is helpful!

-Ross

wangmhan commented 1 year ago

Thanks for the quick reply!

Could you give an example to show how to cut out high dissimilarity before the global optimal path? In the tutorial, it showed how to cut out after global optimal path if I understand correct?

In my dataset, our hypothesis is the start points of the two lineage is quite different, but after the treatment their end points should converged. So that it might not be always continuous through the trajectory. Do you think TrAGEDy suited in this case?

I used Slingshot to generate the pseudotime value. I subset the cells on the interested branch for following analysis. I checked the dynamic of some known markers and the temporally changes are consistent with literature. I also have a question related to that. The maximum value of the two trajectory are quite different, one is around 3 the other is around 6. In this case, do you think I should normalise the maximum to 1 to make them more comparable?

There is a batch effect between my two datasets. I don't know which normalised value as input matrix is better: logNormalize, or scaleData, or scaleData regress out cell cycle. Have you ever tested these normalised value to see which has a better performance?

I modify the code a bit, change the score legend from 0 to minimum value.

Do you have any suggestions for my second question? The optimal path by eyes (the red line) is quite different from the one picked by dtw (the grey line).

Best regards, Menghan

No2Ross commented 1 year ago

Hi Menghan,

It would be difficult to remove high dissimiilarity regions before global optimal path as we use the global path alignment as a model to set thresholds for removing high dissimilarity regions.

In the scenario described, TrAGEDy could work in your scenario but you would have to change the cut_type parameter in the pathfind from "minimum" to "none". I say this because if we have the following scenario:

Because we have no alignment at the beginning, it tries to find the optimal start point and instead picks a region which (probably due to noise) has slightly lower dissimilarity. This means it cuts some optimal alignment parts from your final alignment, because it ocurred before the 'optimal' point of initial alignment. By setting cut_type to none, it takes the very start and very end as the points of initial and end alignment, and we'd capture the section properly.

In future versions of TrAGEDy we will add a parameter which will ask whether the users want to cut: the start, the end, both or none of the dissimilarity matrix, which will deal with this problem a lot more elegantly.

In terms of differences in pseudotime values, none of the tests we've done have been on datasets with the same pseudotime scales, however we've not dealt with anything as extreme as pseudotime scales that are half of the other dataset in the comparison. I've just ran the alignment code for our start_shared simulated alignment (Fig 2A in the paper) where I made the pseudotime of the WT dataset half that of the KO (WT max pseudotime = 10.53, KO max pseudotime = 21.07) and it the alignments looked very similar:

In terms of batch effects all our results are done on the normalized data. Because we use Spearman's correlation for alignment, it doesn't matter whether the expression values are scaled or not. The T cell dataset in the TrAGEDy paper seemed to have a pretty substantial cell cycle batch effect. We controlled for this in terms of pseudotime by regressing out cell cycle genes when we scaled the data and we used the normalised data for dissimilarity estimation.

In terms of the grey vs your red line path I'd say there should be no alignment between the two datasets (at least on the genes you've chosen). If you kept the scale as it is originally, it would probably be all red; the lowest dissimilarity you have is 0.8 which is a spearmans correlation of 0.2, which is quite low.

I will say that TrAGEDy is not the ultimate authority on alignments, we could get it wrong. We've tested it on quite a few datasets and it's been able to return (what we think) are biological accurate alignments but we could always be wrong with other ones. Does the data integrate together with harmony or Seurat?

Also how are you choosing the genes that define the alignment?

Best wishes, Ross

wangmhan commented 1 year ago

Hi Ross,

Thank you very much for the nice illustration!

I'm using Seurat integrateData to minimise batch effect. But it a bit overfitting here, as the dissimilarity score is quite low. So I just used LogNormalize data from UMI for the analysis.

The gene list I used is from a specific GO term.

Will definitely try out the parameters you suggested. : )

Best, Menghan

No2Ross commented 1 year ago

We've only tested TrAGEDy on capturing a whole biological process (i.e. T brucei differentiation and T cell development) we've not tested whether it will work on just one GO term so that might be the issue.

If you have time, I would try on some marker genes of your two dataset clusters and then see if you can identify any of the genes in the GO term as being differentially expressed across the alignment.

Let me know how changing the parameters goes and good luck.

Best wishes, Ross

No2Ross / TrAGEDy

local vs global optimal path #2