CatExp plots - Githubissues

kamoors commented 2 years ago

Hi! me again...

I am now trying to use iMat++ for a completely different model (not iCEL, not Recon), and I'm currently using the CatExp.py stuff to categorize my gene expression data. However, upon using the gaussian curve plotting to obtain the thresholding values, I encounter some weird stuff.. Firstly, I cannot really grasp how to define the expectedStats values, and how exactly they affect the fitting (thus I'm leaving them as in your walkthrough) Second, when I do trimodal, I get a good R^2 value, but the mu values are not in the correct order (why?) - the mu1 matches the mean of the curve most to the right, mu2 matches the one on the left (which is negative?), and mu3 matches the curve in the middle. Is this normal?

Thus, later on I defined the rare cutoff as mu2, low cutoff as mu3, and high cutoff as mu1 - correct? trimodal {'mu1': 5.631200169345438, 'sigma1': 1.9850268078063493, 'mu2': 2.585393029325549, 'sigma2': 0.20107748436559067, 'mu3': 4.411776037922199, 'sigma3': 1.0847997002620264, 'N1': 2296.8310628226477, 'N3': 1795.7250793209037, 'N2': -50.55614214355137, 'Rsquared': 0.9960556525960251, 'RSS': 15.82686593199582, 'n': 100}

Thanks in advance :)

WalhoutLab commented 2 years ago

Hi Kamoors,

When the peaks are not clear or the default initial points (expected stats) are very different from where the actual peaks are, the initial points do matter, because the fitting can be stuck in a local minimum and you want to control where the local fitting starts (i.e., around the visible peaks).

In addition, a negative peak does not provide a valid (physically interpretable) set of statistics.

You should therefore change your initial points according to where you think the peaks of subpopulations are. For this distribution, it is hard to judge if you have a right or left shoulder (a shoulder would indicate the second subpopulation under the main peak) in your main population, or even if you have one. It is just clear that you have a blip on the left (seems like a third small subpopulation). So, I would set mu1 (the first peak) at around 2.3 (as far as I can judge where the blip is from the plot) with sigma1 around 0.5, and try to fit the main subpopulations with trial and error. For example, if we believe there is a right shoulder because of a tight subpopulation there, then something like mu2=5+/-1.5, mu3=7+/-0.7 could do (the second numbers are sigma values).

If the negative peak persists in your trials, I would fit two curves instead of three and manually judge the rare expression threshold based on the blip (unless, the blip becomes the secondary subpopulation and the main peak fits the rest well, that also seems possible in this plot and should definitely be tried if you have not already done so).

The order of curves in the output does not matter in the sense that it is not a technical problem; you should just decide on where the high, low and middle peaks are. However, if the order does not follow your initial points, that proves they were not effective in directing the fit.

Finally, if you still cannot come up with a satisfactory fitting, I would be happy to give it a shot if you share the data. This seems like an interesting histogram and may help improve CatExp too.

Safak

-- Lutfu Safak Yilmaz, Ph.D. Associate Professor

Department of Systems Biology University of Massachusetts Chan Medical School 368 Plantation Street, AS5.1057 Worcester MA, 01605-2324 Tel: (508) 856-6711

From: kamoors @.> Sent: Friday, October 14, 2022 11:15 AM To: WalhoutLab/MERGE @.> Cc: Subscribed @.***> Subject: [WalhoutLab/MERGE] CatExp plots (Issue #5)

Hi! me again...

I am now trying to use iMat++ for a completely different model (not iCEL, not Recon), and I'm currently using the CatExp.py stuff to categorize my gene expression data. However, upon using the gaussian curve plotting to obtain the thresholding values, I encounter some weird stuff.. Firstly, I cannot really grasp how to define the expectedStats values, and how exactly they affect the fitting (thus I'm leaving them as in your walkthrough) Second, when I do trimodal, I get a good R^2 value, but the mu values are not in the correct order (why?) - the mu1 matches the mean of the curve most to the right, mu2 matches the one on the left (which is negative?), and mu3 matches the curve in the middle. Is this normal?

Thus, later on I defined the rare cutoff as mu2, low cutoff as mu3, and high cutoff as mu1 - correct? [trimodal]https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fuser-images.githubusercontent.com%2F35234605%2F195881190-19e61072-ef77-4fc6-a9b4-3018a68d54ed.png&data=05%7C01%7CLutfuSSafak.Yilmaz%40umassmed.edu%7Caaa86a6e7e0748c6656408daadf6e8b6%7Cee9155fe2da34378a6c44405faf57b2e%7C0%7C0%7C638013573224323381%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=YqHdN6Gdalf0WWoDs5zyRM8w4sdAHJHwMp1c23%2BozB8%3D&reserved=0 {'mu1': 5.631200169345438, 'sigma1': 1.9850268078063493, 'mu2': 2.585393029325549, 'sigma2': 0.20107748436559067, 'mu3': 4.411776037922199, 'sigma3': 1.0847997002620264, 'N1': 2296.8310628226477, 'N3': 1795.7250793209037, 'N2': -50.55614214355137, 'Rsquared': 0.9960556525960251, 'RSS': 15.82686593199582, 'n': 100}

Thanks in advance :)

— Reply to this email directly, view it on GitHubhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FWalhoutLab%2FMERGE%2Fissues%2F5&data=05%7C01%7CLutfuSSafak.Yilmaz%40umassmed.edu%7Caaa86a6e7e0748c6656408daadf6e8b6%7Cee9155fe2da34378a6c44405faf57b2e%7C0%7C0%7C638013573224323381%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=cc3yawTBlj%2FAW9JmYGzxTMyJ5k1hmWcAV%2Fb2MaqSxCo%3D&reserved=0, or unsubscribehttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA3DYFIQBUPGDIPVMXWVV2DWDF2IPANCNFSM6AAAAAARFKV4VI&data=05%7C01%7CLutfuSSafak.Yilmaz%40umassmed.edu%7Caaa86a6e7e0748c6656408daadf6e8b6%7Cee9155fe2da34378a6c44405faf57b2e%7C0%7C0%7C638013573224479657%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=jMCpevZRDgmIYM5eN62KQ0HD1RNC5%2Bx9chm0GZFSLss%3D&reserved=0. You are receiving this because you are subscribed to this thread.Message ID: @.***>

kamoors commented 2 years ago

Thank you so much for the explanation!

Maybe you can quickly answer this as well - in the walkthrough, for the C. elegans and human calculations, the thresholds you use are different, i.e.: cel: 2**np.array([Stats_cel['mu1'],Stats_cel['mu1']+Stats_cel['sigma1'],Stats_cel['mu2']])

recon: 2**np.array([Stats_hsa['mu1']-Stats_hsa['sigma1'],Stats_hsa['mu1'],Stats_hsa['mu3']])

In the first case, you use mu1 + sd1 for the low threshold (why not mu2 - sd2?). In the second case you just use mu1 - sd1 for the rare threshold, and again mu1 for the low threshold (seemingly ignoring mu2 altogether). Are these choices made from observation of the graph, or is there some data-driven basis for these decisions?

I will ask about sharing the data - this is a part of a collaboration that should be in the publishing phases soon.. So I don't know whether that is possible right now, but afterwards I think it shouldn't be a problem.

WalhoutLab commented 2 years ago

These are choices made based on the inspection of the fitted distributions.

CatExp was designed to serve two purposes. One is to facilitate the determination of thresholds based on histogram's shape rather than arbitrary cutoffs like high expression=top25%. The other is to use these thresholds to obtain the four sets of genes in each condition as input for MERGE.

Your question relates to the first goal. Thresholds in the examples in your question are based on the number and position of sub-populations and their spread, overlaps with each other etc.

Safak

-- Lutfu Safak Yilmaz, Ph.D. Associate Professor

Department of Systems Biology University of Massachusetts Chan Medical School 368 Plantation Street, AS5.1057 Worcester MA, 01605-2324 Tel: (508) 856-6711

From: kamoors @.> Sent: Friday, October 14, 2022 12:32 PM To: WalhoutLab/MERGE @.> Cc: Yilmaz, Lutfu S (Safak) @.>; Comment @.> Subject: Re: [WalhoutLab/MERGE] CatExp plots (Issue #5)

Thank you so much for the explanation!

Maybe you can quickly answer this as well - in the walkthrough, for the C. elegans and human calculations, the thresholds you use are different, i.e.: cel: 2**np.array([Stats_cel['mu1'],Stats_cel['mu1']+Stats_cel['sigma1'],Stats_cel['mu2']])

recon: 2**np.array([Stats_hsa['mu1']-Stats_hsa['sigma1'],Stats_hsa['mu1'],Stats_hsa['mu3']])

In the first case, you use mu1 + sd1 for the low threshold (why not mu2 - sd2?). In the second case you just use mu1 - sd1 for the rare threshold, and again mu1 for the low threshold (seemingly ignoring mu2 altogether). Are these choices made from observation of the graph, or is there some data-driven basis for these decisions?

I will ask about sharing the data - this is a part of a collaboration that should be in the publishing phases soon.. So I don't know whether that is possible right now, but afterwards I think it shouldn't be a problem.

— Reply to this email directly, view it on GitHubhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FWalhoutLab%2FMERGE%2Fissues%2F5%23issuecomment-1279222608&data=05%7C01%7CLutfuSSafak.Yilmaz%40umassmed.edu%7Cea93c17127d0475bbeca08daae01abf1%7Cee9155fe2da34378a6c44405faf57b2e%7C0%7C0%7C638013619448558514%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Y4vLN3ICMY0KP5ossAefIQqulCSpLAZTK%2BgIfLrKbFk%3D&reserved=0, or unsubscribehttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA3DYFP2UTKHXI2ZJXZLTR3WDGDJNANCNFSM6AAAAAARFKV4VI&data=05%7C01%7CLutfuSSafak.Yilmaz%40umassmed.edu%7Cea93c17127d0475bbeca08daae01abf1%7Cee9155fe2da34378a6c44405faf57b2e%7C0%7C0%7C638013619448558514%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=CAgkKthXULiYwrjsYFeYDJ1sYorEv4M3Kn2g2E%2FCwDQ%3D&reserved=0. You are receiving this because you commented.Message ID: @.***>

kamoors commented 2 years ago

iMAT++_gihub Yet another interesting case.. I cannot seem to get a good fit (probably because of the plateau)

Maybe you have experience with getting data that looks like this? I tried trimodal, but obsiouly the left-hand side cannot be fitted with one or two curves..

Karlis

WalhoutLab commented 2 years ago

Hi Karlis,

Yes this is a strange distribution that I have never seen.

The right (high) subpopulation seems in the right position, though. I would still use its peak for the Highly expressed threshold. However, I would also try triple-curve fitting instead of just two*.

A logical threshold for Low expression could be where the high subpopulation is diminished and there is a clear drop in actual frequency (around x=4.5 I think).

Rare has to be arbitrary, as the left part of the distribution is like a uniform distribution (pretty rectangular), maybe cut it in half or something. It is strange that the data drops to almost nothing exactly after TPM=1 (x=0), maybe the data is processed by some thresholding already?

PS

If you do triple-peak fitting, then from left to right the peaks may represent Rare, Low, and High thresholds, unless the fitting does not make sense (i.e., as in the case of negative areas, too close peaks, etc.) Safak

-- Lutfu Safak Yilmaz, Ph.D. Associate Professor

Department of Systems Biology University of Massachusetts Chan Medical School 368 Plantation Street, AS5.1057 Worcester MA, 01605-2324 Tel: (508) 856-6711

From: kamoors @.> Sent: Tuesday, November 29, 2022 8:54 AM To: WalhoutLab/MERGE @.> Cc: Yilmaz, Lutfu S (Safak) @.>; Comment @.> Subject: Re: [WalhoutLab/MERGE] CatExp plots (Issue #5)

[iMAT++_gihub]https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fuser-images.githubusercontent.com%2F35234605%2F204547104-201c1fdc-3932-43bb-98c8-04d4bd34bdef.png&data=05%7C01%7CLutfuSSafak.Yilmaz%40umassmed.edu%7C772eb2e75130487ee23a08dad2113518%7Cee9155fe2da34378a6c44405faf57b2e%7C0%7C0%7C638053268596579125%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=VAKsF0NvZ1oulMSUdTCMCKABaaqE5EOPfKfuKEPLgko%3D&reserved=0 Yet another interesting case.. I cannot seem to get a good fit (probably because of the plateau)

Maybe you have experience with getting data that looks like this?

Karlis

— Reply to this email directly, view it on GitHubhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FWalhoutLab%2FMERGE%2Fissues%2F5%23issuecomment-1330685693&data=05%7C01%7CLutfuSSafak.Yilmaz%40umassmed.edu%7C772eb2e75130487ee23a08dad2113518%7Cee9155fe2da34378a6c44405faf57b2e%7C0%7C0%7C638053268596579125%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=LTkF4RK1e7obUW9cc%2Fb8TAVHwBgdC08Mm4cToGBPNK0%3D&reserved=0, or unsubscribehttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA3DYFL4XL5QFXJKI3MP7IDWKYDIRANCNFSM6AAAAAARFKV4VI&data=05%7C01%7CLutfuSSafak.Yilmaz%40umassmed.edu%7C772eb2e75130487ee23a08dad2113518%7Cee9155fe2da34378a6c44405faf57b2e%7C0%7C0%7C638053268596579125%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=76evSLWEoca9o9eJJHROb%2B6F%2FlN3hk%2F5UvmAiZheDmo%3D&reserved=0. You are receiving this because you commented.Message ID: @.***>

kamoors commented 2 years ago

These are the parameters I used -> Stats_cel=trimodal(Texp_cel['ave'],expectedStats=[3,0.5,5,1,7,1]);

And this is the R squared I get -> 'Rsquared': 0.979682592196539

iMAT++_gihub_tri

When I tried to use x=4.5 (instead of x = 5) it cannot solve..

Thank you for all the help you've offered so far. I really appreciate it :)

Karlis

WalhoutLab / MERGE

CatExp plots #5