erdogant / bnlearn

Python library for learning the graphical structure of Bayesian networks, parameter learning, inference and sampling methods.
https://erdogant.github.io/bnlearn

Working with continuous expression data? #94

Open stevenagl12 opened 6 months ago

stevenagl12 commented 6 months ago

I have a potentially dumb question. As I understand it, we need to discretize continuous biological data, such as gene expression or cytometry data, before using this package. However, the built-in bn.discretize function takes an already-built graph as input. With our data, we can't infer which nodes and edges we have, not even to start from a random graph. How can we use this package with such continuous data? As I understand it, the R bnlearn library comes with options such as iamb and Hartemink discretization, but I don't see those in this package.

erdogant commented 6 months ago

When you only have data and want to start without a structure, try structure learning. However, the methods in bnlearn do require the data to be discrete.

Two suggestions on how to approach this:

  1. Discretize your data based on your domain knowledge and/or in combination with other statistics. For example, for your gene expression profiles you could do a t-test against a control group and set a threshold (alpha of 0.05), with or without multiple-testing correction. This would return three states for each gene (up, baseline, down). If you don't have a control group, try fitting a theoretical distribution to the data (check out distfit) and make a cut at the 95% CI or so. Do this on both sides of the distribution and you would again have three states per gene. This comes close to the constraint-based approach: https://erdogant.github.io/bnlearn/pages/html/Structure%20learning.html#constraint-based

  2. Try using the built-in functionality of bnlearn to automatically discretize and create states based on the continuous expression profiles. This is again a starting point towards structure learning. See the documentation for more details.

https://erdogant.github.io/bnlearn/pages/html/Continuous%20Data.html
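A minimal sketch of the "no control group" variant of suggestion 1, not code from bnlearn. As a simplification, it cuts at the empirical 2.5% and 97.5% percentiles with pandas instead of fitting a theoretical distribution with distfit. The function name `three_state`, the gene names, and the random data are made up for illustration.

```python
import numpy as np
import pandas as pd


def three_state(df, lower=0.025, upper=0.975):
    """Label every value 'down', 'baseline' or 'up', column by column.

    Assumption: the tails are defined by empirical percentiles of each
    column, not by a fitted theoretical distribution.
    """
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        lo, hi = df[col].quantile([lower, upper])
        # Open-ended outer bins so every finite value gets a label.
        # Note: a constant column (lo == hi) would make the bin edges
        # non-unique and pd.cut would raise.
        bins = [-np.inf, lo, hi, np.inf]
        out[col] = pd.cut(
            df[col], bins=bins, labels=["down", "baseline", "up"]
        ).astype(str)
    return out


# Illustrative data: 200 samples, 3 hypothetical genes.
rng = np.random.default_rng(0)
expr = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=["geneA", "geneB", "geneC"]
)
states = three_state(expr)
print(states.apply(pd.Series.value_counts))
```

The result is a purely categorical frame with three states per gene, which is the kind of input the structure learning step above is described as needing.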

There are no methods like iamb. However, check out what's available in pgmpy. If there is something that could help you, I am open to merging commits.

Asking questions makes you smart btw. Keep it up 👍🏻

stevenagl12 commented 6 months ago

While I understand the first part, I was wondering about the second. The discretize function takes a DAG as an argument. In the example, this DAG is created from prior knowledge of the connections between the variables. How do we create one without knowing which variables might be connected?

erdogant commented 6 months ago

You are right. The second part does need a DAG at the start. Unfortunately, there is no other implementation yet.

akshatakarjun commented 1 month ago

Hi,

By continuous biological data, did you mean continuous data like various numbers (for example 103.2, 102, 99, 2.5, etc.) or time-series data? If it isn't either of these, could you please explain what the data you mentioned looks like?

Also, if it is different, is this package applicable to continuous data like the kind I mentioned above?

stevenagl12 commented 1 month ago

I was talking about various numbers of RNAseq fold changes.

erdogant commented 1 month ago

If you would like to see a comparison with other causal packages, you can read about it in my blog over here. The last time I checked, only CausalImpact could model continuous values, but that is for time-series data. So it is not applicable when you are using RNAseq data.

Loominarty commented 1 month ago

I also have a dumb question:

I have a dataset that mixes continuous and discrete data. I noticed that the bn.discretize function takes a lot of time (my dataset has roughly 11,000 data points and 9 columns, 4 of which are continuous).
Is it possible to discretize outside of bnlearn, or is that not compatible?

I tried using pandas functions to circumvent the issue and generate Interval Indexes in my dataset, but with very little success.
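One possible way around the Interval objects, sketched under assumptions: `pd.qcut` (and `pd.cut`) accept `labels=False`, which returns plain integer bin codes instead of intervals. The column names and random data below are invented, and whether quantile bins suit a given dataset is a judgment call.

```python
import numpy as np
import pandas as pd

# Illustrative mixed dataset: two continuous columns, one discrete column.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "temp": rng.normal(20, 5, 1000),
    "pressure": rng.gamma(2.0, 3.0, 1000),
    "machine": rng.choice(["a", "b", "c"], 1000),
})

continuous_cols = ["temp", "pressure"]  # assumption: you know which columns are continuous

disc = df.copy()
for col in continuous_cols:
    # labels=False yields integer codes 0..q-1 rather than Interval objects.
    # duplicates="drop" merges bins if many identical values collapse the edges.
    disc[col] = pd.qcut(df[col], q=4, labels=False, duplicates="drop")

print(disc.head())
print(disc.dtypes)
```

Only the continuous columns are touched; the already-discrete column is left as it is.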

akshatakarjun commented 1 month ago

I'm unsure what kind of continuous data you have, but if possible, you can manually put the values into discrete ranges. For example, if a feature called BloodPressure has various values, we know which values of BP are considered normal, high, and low. You can loop over the rows with an if statement: if a value falls in a given range, replace it with the categorical value you want.

Just a thought!!
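A small sketch of that idea using `pd.cut` with explicit bin edges, which avoids writing the loop by hand. The thresholds 90 and 130 are placeholders for illustration only, not clinical guidance; substitute whatever your domain knowledge says.

```python
import numpy as np
import pandas as pd

bp = pd.Series([85, 118, 142, 101, 160], name="BloodPressure")

# Hypothetical thresholds: below 90 is "low", 90 up to 130 is "normal",
# 130 and above is "high".
edges = [-np.inf, 90, 130, np.inf]
labels = ["low", "normal", "high"]

# right=False makes each bin closed on the left: [90, 130) and so on.
bp_state = pd.cut(bp, bins=edges, labels=labels, right=False).astype(str)
print(bp_state.tolist())
```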

Loominarty commented 1 month ago

Hi @akshatakarjun,

I found something that works alright but is not very convenient for the user. I discretized outside of the library and used bn.df2onehot to encode the indexes into integers. Then I just translate my new incoming data into one of these numbers.
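For the "translate new incoming data" step, one option is to keep the bin edges learned on the original data and reuse them. This is a sketch in plain pandas, independent of bn.df2onehot; the series name `sensor` and all values are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
train = pd.Series(rng.normal(50, 10, 500), name="sensor")

# Learn three quantile bins on the training data and keep the edges.
codes, edges = pd.qcut(train, q=3, labels=False, retbins=True)

# Open the outer bins so unseen extremes in new data still get a code
# instead of NaN.
edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Apply the same edges to new observations.
new = pd.Series([10.0, 50.0, 95.0], name="sensor")
new_codes = pd.cut(new, bins=edges, labels=False)
print(new_codes.tolist())
```

Storing `edges` alongside the fitted model keeps the integer codes consistent between the data used for learning and the data used later for inference.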

erdogant commented 1 month ago

You can indeed manipulate your data as you wish. The df2onehot function was included in bnlearn to provide one of the steps from start to results. So you are right, it brings some comfort, but at the same time it is generally slow.