cmu-phil / tetrad

Repository for the Tetrad Project, www.phil.cmu.edu/tetrad.
GNU General Public License v2.0

Clarification on the Type of Example Data Used in ./resources/airfoil-self-noise.continuous.txt #1783

Open miaomiao-alt opened 3 months ago

miaomiao-alt commented 3 months ago

I am currently working with the Tetrad package and using the example data file located at ./resources/airfoil-self-noise.continuous.txt. I need some clarification on the nature of this dataset. Could you please let me know if this example data represents cross-sectional data, time series data, or repeated measures data?

Thank you for your assistance!

jdramsey commented 3 months ago

Huh, I happen to be awake. I know that dataset well. It's an experiment. Out of your options, I would choose repeated measures. The experimental variables are set by the experimenter, and then the experiment is run and the other variables measured.

jdramsey commented 3 months ago

Hold on, I can do better. I included that dataset in my online repository of real datasets, here:

https://github.com/cmu-phil/example-causal-datasets/tree/main/real/airfoil-self-noise

If you look in the ReadMe file, you'll see a reference to the UCI Repository; if you follow that link...nope nope nope, they moved it...

Here's the current UCI repository link:

https://archive.ics.uci.edu/dataset/291/airfoil+self+noise

miaomiao-alt commented 3 months ago

Could you please let me know if all the variables are measured at a single time point, or if these variables are measured multiple times at different time points? For instance, are certain variables measured once on a specific date (e.g., February 1st), or are they measured multiple times on different dates (e.g., February 1st, March 1st, and April 1st)?

jdramsey commented 3 months ago

I don't believe that this is the sort of data that you could usefully model as a time series if that's what you're asking. The experimenter picked values for the experimental variables, set up the apparatus that way, started the wind tunnel, and measured the other variables. There's no reason to think that the values obtained from a later date would be influenced by the values obtained at an earlier date.

I believe this is the original NASA tech report for the experiment if I'm not mistaken. I recognize the font on the title page. The UCI page used to give this reference but has dropped it; I'm not sure why.

https://ntrs.nasa.gov/api/citations/19890016302/downloads/19890016302.pdf

miaomiao-alt commented 3 months ago

I would like to ask: my research is panel research. Can this kind of data be modeled using the models in this package?

Panel research is a method of data collection and analysis that tracks the same set of individuals over multiple time points. Here are its key features and advantages:

Data Collection:
- Longitudinal Tracking: Repeated measurements on the same group of individuals at different time points.
- Multidimensional Data: Typically involves multiple variables such as economic status, health conditions, behavioral changes, etc.

Research Design:
- Cross-sectional Design: Data at each time point can be viewed as a cross-sectional study.
- Longitudinal Design: Allows the study of changes and causal relationships over time by analyzing data across multiple time points.

cg09 commented 3 months ago

I am not sure what you mean by "this package." If you refer to the TETRAD software then, yes, easily.

Cg


miaomiao-alt commented 3 months ago

What I meant was that I wanted to use RPy-Tetrad to analyze my data (panel). I'm sorry that I didn't make it clear before.

jdramsey commented 3 months ago

Ah, rpy-tetrad! This is doable, though you may need to give me more information. Are there unobserved variables, for instance, that may influence the model? Let's say you can assume not, so that you can treat the model as causally sufficient and search for a CPDAG. You can use, say, BOSS (which is a pretty accurate CPDAG search) to analyze the data. You'll need to define background knowledge that sets up "tiers," with the variables at the first time point in, say, tier 0 and the variables at the second time point in tier 1. Let me see if I can find an example in the rpy-tetrad example files. Here's one:

https://github.com/cmu-phil/py-tetrad/blob/main/pytetrad/R/sample_r_code4.R

Well, I guess I used tier 1 and tier 2 there, but that's OK. Tier 0 and tier 1 would have been more standard; maybe I'll fix that.

This may be a good way to get started, and then you can tell me how far this is from what you need.
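The tiering idea can be illustrated without any Tetrad calls. The following is a minimal sketch in plain Python, assuming the panel data arrive in "long" format (one row per individual per wave). The function name `panel_to_wide`, the keys `id` and `wave`, and the `_t0`/`_t1` suffixes are hypothetical choices made for this illustration, not part of py-tetrad or rpy-tetrad. It pivots the data so that each individual becomes one row, each variable gets one column per wave, and the columns are grouped into tiers by wave.

```python
from collections import defaultdict


def panel_to_wide(records, id_key="id", wave_key="wave"):
    """Pivot long-format panel rows into one wide row per individual.

    Returns (wide, tiers): `wide` maps each id to a dict of suffixed
    columns such as "income_t0"; `tiers` maps a tier index (0 = earliest
    wave) to the list of wide column names measured in that wave.
    """
    waves = sorted({r[wave_key] for r in records})
    wave_index = {w: k for k, w in enumerate(waves)}
    wide = defaultdict(dict)
    tiers = {k: [] for k in range(len(waves))}
    for r in records:
        k = wave_index[r[wave_key]]
        for name, value in r.items():
            if name in (id_key, wave_key):
                continue
            col = f"{name}_t{k}"
            # Store as float so every column is unambiguously continuous.
            wide[r[id_key]][col] = float(value)
            if col not in tiers[k]:
                tiers[k].append(col)
    return dict(wide), tiers


# Hypothetical two-wave panel with two individuals.
records = [
    {"id": 1, "wave": 2023, "income": 30, "health": 7},
    {"id": 1, "wave": 2024, "income": 32, "health": 6},
    {"id": 2, "wave": 2023, "income": 45, "health": 8},
    {"id": 2, "wave": 2024, "income": 44, "health": 8},
]
wide, tiers = panel_to_wide(records)
for k, names in tiers.items():
    print(k, names)
```

The resulting tier lists are what would be handed to the background-knowledge step shown in the linked sample file, so that a search is not allowed to orient an edge from a later wave back to an earlier one.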

Obviously, you're an R person, so you probably know this, but there is an issue in R where data is read in as numeric only if the numbers contain decimal places; if they don't, you'll need to coerce them to be numeric, as in the example:

    i <- c(1, 7)
    data[ , i] <- apply(data[ , i], 2, function(x) as.numeric(x))

This is important, as Tetrad distinguishes continuous from discrete variables and translates numeric R variables as continuous and integer R variables as discrete, so you may end up with a mixture, in which case you'll need to use different tests or scores.
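The same check can be done before the data ever reach Tetrad. Below is a minimal sketch in plain Python, assuming, by analogy with the R behavior described above, that integer-typed columns would be read as discrete and floating-point columns as continuous. The helper names `split_by_type` and `coerce_to_float` and the column names are hypothetical, made up for this illustration.

```python
def split_by_type(rows):
    """Group column names by whether every value in the column is an int."""
    integer_cols, float_cols = [], []
    for name in rows[0]:
        values = [row[name] for row in rows]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
            integer_cols.append(name)
        else:
            float_cols.append(name)
    return integer_cols, float_cols


def coerce_to_float(rows, columns):
    """Return copies of the rows with the named columns converted to float."""
    return [
        {k: (float(v) if k in columns else v) for k, v in row.items()}
        for row in rows
    ]


# Hypothetical rows: "frequency" has no decimal places, so it parses as int.
rows = [
    {"frequency": 800, "angle": 0.0, "pressure": 126.201},
    {"frequency": 1000, "angle": 1.5, "pressure": 125.201},
]
ints, floats = split_by_type(rows)
print(ints, floats)

fixed = coerce_to_float(rows, ints)
print(split_by_type(fixed))
```

Listing the integer-typed columns first makes the choice explicit: coerce the ones that are really continuous measurements, and leave genuinely categorical ones alone so that a mixed-data test or score can be chosen deliberately.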

miaomiao-alt commented 3 months ago

If unobserved variables may affect the model, causal sufficiency is not satisfied. The solution I have in mind is to regress out the influence of the unobserved variables and then use the BOSS algorithm for the analysis. Is this solution reasonable?

jdramsey commented 3 months ago

The issue is if you don't know what the unobserved variables are. You could use latent variable algorithms. I could give you some recommendations, though the ones that work the best are "experimental" currently. But I would try BOSS first and see if it gives you sensible results. What do you think?

miaomiao-alt commented 3 months ago

I'll take your advice and try BOSS first. In addition, I want to identify some of the effects that confounding variables, treated as unobserved variables, have on modelability.

jdramsey commented 3 months ago

OK! Sorry, this week was busy. Let me go back now and look at your questions.

jdramsey commented 3 months ago

We have some latent variable algorithms that address your question and are theoretically correct. I've been concerned for the past several months about their accuracy. The most accurate among the ones you can currently use from Python or R is BFCI. (Or GRaSP-FCI; they are similar, and you could try both.) BFCI uses both a test and a score; I have recently put together an entirely score-based LV method called LV-Lite that beats BFCI on a number of measures, though it's not public yet, nor is it published or even written up. I will try to do that so you can use it, but I don't know when that will happen. In the meantime, you can use BFCI, I think.

The output of BFCI is a PAG (Partial Ancestral Graph); if you're not already familiar with PAGs, you may need to do some reading on them so you know how to interpret the outputs.

jdramsey commented 3 months ago

Both BFCI and LV-Lite use BOSS as an initial step. So they are LV algorithms that are extensions of BOSS, which is why I thought you should try running BOSS first. In fact, there is another way to proceed, which isn't theoretically correct but is often accurate: run BOSS and then simply report the PAG that the BOSS result belongs to, which can be done in Tetrad. Let me know if you'd like to do that, and I can add support for that to py-tetrad and rpy-tetrad.

jdramsey commented 3 months ago

By the way, I did find some issues in how the LV algorithms use background knowledge, which I've fixed in my GitHub branch. I can put the code with the fixes into py-tetrad for you to use. I say this because you were going to do a temporally tiered analysis.

jdramsey commented 1 month ago

@miaomiao-alt Can I close this issue?