M-Nauta / TCDF

Temporal Causal Discovery Framework (PyTorch): discovering causal relationships between time series
GNU General Public License v3.0

Question on Running TCDF #3

Closed ceesu closed 5 years ago

ceesu commented 5 years ago

Hello, thanks very much for your work on this interesting project. I would like to try TCDF following the 'Finance hidden' example in your paper but I have a few questions. Currently I am working with an independent dataset that I have made.

  1. Whereas in the 'demo_dataset.csv' the column headers are 'Timeser0', etc. the datasets in the Finance folder, 'random-rels_20_1A.csv' for example, do not seem to have headers. Would you be able to explain how these files correspond to 'Finance' / 'Finance hidden/Finance train/Finance test' in the paper?
  2. In section 4.3.2 of the paper, one passage states that 'If there is a measured confounder, PIVM should discover that the confounder’s effects Xi and Xj are just correlated and not causally related'. I am wondering, what about the relationship between the confounders and Xi/Xj? Say I have a situation where I have two measured confounders L1, L2 that affect each other, but also both of which together determine other time series Xi, Xj. a) How can I tell if L1, L2 have been discovered as confounders of the relationship between Xi and Xj? b) will the discovered causal relationships include those between L1, L2 and Xi, Xj?
  3. How will the discovery of causal relationships be affected by the number of hidden layers specified? Could you comment on whether there is any situation where increasing the number of hidden layers leads to the discovery of incorrect causal relationships? I am asking because on my dataset, certain relationships seem to appear only when I use L=1 and not with smaller (or larger) values of L.

M-Nauta commented 5 years ago

  1. For the Finance data, the files with 'returns' in the filename are the input datasets. These datasets have a header (S0, S1, S2, etc), just like in the demo. The other files contain the ground truth and indeed do not have a header. This is now clarified in the README.
  2. a) A measured confounder is a common cause that is present in the dataset. To tell whether L1 and L2 have been discovered as confounders, you can analyse the causal graph and check whether it contains causal relationships going from L1 and L2 to Xi and Xj.

    b) TCDF does not give a perfect causal graph, as shown by the experiments in the paper. It is therefore not possible to tell beforehand whether TCDF will discover all causal relationships between L1, L2, Xi and Xj.

  3. See section 6.1 of the paper, which discusses the number of layers. In short, it depends on the dataset, and it could indeed be the case that TCDF does not give exactly the same results when varying the number of hidden layers (which can also be seen in Table 4 of the paper). Increasing the number of layers is mainly useful when the time delay is expected to be rather large or when the time series are long. However, because the receptive field is larger with more layers, it might be more challenging for the network to discover the correct patterns.
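
To make the link between layers and receptive field concrete, here is a minimal sketch of the usual receptive-field calculation for stacked dilated convolutions. It assumes that L hidden layers means L + 1 convolutional layers in total and that layer l uses dilation c^l with kernel size K; both are assumptions to check against section 6.1 of the paper and the code. The function name `receptive_field` is hypothetical and not part of TCDF.

```python
def receptive_field(kernel_size: int, dilation_coefficient: int, hidden_layers: int) -> int:
    """Receptive field of a stack of dilated 1D convolutions.

    Assumption (not taken from the TCDF code): there are hidden_layers + 1
    convolutional layers, and layer l has dilation dilation_coefficient ** l.
    Each layer then widens the receptive field by (kernel_size - 1) * dilation.
    """
    if kernel_size < 1 or dilation_coefficient < 1 or hidden_layers < 0:
        raise ValueError("kernel_size and dilation_coefficient must be >= 1, hidden_layers >= 0")
    return 1 + sum(
        (kernel_size - 1) * dilation_coefficient ** layer
        for layer in range(hidden_layers + 1)
    )


if __name__ == "__main__":
    # Compare how far back the network can look for a few depths.
    for layers in range(3):
        print(layers, receptive_field(kernel_size=4, dilation_coefficient=4, hidden_layers=layers))
```

Under these assumptions the receptive field grows geometrically with the number of layers, which is one reason results can change noticeably between neighbouring values of L.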

ceesu commented 5 years ago

Thanks very much for your response. It seems like when I include the measurements of L1, L2 within the dataset, TCDF can correctly identify them as causes; but when I do not, it does not seem to identify that there are hidden confounders despite the fact that the time series are not so long (250 steps). But perhaps it's because my current dataset has many variables.

Given your response, I would like to try it with several different numbers of layers, and perhaps also with the finance dataset, to see what effect this has. For this purpose I have two small follow-up questions:

  1. Is it possible to save the graph output and the text output separately when running the program? It seems like sometimes the graph scale will be such that it is difficult to read the labels and I would like to, for example, resize the figure.
  2. Is it possible to feed in variables that are strings as inputs, for example for the --data parameter? If I do something like the following:

filename="file.csv"
%run -i "runTCDF.py" --data filename

It seems the program will try to find the file "filename" instead of "file.csv".

M-Nauta commented 5 years ago

As explained in section 4.3.2, TCDF concludes that there exists a hidden confounder (i.e. one not included in the dataset) when it discovers a 2-cycle between the hidden confounder's effects, both with delay 0 (see Fig 8b). TCDF will not draw the hidden confounder as a node in the graph, but the user can conclude that there should be a confounder. As a side note, our experiments showed that TCDF performs better on long time series (see Table 4), so 250 time steps might be a little short.
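
The 2-cycle rule described above can be checked mechanically once the discovered relationships are available. The following is a rough sketch, assuming the relationships are held in a dictionary that maps `(cause, effect)` pairs to the discovered delay. That data structure and the function name `find_hidden_confounder_candidates` are hypothetical and not part of TCDF's output format.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def find_hidden_confounder_candidates(delays: Dict[Edge, int]) -> List[Edge]:
    """Return pairs of series that form a 2-cycle with delay 0 in both directions.

    Following the rule from section 4.3.2 as quoted above, such a pair suggests
    that both series may be effects of a hidden confounder.
    """
    candidates = []
    for (cause, effect), delay in delays.items():
        if cause >= effect:
            # Visit each unordered pair once and skip self-loops.
            continue
        reverse_delay = delays.get((effect, cause))
        if delay == 0 and reverse_delay == 0:
            candidates.append((cause, effect))
    return sorted(candidates)


if __name__ == "__main__":
    # Made-up example relationships, purely for illustration.
    example = {
        ("X1", "X2"): 0,
        ("X2", "X1"): 0,  # 2-cycle with delay 0 in both directions
        ("X1", "X3"): 2,
        ("X3", "X1"): 0,  # 2-cycle, but one direction has a non-zero delay
        ("X4", "X4"): 1,  # self-loop, ignored
    }
    print(find_hidden_confounder_candidates(example))
```

A pair is reported only when both directions were discovered and both delays are 0; a 2-cycle with any non-zero delay is not treated as evidence of a hidden confounder here.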

Regarding your other questions:

  1. Saving the graph output and text output separately is not explicitly supported in the current implementation. However, you could add a line of code in TCDF to do this. For example, add your own code at line 253 in runTCDF.py to write the discovered causal relationships to a file. You can then choose your own plotting library to read this file and visualize the graph. Other visualization toolboxes offer more functionality, such as resizing the figure.
  2. I think that's not possible. The hacky way of doing it is to save your filename somewhere as a string and copy-paste it every time you want to run TCDF ;) However, please note that you can run TCDF on multiple datasets by giving a list of filenames as argument, separated by commas.
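
As an illustration of the first suggestion, here is a minimal sketch of writing discovered relationships to a CSV file and reading them back for separate plotting. It assumes the relationships are available as a dictionary mapping `(cause, effect)` to a delay. That structure, the function names `save_relationships` and `load_relationships`, and the column names are all hypothetical, so they would need adapting to whatever variables runTCDF.py actually holds at that point.

```python
import csv
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def save_relationships(delays: Dict[Edge, int], path: str) -> None:
    """Write one row per discovered relationship: cause, effect, delay."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["cause", "effect", "delay"])
        for (cause, effect), delay in sorted(delays.items()):
            writer.writerow([cause, effect, delay])


def load_relationships(path: str) -> Dict[Edge, int]:
    """Read the file written by save_relationships back into a dictionary."""
    with open(path, newline="") as handle:
        return {
            (row["cause"], row["effect"]): int(row["delay"])
            for row in csv.DictReader(handle)
        }
```

Once the relationships are in a file, the graph can be drawn in a separate script with any plotting library, at whatever figure size keeps the labels readable.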

ceesu commented 5 years ago

Thanks very much for your reply. I switched to running a 1200 time step dataset. Unfortunately it doesn't seem to discover any 2-cycles in this case either... I will try to use your suggestions for saving the plot and running multiple files though. This is very helpful!