Closed hupili closed 10 years ago
Lee Chau Wai Wong:
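From my study of the source code at http://bit.ly/1riabfV, the author "centers" the choices of all legislators for each vote (in [37]), so that for each vote the choices of all legislators sum to zero. Stated more mathematically:
Suppose the choice of legislator i on motion j is x(i,j), where under the author's model x(i,j) takes one of the three values {-1, 0, 1}. After centering, the choice of legislator i on motion j becomes
x'(i,j) = x(i,j) - [average of x(i,j) over all i]
So for motion j, the absolute level of legislator i's centered choice depends on the choices of the other legislators (while the differences between legislators stay unchanged).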
For example, suppose there are 3 legislators whose votes on motion j are:
x(1,j) = -1, x(2,j) = 0, and x(3,j) = -1.
Then the centered choices are
x'(1,j) = -1 - (-2/3) = -1/3, x'(2,j) = 0 - (-2/3) = 2/3, and x'(3,j) = -1 - (-2/3) = -1/3.
The centered votes on motion j sum to zero.
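The arithmetic above can be checked with a minimal NumPy sketch. The array layout (legislators as rows, one motion as the single column) and the variable names are assumptions made for illustration; they are not taken from the notebook.

```python
import numpy as np

# Votes of 3 legislators (rows) on a single motion j (one column),
# coded as -1 / 0 / 1 as in the example above. Layout is an assumption.
x = np.array([[-1.0], [0.0], [-1.0]])

# Centering: subtract the per-motion (column) mean taken over all legislators.
x_centered = x - x.mean(axis=0, keepdims=True)

print(x_centered.ravel())      # centered choices x'(1,j), x'(2,j), x'(3,j)
print(x_centered.sum(axis=0))  # sums to zero up to floating-point rounding

# Differences between legislators are unchanged by centering.
print(x_centered[1] - x_centered[0], x[1] - x[0])
```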
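The author uses x'(i,j), rather than the true raw data x(i,j), as the input for computing the PCA projection. This is the direct reason why the PCA result for Tsang Yok-sing (曾鈺成) is non-zero. But it is only a bias: every legislator carries the same offset, so it does not change the message the PCA result conveys.
In fact, computing each politician's PCA result from the uncentered raw data would be more convincing (this can be justified via Karhunen–Loève theory; details omitted). In that case Tsang Yok-sing's PCA result would be zero, which is more intuitive.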
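A small sketch may help separate the two claims here. The vote matrix below is made up, and it assumes a member whose recorded choice is 0 on every motion; neither comes from the notebook. It compares scores along the first component with centering (PCA) and without it (plain SVD of the raw matrix).

```python
import numpy as np

# Hypothetical data: 5 members (rows) x 4 motions (columns), coded -1 / 0 / 1.
# The last member is assumed to have a recorded choice of 0 on every motion.
X = np.array([
    [ 1,  1, -1,  1],
    [ 1,  1, -1,  0],
    [-1, -1,  1, -1],
    [ 1,  0, -1,  1],
    [ 0,  0,  0,  0],
], dtype=float)

def top_right_singular_vector(data):
    """Return the first right singular vector (a unit vector over motions)."""
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt[0]

# PCA: center each motion, then project the centered rows.
mean = X.mean(axis=0)
v_pca = top_right_singular_vector(X - mean)
pca_scores = (X - mean) @ v_pca

# Projecting the raw rows onto the same axis differs from the PCA scores
# by one constant (mean @ v_pca), shared by every member.
raw_on_pca_axis = X @ v_pca
offset = raw_on_pca_axis - pca_scores

# Uncentered alternative: SVD of the raw matrix, raw rows projected.
v_svd = top_right_singular_vector(X)
svd_scores = X @ v_svd

print("PCA score of the all-zero member:", pca_scores[-1])
print("SVD score of the all-zero member:", svd_scores[-1])
print("Offset per member:", offset)
```

With centering, the all-zero row becomes minus the column means, so its score is generally non-zero. Without centering, a zero row projects to exactly zero on any axis. Note that `v_pca` and `v_svd` are in general different directions, so the two embeddings are not simply shifted copies of each other.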
Pili Hu:
Lee Chau Wai Wong, thanks for the interpretation! The cause is indeed the centering process, which is well illustrated by the numerical example above. Centering is required for PCA. I have notes on this in the "Remarks (optional)" section:
Whether you start from the maximum-variance formulation or the minimum-error formulation, solving the corresponding optimization problem shows that centering is an essential step.
One common (mis-)use of "PCA" is to omit that step, e.g. by applying SVD directly to the raw data. I wouldn't say that method is incorrect. It is still a valid spectral embedding technique and may give a better result in some cases (e.g. in this case it gives a "better" result because 0 is easier to comprehend). It is just that the method is not PCA and loses the theoretical guarantees that PCA provides.
There is a tradeoff between the "correct approach" and the "correct result" in a practical problem. To make the result easier to accept, one can omit the centering step. Just do not call it PCA; it is OK to call it SVD/EVD. Or one can simply do some post-processing to shift the points. In any case, the output of PCA should not be interpreted as absolute values. Those numbers only have relative meaning.
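The post-processing option could look like the sketch below: run PCA with centering as usual, then translate all points so that a chosen reference member sits at the origin. The data, the choice of reference member, and the helper name `pairwise_distances` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data: 5 members (rows) x 4 motions (columns), coded -1 / 0 / 1.
X = np.array([
    [ 1,  1, -1,  1],
    [ 1,  1, -1,  0],
    [-1, -1,  1, -1],
    [ 1,  0, -1,  1],
    [ 0,  0,  0,  0],
], dtype=float)

# Standard PCA via SVD of the centered matrix; keep the first two components.
X_centered = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ vt[:2].T          # shape (5, 2)

# Post-processing: shift every point so the reference member lands at (0, 0).
reference = 4                           # index of the all-zero member (assumed)
shifted = scores - scores[reference]

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# The translation changes absolute coordinates but not relative positions.
print(np.allclose(pairwise_distances(scores), pairwise_distances(shifted)))
```

Because the shift is a pure translation, the distances and directions between members are untouched, which is the sense in which the coordinates only have relative meaning.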
For the original educational purpose, I need to show the "correct approach". That is one extreme. For practical purposes, it's OK to make a tradeoff. In refining this result, we should be careful not to go too far toward the other extreme. Once we reach it, i.e. become purely result-oriented, we are back in the subjective zone and lose the root motivation for doing data mining.
Thanks again! It's great to see serious technical discussions!
Harry Mok:
Short answer: This is a common confusion raised by many readers. In short, the process is technically correct. There is no calculation "error". The problem is one of interpretation. One should not take the output of PCA as absolute measurements. Only the relative numbers matter. Some useful discussions are replicated in the comments.
Affected commits: 8b95798aa51ac512c4eb8d0ed98cbee12e5257c4
Fixed in: no fix
References: