I'm confused about the equation $\sum_j c_{ij}\, p(y_j) = \mu(\hat y_i)$ and the definition of the confusion matrix $C$ above.
As I understand it, the equation is based on the law of total probability, $$\sum_j P(\hat y = y_i \mid y = y_j)\, P(y = y_j) = P(\hat y = y_i),$$ where $\hat{y}$ denotes the predicted label of $x$ and $y$ denotes the true label of $x$. Matching the two equations term by term, $P(\hat y = y_i)$ corresponds to $\mu(\hat y_i)$ and $P(y = y_j)$ corresponds to $p(y_j)$. So the confusion matrix element $c_{ij}$ would need to be the conditional probability $P(\hat y = y_i \mid y = y_j)$, whereas, according to the definition above, $c_{ij}$ is actually a joint probability estimated on the training distribution. My questions are:
Am I misunderstanding something?
Or are we using the joint probability to estimate the target label distribution only approximately, never exactly?
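To make my point concrete, here is a small numerical sketch (toy, made-up 2-class numbers) showing that the total-probability identity holds when $c_{ij}$ is the conditional probability $P(\hat y = y_i \mid y = y_j)$, but not when it is the joint probability $P(\hat y = y_i, y = y_j)$:

```python
import numpy as np

# Hypothetical joint probabilities J[i, j] = P(yhat = i, y = j)
# on the source (training) distribution.
J = np.array([[0.5, 0.1],
              [0.1, 0.3]])

p = J.sum(axis=0)   # label marginal p(y_j): column sums -> [0.6, 0.4]
mu = J.sum(axis=1)  # prediction marginal mu(yhat_i): row sums -> [0.6, 0.4]

# Conditional confusion matrix C[i, j] = P(yhat = i | y = j) = J[i, j] / p(y_j).
C = J / p           # broadcasting divides each column j by p[j]

# Total probability holds with the *conditional* matrix:
print(np.allclose(C @ p, mu))   # True
# ...but fails with the *joint* matrix:
print(np.allclose(J @ p, mu))   # False
```

This is exactly the mismatch I am asking about: only the column-normalized (conditional) matrix satisfies $\sum_j c_{ij}\, p(y_j) = \mu(\hat y_i)$.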
Looking forward to your reply!