Here
https://github.com/Coopercoppers/PFN/blob/6173b3e6b048d1307766ee5d2f8178b30d6675b2/PFN-nested/model/pfn.py#L271
we see
I think it should be
Just from gradient-flow considerations: we have two almost identical modules, but because the inputs are swapped, h_re and h_share receive gradients from the upper layers for semantically different tasks/losses. Besides that, the corrected variant learns slightly better in my experiments.
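
To illustrate the gradient-routing point, here is a minimal, self-contained sketch. It is not the repository's actual code: `TaskUnit` and the `h_*` tensors are hypothetical stand-ins for the gate modules and hidden states in pfn.py. With a consistent (task state, shared state) argument order, each task loss backpropagates into its own task tensor plus the shared one; swapping the arguments in one call reroutes which tensor receives which loss's gradients.

```python
# Minimal sketch, assuming PyTorch; TaskUnit and the h_* tensors are
# hypothetical stand-ins, not the repository's actual classes.
import torch
import torch.nn as nn

class TaskUnit(nn.Module):
    """Toy gate: mixes a task-specific state with a shared state."""
    def __init__(self, hidden):
        super().__init__()
        self.proj = nn.Linear(2 * hidden, hidden)

    def forward(self, h_task, h_shared):
        # First slot = task state, second slot = shared state.
        return torch.tanh(self.proj(torch.cat([h_task, h_shared], dim=-1)))

hidden = 8
h_ner = torch.randn(1, hidden, requires_grad=True)
h_re = torch.randn(1, hidden, requires_grad=True)
h_share = torch.randn(1, hidden, requires_grad=True)

ner_unit, re_unit = TaskUnit(hidden), TaskUnit(hidden)

# Corrected order: both units take (task state, shared state), so each
# task loss backpropagates into its own task tensor and into h_share.
ner_loss = ner_unit(h_ner, h_share).sum()
re_loss = re_unit(h_re, h_share).sum()
(ner_loss + re_loss).backward()

# h_share accumulates gradients from both losses; h_ner and h_re only
# from their own loss. Calling re_unit(h_share, h_re) instead would push
# the RE loss's task-slot gradient into h_share and its shared-slot
# gradient into h_re, i.e. the tensors would be trained for the wrong roles.
print(h_ner.grad.norm(), h_re.grad.norm(), h_share.grad.norm())
```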