Generation of the cost matrix C

I have a question about the code that generates the cost matrix C. Why are the ground image features extracted by VGG fed directly into Sinkhorn, so that these features themselves serve as the cost matrix? Shouldn't some operation combine the features extracted from the ground image and the satellite image to produce the cost matrix? Or is there a problem with my understanding? Thank you very much.
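For illustration, here is a minimal NumPy sketch of the alternative the question describes: building a cost matrix from both feature sets and then running Sinkhorn on it. It is not taken from the repository. The function names `pairwise_cost` and `sinkhorn`, the squared Euclidean cost, the uniform marginals, the regularization value, and the random stand-in features are all assumptions made for the example.

```python
import numpy as np


def pairwise_cost(ground_feats, sat_feats):
    """Squared Euclidean distance between every ground/satellite feature pair.

    ground_feats: (n, d) array, sat_feats: (m, d) array. Returns an (n, m) array.
    """
    g2 = np.sum(ground_feats ** 2, axis=1, keepdims=True)  # (n, 1)
    s2 = np.sum(sat_feats ** 2, axis=1)                    # (m,)
    cost = g2 + s2 - 2.0 * ground_feats @ sat_feats.T      # (n, m)
    return np.maximum(cost, 0.0)  # clip tiny negatives caused by rounding


def sinkhorn(cost, reg=0.5, n_iters=200):
    """Entropy-regularized transport plan with uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # target row sums
    b = np.full(m, 1.0 / m)   # target column sums
    K = np.exp(-cost / reg)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale rows
        v = b / (K.T @ u)     # rescale columns
    return u[:, None] * K * v[None, :]


# Random stand-ins for the VGG features of the two views (hypothetical shapes).
rng = np.random.default_rng(0)
ground = rng.normal(size=(4, 8))
sat = rng.normal(size=(5, 8))

C = pairwise_cost(ground, sat)
P = sinkhorn(C / C.max())     # normalize the cost so exp() stays well scaled
```

In this sketch the cost depends on both views, which is the behaviour the question expects. Whether the repository intends something different, for example treating the network output itself as a learned cost, is for the author to confirm.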

YujiaoShi / cross_view_localization_CVFT

Generation of the cost matrix C #6