fatiando / pooch

A friend to fetch your data files
https://www.fatiando.org/pooch

Inconsistent file hashes between operating systems #410

Closed AKuederle closed 4 months ago

AKuederle commented 5 months ago

I am using pooch to create a data registry for all files in this folder (https://github.com/mobilise-d/mobgap/tree/main/example_data) by calling pooch.make_registry(str(HERE / "example_data"), str(REGISTRY_PATH)).

This worked perfectly, but one of my colleagues noticed that the registry hashes change if she reruns the registry creation on her machine. The main difference is that she is running Windows and I am running Linux.

Proper replication is a little difficult, as I don't have access to a Windows machine at the moment, but I could replicate it using the GitHub CI. When using a Windows runner, I get different hashes.

I wanted to ask if this is expected behavior (I assume not) and if it is a known issue (I could not find any other issues mentioning it).

Something that I haven't tested yet: this might be related to the way Git handles line endings, so that on the Windows machine the files are actually different.

System Information

AKuederle commented 5 months ago

Ok I can confirm that it is a line-ending issue.
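To illustrate why line endings alone are enough to change a registry entry, here is a minimal sketch using only Python's standard `hashlib`. The file content is made up for illustration and does not come from the mobgap repository; the point is only that the same text checked out with LF versus CRLF line endings is a different byte sequence and therefore has a different SHA256 digest.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical file content as it would be checked out on Linux (LF endings).
lf_content = b"time,value\n0,1.5\n1,2.5\n"

# The same content as Git may check it out on Windows when line-ending
# conversion is active (CRLF endings).
crlf_content = lf_content.replace(b"\n", b"\r\n")

lf_hash = sha256_hex(lf_content)
crlf_hash = sha256_hex(crlf_content)

# The text "looks" identical in an editor, but the hashes do not match.
print("LF  :", lf_hash)
print("CRLF:", crlf_hash)
print("equal:", lf_hash == crlf_hash)
```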

Applying the following change to pooch (https://github.com/fatiando/pooch/compare/main...AKuederle:pooch:main) results in consistent file hashes, BUT it also changed the hashes on Linux, as it replaced accidental (?) occurrences of the Windows line-ending byte pattern in some binary files I had. So a more sophisticated solution might be required.

But in general, it might be a good idea to have an option to normalize line endings, to allow people on different operating systems to collaborate on the hash list when the files are stored in Git.
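The binary-file side effect described above can be sketched as follows. This is not the linked patch (its contents are not shown in this thread); it is an assumed, naive normalization that replaces every `\r\n` byte pair with `\n` before hashing. The helper name `normalize_line_endings` and the sample bytes are hypothetical.

```python
import hashlib


def normalize_line_endings(data: bytes) -> bytes:
    """Naively replace every CRLF byte pair with LF (assumed approach)."""
    return data.replace(b"\r\n", b"\n")


# For a text file this gives the desired result: both variants hash the same.
text_lf = b"a,b\n1,2\n"
text_crlf = b"a,b\r\n1,2\r\n"
same_text_hash = (
    hashlib.sha256(normalize_line_endings(text_lf)).hexdigest()
    == hashlib.sha256(normalize_line_endings(text_crlf)).hexdigest()
)

# A binary file can contain the bytes 0x0D 0x0A by coincidence.
# Here they are ordinary data, not a line ending.
binary_blob = bytes([0x00, 0x0D, 0x0A, 0xFF, 0x10])
normalized_blob = normalize_line_endings(binary_blob)

# The normalization silently drops the 0x0D byte, so the hash no longer
# describes the file that is actually on disk.
print("text hashes match:", same_text_hash)
print("binary unchanged :", normalized_blob == binary_blob)
```

A more careful variant would have to decide which files are text before touching them, which is essentially the decision Git already makes when it converts line endings.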

leouieda commented 5 months ago

Hi @AKuederle, thanks for reporting! It's indeed the line endings. We had this problem with our own test data. I'd be very hesitant to add this to Pooch itself, since it's a Git problem and not necessarily a Windows/Linux problem. The best workaround is to add a .gitattributes file to your repository like this one: https://github.com/fatiando/pooch/blob/main/.gitattributes. This way, the line endings of the data files are consistent between operating systems.

Could you try this out and let me know if it works?

AKuederle commented 4 months ago

Thanks for the fast response! Adapting the .gitattributes file as suggested fixed the issue :)

Seems like a proper solution to me. Closing this.