MechMicroMan / DefDAP

A python library for correlating EBSD and HRDIC data
Apache License 2.0

Move sample data to a repository #62

Open JQFonseca opened 4 years ago

JQFonseca commented 4 years ago

I installed DefDAP on a different computer today and it took a long time to download everything, primarily because the example data (which is needed) is relatively large. Could I suggest we move it to a repository like Zenodo, and then have a command in the example notebook to download it?
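For illustration, a download step of the kind suggested here could look roughly like the sketch below. It is not part of DefDAP: the function name `fetch_example_data` is hypothetical, and the Zenodo URL in the comment is a placeholder, since no record exists in this thread. It only uses the standard library and skips the download when the file is already present.

```python
import urllib.request
from pathlib import Path


def fetch_example_data(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return the path.

    Hypothetical helper, not part of DefDAP.
    """
    dest = Path(dest)
    if dest.exists():
        # Already downloaded on a previous run, so skip the network round trip.
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())
    return dest


# Placeholder usage in a notebook cell (record id and file name are assumptions):
# fetch_example_data(
#     "https://zenodo.org/records/<record-id>/files/<file-name>",
#     "example_data/<file-name>",
# )
```

Reading the whole response into memory is fine for files of a few tens of MB; a larger dataset would be better streamed in chunks.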

rhysgt commented 4 years ago

I believe the data that we use in the example notebook is now contained within the tests folder and is small compared to the old example data. The old example data is still in the repo in the example_data folder though, and I don't think we use it (@mikesmic can confirm). If we remove that, the repo will be ~20 MB, which I think is reasonable.

Were you downloading through GitHub Desktop, by the way? I have found that to be very slow for some reason that is not directly related to repo size. It sometimes takes a long time even on a fast connection.

mikesmic commented 4 years ago

The example notebook now only uses the data from the tests directory, which contains 8.7 MB of data - 4.9 MB of that is a ctf file, which we should maybe make smaller as it's not used in the example notebook. I will delete the example data directory in develop (I thought I had already done this tbh), which will cut out 36 MB (60-70% of the total size).

It would be great to work towards having a library of example datasets, defined with consistent filenames and formats to automatically pull into a notebook.

mikesmic commented 4 years ago

This is still an issue. Cloning downloads 321.29 MiB of data. What's being downloaded? Does cloning include the whole history? Any ideas @merrygoat ?

merrygoat commented 4 years ago

Yes, the hidden .git folder has all of the historical diffs - you should be able to check this by doing a shallow clone:

git clone --depth [depth] [remote-url]

Where depth is the number of commits to fetch, counting back from the tip of the branch.
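One way to see how much of a clone is history is to compare the size of the hidden .git folder with the size of the whole checkout, for a full clone and for a shallow one. The helper below is a minimal sketch using only the standard library; the name `dir_size_mib` and the paths in the comment are assumptions for illustration.

```python
import os
from pathlib import Path


def dir_size_mib(path):
    """Return the total size in MiB of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            # Skip symlinks so linked files are not counted at their target size.
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total / (1024 * 1024)


# Placeholder usage, assuming the repo was cloned into ./DefDAP:
# print(dir_size_mib("DefDAP/.git"), dir_size_mib("DefDAP"))
```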

You can use git filter-branch to edit the history and remove the files, but do be careful: editing the history is slightly dangerous. Best to do it locally first and make sure you are happy before a force push.

aplowman commented 4 years ago

Publishing on PyPI is a better approach than fiddling with the git history, isn't it?

merrygoat commented 4 years ago

I didn't think of it like that, but yes certainly.

mikesmic commented 4 years ago

I had a look at finding big files and deleting them from the history. I found a decent guide (https://web.archive.org/web/20190207210108/http://stevelorek.com/how-to-shrink-a-git-repository.html) but it scares me. I will publish to PyPI for now; it's daft that I haven't done that yet.
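The "finding big files" half of that job is read-only and safe to try. The sketch below is not taken from the linked guide; it is one possible approach that feeds `git rev-list --objects --all` into `git cat-file --batch-check` and sorts the blobs by size. The function names are hypothetical, and the sizes reported are uncompressed object sizes, so they overstate what each file costs inside the packed .git folder.

```python
import subprocess


def parse_blob_sizes(batch_check_output):
    """Parse 'type sha size path' lines and return (size, path) for blobs, largest first."""
    blobs = []
    for line in batch_check_output.splitlines():
        # Split into at most four fields so paths containing spaces stay intact.
        parts = line.split(" ", 3)
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[3]))
    return sorted(blobs, reverse=True)


def largest_blobs(repo=".", top=10):
    """List the `top` largest file versions anywhere in the history of `repo`."""
    # Every object reachable from any ref, with the path it was first seen at.
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Ask git for the type and size of each object; %(rest) echoes the path back.
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_blob_sizes(sizes)[:top]
```

Running `largest_blobs()` inside a clone only reads the object database; nothing is rewritten, so it can be used to decide whether editing the history is worth the risk.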