dev-cap / MLCAT

Analysis of mailing lists for detecting communication patterns
GNU General Public License v3.0

Move data to separate repository #78

Open prasadtalasila opened 7 years ago

prasadtalasila commented 7 years ago

The data directory makes the repository too large. Cloning the code base becomes a bit of an issue on the integration servers and also for new users who would like to try out our project.

Should we move the data directory out of the repository and provide a configuration file which can have variables pointing to the correct location?
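
One possible shape for such a configuration lookup is sketched below. The environment variable `MLCAT_DATA_DIR`, the file name `mlcat.cfg` and the `[paths] data_dir` option are all hypothetical names chosen for illustration; nothing in the project defines them yet.

```python
import configparser
import os
from pathlib import Path

# Hypothetical names: MLCAT_DATA_DIR, mlcat.cfg and [paths] data_dir are
# illustrative assumptions, not existing project settings.
DEFAULT_CONFIG = Path("mlcat.cfg")


def resolve_data_dir(config_file=DEFAULT_CONFIG, environ=os.environ):
    """Return the data directory: env var first, then config file, then ./data."""
    env_value = environ.get("MLCAT_DATA_DIR")
    if env_value:
        return Path(env_value).expanduser()

    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist.
    parser.read(config_file)
    if parser.has_option("paths", "data_dir"):
        return Path(parser.get("paths", "data_dir")).expanduser()

    # Fall back to the current in-repository layout.
    return Path("data")
```

With a lookup like this, the integration servers could point at a shared data location through the environment variable, while a fresh clone without any configuration would still look in `data/`.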

achyudh commented 7 years ago

A while back I had suggested making a separate repository to host our data files. We can place a README in the data folder with instructions and a script that clones and extracts the other data repo.

Please let me know what you think.

prasadtalasila commented 7 years ago

My bad that I did not recognize the problem at that point. We can store the data in a separate repo. That's the easy part. The hard part is to carefully shrink and prune this repo by removing the data/ directory completely.

What is the best way forward?

achyudh commented 7 years ago

We can permanently remove the data folder from the repository by using:

git filter-branch --tree-filter 'rm -rf data/' HEAD

This command iterates through the whole commit history reachable from HEAD, rewriting each commit object and its tree. However, it is best to back up our data before doing so. I am not sure how we can push this to GitHub, as we need to ensure all branches and tags are pushed to the remote.

achyudh commented 7 years ago

We can try this to force-push all branches and tags to GitHub.

git push origin --force --all
git push origin --force --tags

prasadtalasila commented 7 years ago

I tried the following commands on all the branches.

> git filter-branch -f --tree-filter 'rm -rf data/' HEAD
> rm -Rf .git/refs/original
> rm -Rf .git/logs/
> git gc --prune=now

The data/ directory is removed from all the rewritten commits of all the branches. I removed the GitHub remote and kept only the master, development, gh-pages and java_deprecated branches. But the size is still the same (253MB).

> git gc --aggressive --prune=now

reduced the repo size to 51MB. I think there were a few large objects in lib/data in one of the past commits. I will try to remove them and report back.
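
To check which objects in the history are still large, something along these lines might help. It is only a sketch: it assumes `git` is on the PATH and relies on `git rev-list --objects --all` piped into `git cat-file --batch-check`; the function names are made up for this example.

```python
import subprocess


def parse_blob_sizes(batch_check_output):
    """Parse '<type> <sha> <size> <path>' lines; return (size, path) for blobs, largest first."""
    blobs = []
    for line in batch_check_output.splitlines():
        # maxsplit=3 keeps paths that contain spaces in one piece.
        parts = line.split(" ", 3)
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[3]))
    return sorted(blobs, reverse=True)


def largest_blobs(repo_dir=".", top=10):
    """List the largest blobs reachable from any ref in the repository."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    sizes = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo_dir, input=objects, capture_output=True, text=True, check=True,
    ).stdout
    return parse_blob_sizes(sizes)[:top]
```

Calling `largest_blobs()` from the repository root should list the biggest files that are still reachable from some branch or tag, which would show whether anything under lib/data is left.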

prasadtalasila commented 7 years ago

The shrunk repository which stands at 51MB is posted at https://github.com/prasadtalasila/MailingListParser-Shrunk

prasadtalasila commented 7 years ago

Update: After performing aggressive garbage collection and pushing the repo to GitHub, I found the effective cloned repository size to be 20MB. This new size is a bit more manageable. What part of the code do we need to change in order to cleanly separate data and code? If we can do this with little effort, we can do the separation now.

Do check the shrunk repository to make sure that we are not missing any crucial project components.

kaivalyar commented 7 years ago

Git submodules are the way to manage such issues - when a software project relies on another project (containing data, or some other dependency).

You can modify this example, to suit your exact needs: Separate the original MailingListParser repo into 2: a new MailingListParser repo and a MailingListData repo. Store the data in the MailingListData repo - this will probably have a large size. The code goes into the new MailingListParser, and is much smaller (and therefore faster to clone)

Now add the MailingListData repo as a submodule to the new MailingListParser.

git now treats the two as separate repos, but tracks the submodule based on a single commit from the MailingListParser project.

This way, there is no need to post instructions in the README along with a script as @achyudhk suggested, and instead, a git submodule update/clone command can be run by users who want to fetch the data.

Here is an article from GitHub explaining this more thoroughly: https://github.com/blog/2104-working-with-submodules

prasadtalasila commented 7 years ago

Thanks @kaivalyar for the suggestion on git submodules. Just to balance the debate, here is an article on the downside of git submodules. Do see the comments of that article as well.

@achyudhk I also notice that the data has both input files (mbox) and output files (png, csv, html, json). If we can quickly generate the output files (< 6 hours on server), then we can remove results from the repository completely.

For our scenario, we are not managing any code in data/. It contains only large data files (mbox, png, json and html). We may be better off using git-lfs.

Another suggestion is to use BFG repo cleaner instead of git filter-branch. What do you guys say?

kaivalyar commented 7 years ago

Thank you for mentioning git-lfs. I did not know about it earlier. It does seem to suit this particular use case better.

I did some more reading on git submodules, and found that many people have had issues with them. Git introduced the git subtree to address some of those (https://www.atlassian.com/blog/git/alternatives-to-git-submodule-git-subtree).

Finally, however, git-lfs might be the best way to proceed since there is no single data/ folder containing the large files or anything of that sort.

achyudh commented 6 years ago

@prasadtalasila I forgot to mention that we are using the output files in the documentation (used to be Wiki) to summarize our results. Plus these results are for multiple time intervals so it would take time to generate all of them. Can we please keep them in the data repository?

Also we can use the BFG Repo-Cleaner. I looked up the functionality and it is all we need.

prasadtalasila commented 6 years ago

The images / results used in wiki can be moved to a different directory, say data/docs or something else more meaningful.

Any other results that we can generate in less than six hours of machine time (Core i3 processor, 12GB RAM), we can safely remove. Let's start there and see the improvements in the size. If those parts can be clearly identified, we can start with this repo again, remove the identified files and push the changes to the shrunk repository.

After all of us review the changes proposed, the changes can be implemented on this repository.

prasadtalasila commented 6 years ago

@achyudhk Just to be safe, please list here the files / directories that you are proposing to remove.

achyudh commented 6 years ago

This is the list of files to remove (it contains all images and data that we include in the documentation; we can manually move them to doc/ later):

This is a list to transfer to the data-only repo:

prasadtalasila commented 6 years ago

@yashpungaliya you can implement the suggested changes on MailingListParser-Shrunk first. If the changes are yielding good results on that repository, we can bring those changes to this repository.

Instead of manually redoing all of these steps twice, it is better to write a shell script that does what we intend and then cross-check the script. That way, all of us are sure of the changes being applied to the repository.

prasadtalasila commented 6 years ago

@AakankshaSanctis please look at this issue once the work on the PR is complete. You have also been given access to MailingListParser-Shrunk repository to try out different approaches.

AakankshaSanctis commented 6 years ago

@prasadtalasila All the images from the docs and wiki pages are added in a new wiki page

prasadtalasila commented 6 years ago

@AakankshaSanctis What you have done is take the image hyperlinks from the GitHub code repo and add those links to a wiki page. Now when the data/ directory is removed from the GitHub code repo, all the image links would be invalid. What was suggested was this.

  1. Clone the MLCAT.wiki repository. The wiki URL is https://github.com/DeveloperCAP/MLCAT.wiki.git
  2. Identify all the images being used in wiki pages, Sphinx docs, README.md and project_outline.md.
  3. Copy all the images identified in the previous step into the MLCAT.wiki/images directory.
  4. A better approach would be to install the git LFS plugin and make it operate on the images/ directory. (I don't mind if you don't do this now; but this is the best approach).
  5. Update the links to the images in wiki pages, Sphinx docs, README.md and project_outline.md.
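
Steps 2 and 5 could be partly automated for the Markdown pages with a small helper like the one below. This is a rough sketch, not a tested tool: it only handles the `![alt](target)` syntax, the `images` directory name comes from step 3, and the function names are made up here.

```python
import re

# Matches Markdown image syntax ![alt](target). HTML <img> tags and
# reStructuredText image directives in the Sphinx docs are not covered,
# and targets carrying a query string would need extra handling.
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def find_image_links(markdown_text):
    """Return the link targets of all Markdown images in the text."""
    return [match.group(2) for match in IMAGE_LINK.finditer(markdown_text)]


def rewrite_image_links(markdown_text, images_dir="images"):
    """Point every image link at <images_dir>/<file name>."""
    def replace(match):
        # Keep only the file name; the directory part is replaced.
        file_name = match.group(2).rsplit("/", 1)[-1]
        return f"![{match.group(1)}]({images_dir}/{file_name})"
    return IMAGE_LINK.sub(replace, markdown_text)
```

Running `find_image_links` over each page gives the list of files to copy in step 3, and `rewrite_image_links` produces the updated page text for step 5. Two images with the same file name in different directories would collide, so the list should be checked by hand first.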

prasadtalasila commented 6 years ago

@AakankshaSanctis Please add the list of files to be changed to this issue.

prasadtalasila commented 6 years ago

The python code files that have hard-coded data path information are:

lib/test_integration.py
lib/util/layout.py
driver_author_analysis.py
driver_headers_mbox.py
driver_thread_analysis.py
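
To cross-check this list, a quick scan for string literals that start with `data/` might be useful. The sketch below assumes the hard-coded paths look like `"data/..."` or `"./data/..."`; paths built in other ways (for example with `os.path.join`) would be missed, and the function names are only illustrative.

```python
import re
from pathlib import Path

# Assumption: hard-coded paths appear as quoted literals beginning with
# data/ or ./data/. Anything assembled at run time is not detected.
HARD_CODED = re.compile(r"""["'](?:\./)?data/[^"']*["']""")


def find_hard_coded_paths(source_text):
    """Return (line number, literal) pairs for string literals under data/."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        for match in HARD_CODED.finditer(line):
            hits.append((number, match.group(0)))
    return hits


def scan_files(paths):
    """Map each existing file to its hard-coded data path literals."""
    report = {}
    for path in map(Path, paths):
        if path.is_file():
            hits = find_hard_coded_paths(path.read_text(encoding="utf-8"))
            if hits:
                report[str(path)] = hits
    return report
```

Passing the five files above to `scan_files` would show the exact lines to change; running it over every `.py` file in the repository could reveal files missing from the list.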