clulab / reach

Reach Biomedical Information Extraction

Merge bioresources into REACH #730

Closed enoriega closed 3 years ago

enoriega commented 3 years ago

Every time we update the Bioresources project, we need to bump the version number and publish it to Maven Central before we can integrate the change into reach.

Given that reach is the only consumer of bioresources, it makes sense to merge the repositories to streamline the process.

This will be done by migrating the bioresources sbt project into reach as a subproject, similar to export or assembly.

MihaiSurdeanu commented 3 years ago

+1!

bgyori commented 3 years ago

Do you think there's a way to preserve the git history of the bioresources repo while migrating? I pretty often look at the history / blame there to figure out what happened when and why so it might be useful. (Maybe something like this: https://medium.com/@leyanlo/how-to-move-one-git-repository-into-a-subdirectory-of-another-with-rebase-2b297b628c57, or making use of https://git-scm.com/book/en/v2/Git-Tools-Submodules).

enoriega commented 3 years ago

I'll take a look. Worst case, we can trace back the changes in the history of the standalone repo. We need to keep it to support older versions.

enoriega commented 3 years ago

I am a little wary of the solution described in https://medium.com/@leyanlo/how-to-move-one-git-repository-into-a-subdirectory-of-another-with-rebase-2b297b628c57 because it involves rewriting git's history, which I think is too risky. I did test the submodule approach, though, and it works relatively well.

I created branch bioresources_submodule that has the following changes:

Doing this has the least friction in terms of changes, because we get to preserve the same repositories (and each one's history too). Bioresources changes would still be made in its own repo. The downside is a bit of git overhead at the initial clone and whenever bioresources is updated.

After cloning the repo, the user would have to execute:

git submodule init
git submodule update --progress

And to import changes from bioresources, pull with an additional flag:

git pull --recurse-submodules
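
For anyone who wants to see the clone-time steps in action without touching the real repositories, here is a minimal sketch on throwaway local repos. The repository and file names (`bioresources`, `reach`, `kb.tsv`, `user-clone`) are placeholders and do not reflect the layout of the `bioresources_submodule` branch. The `-c protocol.file.allow=always` option is only there because recent git versions refuse local-path submodules by default; it should not be needed with a hosted URL. The last step shows `git clone --recurse-submodules`, which initializes and updates submodules as part of the clone.

```shell
#!/bin/sh
# Sketch only: submodule workflow on throwaway repos in a temp directory.
set -eu

work=$(mktemp -d)
cd "$work"

# Demo identity so commits succeed on a machine without git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Placeholder for the standalone bioresources repo.
git init -q bioresources
(
  cd bioresources
  echo "kb v1" > kb.tsv
  git add kb.tsv
  git commit -q -m "bioresources: add kb"
)

# Placeholder for reach, with bioresources added as a submodule.
git init -q reach
(
  cd reach
  echo 'name := "reach"' > build.sbt
  git add build.sbt
  git commit -q -m "reach: initial commit"
  # protocol.file.allow is needed only because the submodule URL is a local path.
  git -c protocol.file.allow=always submodule --quiet add "$work/bioresources" bioresources
  git commit -q -m "Add bioresources as a submodule"
)

# End user, two-step variant: a plain clone leaves bioresources/ empty
# until the submodule is initialized and updated.
git clone -q reach user-clone
(
  cd user-clone
  git submodule --quiet init
  git -c protocol.file.allow=always submodule update --progress
)

# End user, one-step variant: clone and populate submodules together.
git -c protocol.file.allow=always clone -q --recurse-submodules reach user-clone-2

ls user-clone/bioresources user-clone-2/bioresources
```

Both clones should end up with the submodule's files checked out under `bioresources/`; the one-step variant simply spares the user the two extra commands.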

All of this represents a little extra effort for the end user and may be a source of confusion. The alternative would be to copy the files into the main repository, which would give a seamless transition at the cost of losing the history of bioresources.

@MihaiSurdeanu @kwalcock @bgyori what do you think?

MihaiSurdeanu commented 3 years ago

I agree that it's not worth porting the commit history from bioresources, since bioresources will continue to exist as a repo, and all those git commands scare me. But if @kwalcock thinks differently, I defer to him.

Also, I don't understand what the last 3 commands are for? Once bioresources is included in reach, we will no longer develop anything under the bioresources standalone repo. Am I missing something?

kwalcock commented 3 years ago

I don't know of a better solution offhand. I'd have to be googling. It seems like the thing someone might have written a script for and it would be really nice to be able to run it on a clone first to see what might happen. Did anyone look at things like https://www.nomachetejuggling.com/2011/09/12/moving-one-git-repo-into-another-as-subdirectory/ ?

enoriega commented 3 years ago

@MihaiSurdeanu Git allows you to "embed" another git repository into a main git repository with submodules.

For example, imagine I wrote a cool RL library that we want to use in reach, but I didn't publish it to Maven Central. Instead of requiring end users to install the library manually, you can include the actual sources of the other repo as a submodule (rather than copying the files by hand), and sbt sees all the files as if they belonged to the same project.

I did a proof-of-concept of this with bioresources to see how it works, and those commands would be necessary if we were to follow that route.

I think they're not worth the hassle though, as long as we don't need to port the commit history from bioresources.

kwalcock commented 3 years ago

FWIW I've noticed that Gus sometimes uses submodules.

enoriega commented 3 years ago

> I don't know of a better solution offhand. I'd have to be googling. It seems like the thing someone might have written a script for and it would be really nice to be able to run it on a clone first to see what might happen. Did anyone look at things like https://www.nomachetejuggling.com/2011/09/12/moving-one-git-repo-into-another-as-subdirectory/ ?

This looks clever and non-destructive. I will test it locally tomorrow to see the result.
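
A dry run of this kind can be sketched on throwaway repositories. The script below is a minimal illustration of one history-preserving approach: move the files into a subdirectory on a clone of the source repo, then merge that clone into the target with `--allow-unrelated-histories`. It is not necessarily the exact procedure from the linked article, and all repository and file names (`bioresources`, `reach`, `kb.tsv`, `bioresources-moved`) are placeholders. Nothing is rewritten: existing commits in both repos keep their hashes, and the original bioresources repo is never modified.

```shell
#!/bin/sh
# Sketch only: merge one repo into a subdirectory of another, keeping its commits.
set -eu

work=$(mktemp -d)
cd "$work"

# Demo identity so commits succeed on a machine without git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Placeholder for the standalone bioresources repo, with two commits.
git init -q bioresources
(
  cd bioresources
  echo "kb v1" > kb.tsv
  git add kb.tsv
  git commit -q -m "bioresources: add kb"
  echo "kb v2" > kb.tsv
  git commit -q -am "bioresources: update kb"
)

# Placeholder for reach.
git init -q reach
(
  cd reach
  echo 'name := "reach"' > build.sbt
  git add build.sbt
  git commit -q -m "reach: initial commit"
)

# Step 1: on a clone (the original stays untouched), move every file
# under the subdirectory it should occupy inside reach.
git clone -q bioresources bioresources-moved
(
  cd bioresources-moved
  mkdir bioresources
  git mv kb.tsv bioresources/
  git commit -q -m "Move files into bioresources/ before merge"
)

# Step 2: fetch that clone into reach and merge the unrelated history.
(
  cd reach
  git fetch -q ../bioresources-moved HEAD
  git merge -q --allow-unrelated-histories -m "Merge bioresources into reach" FETCH_HEAD
)

# The bioresources commits are now part of reach's history;
# --follow traces the file back across the move.
git -C reach log --oneline --follow -- bioresources/kb.tsv
```

Once the result looks right on the throwaway copies, the same two steps could be repeated on fresh clones of the real repositories before pushing anything.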