collectionslab / Omniscribe

9 stars 4 forks source link

repo structure and packaging/distribution #15

Open bertsky opened 4 years ago

bertsky commented 4 years ago

I just stumbled over this project. Excellent work!

IMO you should publish this in an academic paper, since many scholars (in DH and beyond) will be interested...

I have a few suggestions on making this more readily available to users. (And I'd be happy to at least partially volunteer, if you are interested.)

requirements.txt

To maximize utilty for other users, it's necessary to minimize the assumptions/constraints of your explicit dependencies. Currently, that list requires the exact version you happened to have installed when you did pip freeze – but your actual requirements will almost always be broader. Also, it's not necessary to list dependent packages there. E.g. instead of ...

h5py==2.8.0
numpy==1.15.1
keras==2.2.4
scipy==1.1.0

...you could just write keras<2.3.

But most of these are already required by Mask-RCNN anyway (see next point), so perhaps this list could be drastically reduced to just:

mask-rcnn>=2.1
pandas
requests
image

Mask-RCNN redistribution

It's not good practice to dump an open source repo's code into your own, for several reasons:

Instead, you should make mask-rcnn an external dependency, like the others. If you do need to make changes of your own, then do that in a Github fork, which you then can either integrate here as a git submodule or reference in requirements.txt via the GH URL of your fork.

Currently, IIUC, the only change you made is your custom.py and Jupyter notebooks, right?

separate code and data

You (thankfully) provided all your raw and intermediate data files along with the converter scripts and tools, so one can re-produce your work. But for re-use it's usually better to strictly separate data and code files. At least in the directory structure, but preferably also with distinct repos (pointing to the data repo in a submodule only).

distribute on PyPI

The most straightforward and visible distribution option for Python projects is the Python Packaging Index. To publish there, all you have to do is register for an account, bring your repo into shape (following the Python packaging guidelines – most notably, create a setup.py with entry_points and linking to requirements.txt), then use twine to build and upload your package.

kirschbombe commented 4 years ago

@bertsky Wow, thank you for all the feedback! I've been putting off cleaning up this repo and merging in some updates, so this is just the motivation I needed -- and changes I hadn't considered.

I'll begin working on some of these updates, but of course would not say no to any volunteer work :)

bertsky commented 4 years ago

@kirschbombe Splendid!

but of course would not say no to any volunteer work :)

unfortunately I just ran out of GPU computing quota, so I can only commence testing your project in April.

AFAICT the base revision you used is https://github.com/matterport/Mask_RCNN/commit/a49160a3f56afa3a61826f3e6b32ceb1fa528095. There have been a couple of important fixes on master since then. But strictly, one should compare the impact of these on your training+evaluation. I could do that myself (first reproducing the current state of affairs, then moving to mask-rcnn master and repeating everything), but for you it's probably a lot less effort.