elizabethmcd / metabolisHMM

Tool for constructing phylogenies and summarizing metabolic characteristics based on curated and custom profile HMMs
GNU General Public License v3.0
17 stars 5 forks source link

Create a conda release #45

Open vinisalazar opened 4 years ago

vinisalazar commented 4 years ago

Hi @elizabethmcd ,

thank you for providing this package. I'm excited to use it.

I believe providing a conda release for it would greatly improve it and attract users. I'm attempting one myself for Bioconda over here and would like to kindly invite you to collaborate, if you are keen :)

I'm not very experienced with conda so your help as the main developer would be invaluable.

Thank you for any assistance you can provide,

V

elizabethmcd commented 4 years ago

Hi, I don't have much experience with making conda packages, although I do agree it would be useful and alleviate some installation issues. I can try to look into it over the coming weeks.

elizabethmcd commented 4 years ago

Ah I see that you already started the process and got really far with it! I am wondering how when I make updates/bug fixes how to then fix the conda version with updates, as updating the pip version is relatively easy. Thanks for moving forward with this, I think it will be useful for the community!

vinisalazar commented 4 years ago

Hi,

Glad you find it useful.

As far as making updates, whenever a version is released here on GitHub, it would be necessary to make a PR to the meta.yaml in the bioconda-recipe repository:

Unless a requirement is added, then the requirements section must also be updated.

Best, V

vinisalazar commented 4 years ago

I've noticed that the PyPI version is more updated than the version here on GitHub, so I'll modify the URL and sha256 to use the PyPI tarball. Would that be better?

I have a question. Where does the package look for the 'curated_markers' directory? Is it hardcoded into the package path or attached to an environment variable? The maintainers over at @bioconda were concerned with its path, and whether it could be found by the package (please see here).

Thank you for any assistance you can provide,

V

elizabethmcd commented 4 years ago

Yes, I believe the most stable release is the version through PyPi and not through the "official release" section here on Github. I haven't made another official release as I'm still fixing a bug here and there and wanted to wait until those were done, but have still been updating the version on pyPi.

To get the curated markers, you have to manually download them from here: https://github.com/elizabethmcd/metabolisHMM/releases/download/v1.9/metabolisHMM_markers_v1.9.tgz. And then usually they had to be within the directory that you were running the specific metabolisHMM program. I'm not sure how that would work with a conda release, as its looking for it in that directory. I would maybe have to create an argument where the user can decide where the curated markers are stored.

vinisalazar commented 4 years ago

I see.

I noticed that the GitHub release is bundled with the marker files but the PyPI is not. Therefore, it would be 'easier' to use the GitHub release rather than the PyPI. But if the PyPI is more stable, it's probably best to use it and manually download the marker files. This step can be described in the build.sh file of the conda recipe. Alternatively, since the marker file is relatively small, would you consider bundling it along with the PyPI package?

And then usually they had to be within the directory that you were running the specific metabolisHMM program. I'm not sure how that would work with a conda release, as its looking for it in that directory. I would maybe have to create an argument where the user can decide where the curated markers are stored.

I'm aware it would require some work to implement this, but I believe it would be a best practice. If I understand correctly, this would cause complications in running metabolisHMM anywhere in the machine, as it would require my curated_markers/ directory to be there, is this right?

I believe the easiest way to do this would be to simply point the function calls to a variable and have this variable set as default to the curated_markers/ relative path within the metabolisHMM package directory (at the same level of the bin/ directory). And, if the user wishes to change it, simply alter the variable and reinstall the package. This way this is confined to the scope of the Python package and doesn't require exporting environment variables (although maybe the folks at @bioconda have a better solution)

Hope you find the suggestions useful.

V

elizabethmcd commented 4 years ago

Yes, they are relatively small files and could be built with the pypi package in like a data folder so that it's just already in the users path no matter where they are I think. I would have to do a little tinkering to get that up and going. I think GTDB-tk has data files that you download manually because they are so large, whereas something like antismash I think has a command that you download the HMM files but they are put in the path after you install with conda: https://docs.antismash.secondarymetabolites.org/install/ Somehow they have a command download-antismash-database after the conda package/environment is built. That might be another possibility.

vinisalazar commented 4 years ago

Yes, for the GTDB-Tk it is definitely impracticable to have them bundled with the package. I noticed the marker files compress really well (3 MB in the .tar.gz and 18MB uncompressed), so I'd suggest bundling them compressed, but either way is good.

I think it is possible to add the uncompress step in the setup.py file. So the user could download the package with the tar.gz, and the data would uncompress upon installing. This would also eliminate the uncompress step in the build.sh of the conda recipe, simplifying the build for both PyPI and conda.

I'll leave it to you to figure out how you want to implement this, but let me know if I can help in any way.

Edit: I thought it would be better to wait to implement this before changing the recipe's tarball to PyPI. So I'll try and publish v1.9 for now and can update it when the other one is out.

Best, V

marcelm commented 4 years ago

Hi, I’m the one who was reviewing the Conda recipe over at Bioconda. I would also recommend distributing the marker files as part of the PyPI distribution. It would be easiest to store them uncompressed since then there’s no additional unpacking step. The PyPI distribution is compressed anyway, so having a compressed archive within it won’t gain anything. Also, if you ever need to update individual files, you don’t have to replace the entire .tar.gz file in the Git repository, but only the affected files. This keeps the repository smaller. (By the way, the tarball contains a couple of macOS-specific files starting with ._. You may want to avoid adding them to the repo.)

Just because I’ve recently done this in one of my projects, I thought I’d mention how you could practically go about doing this. Feel free to ignore this suggestion.

First, you would need to rename the bin directory. Since it contains your package (it has the __init__.py file), it should have the same name as the package, that is, it should be named metabolisHMM (or metabolishmm if you want to be a bit more consistent with other Python modules that usually use lowercase names). Optionally, you can also put it into a src directory. In short, I suggest you do mkdir src && git mv bin src/metabolisHMM). If you decide to use the src layout, you also need to use package_dir={"": "src"} and packages=find_packages("src") in your setup.py.

If you then move your data files into a subdirectory of the metabolisHMM directory, you can access them from your script with the functions in the importlib_resources module. It’s been part of the standard library since Python 3.7, and on Python 3.6, it can be installed from PyPI. See its documentation.

Next, you would need to ensure that the files are actually included in the PyPI package by adding a MANIFEST.in file that contains the line graft src/metabolisHMM/curated_markers.

Finally, you may need to add include_package_data=True to your setup() parameters, but I’m not sure about that.

vinisalazar commented 4 years ago

This is great, thanks @marcelm

elizabethmcd commented 4 years ago

Thanks for these suggestions @marcelm , they are really helpful. I'll try to get around to making these changes and some other small fixes within a week or so. The only thing I can foresee taking a bit is the importlib_resources module just because I'm not familiar with it, but will look at the documentation to see if I can figure that out.

marcelm commented 4 years ago

Note that the importlib.resources may not be the way to go if you want users to be able to easily change the HMM files. Then using an environment variable and/or command-line option would indeed be a better option. But in any case, the advice about renaming the bin directory still holds independent of that.