biocore / American-Gut

American Gut open-access data and IPython notebooks

The way American-Gut repo was intended to be used? #204

Closed iugrina closed 8 years ago

iugrina commented 8 years ago

Hi,

I've been struggling with American-Gut repo and the way I should use it for the past few days. If I understood correctly the repo is broken into a package ('americangut' dir) and auxiliary files. Some of these files are intended to be used by the package itself while others are for interactive sessions with ipython notebooks for example.

In #199 (https://github.com/biocore/American-Gut/issues/199), @JWDebelius recommends installing the package with pip install . -e --no-deps, so the americangut dir was indeed intended to be used as a package. Still, this will not install the latex and tests folders listed in package_data, since setup.py seems to be a bit misconfigured (package_data files should live inside the package's source dir).
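To illustrate the package_data point: setuptools resolves package_data patterns relative to each package's own directory, so folders that sit next to the package are not picked up. The following is a minimal sketch of the keyword arguments a setup.py might use. It assumes a hypothetical layout in which latex/ and tests/data/ have been moved inside americangut/, which is not how the repo discussed here is organized.

```python
# Sketch of the keyword arguments a setup.py could pass to setuptools.setup().
# Assumption (hypothetical layout): latex/ and tests/data/ live inside the
# americangut/ package directory rather than next to it.
SETUP_KWARGS = {
    "name": "americangut",
    "packages": ["americangut"],
    # Patterns are resolved relative to the package directory, so they
    # cannot reach folders located outside of americangut/.
    "package_data": {
        "americangut": ["latex/*", "tests/data/*"],
    },
}

# In a real setup.py this would be followed by:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
```

With the folders left outside the package, as in the repo at the time of this thread, these patterns would match nothing and the files would not be installed.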

Also, running (e.g.) 01-get_sequences_and_metadata.md will fail on study_accessions = agenv.get_study_accessions() since it calls get_repository_dir (from results_utils.py) which will strangely take a part of the full path (outside of the package dir) and will try to find 'data' and 'latex' there. Moreover, 'data' isn't even specified in the setup.py.

Therefore, I'm not quite sure how I should use the repo. Should I define PYTHONPATH to include the repo and PATH to include scripts without installing the package, or should I install the package (as recommended by @JWDebelius)? If I need to install it, what else do I need to adjust to make it work (PATHs, PYTHONPATHs, ...)?

jwdebelius commented 8 years ago

The auxiliary files are primarily intended for use in the notebooks. At this point, analysis is wrapped into the notebook. Over the course of the project, there has been an evolution in the best way to call these functions within a notebook (command line utils vs imported functions). There has also been an evolution in the best environment and package management approach.

The installation described in #199 is reflective of the current conda install. As far as I can tell from repeated research, conda doesn't easily support pythonpath modifications. The best suggestion I've seen is modifying a .pth file, which has its own set of challenges. Therefore, it's necessary to include a setup.py and install the repository using pip if you wish to have the auxiliary code work in the environment.
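For reference, the .pth approach mentioned above relies on Python's site module: at startup it reads *.pth files in site-packages and appends each listed directory that exists to sys.path. A minimal sketch follows. The function name write_pth and the file name americangut_repo.pth are made up for illustration and are not part of the repo.

```python
import os
import sysconfig


def write_pth(repo_dir, site_packages=None, name="americangut_repo.pth"):
    """Write a .pth file so that repo_dir is added to sys.path at startup.

    The function and file names are hypothetical, chosen for this sketch.
    """
    if site_packages is None:
        # "purelib" is the site-packages directory of the active environment.
        site_packages = sysconfig.get_paths()["purelib"]
    pth_path = os.path.join(site_packages, name)
    with open(pth_path, "w") as handle:
        # One directory per line; the site module skips entries that do not exist.
        handle.write(os.path.abspath(repo_dir) + "\n")
    return pth_path
```

The change only takes effect for interpreters started after the file is written, and it has to be repeated for every environment, which is part of why it is awkward to maintain.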

If you're using another environment manager (virtualenv, for instance) which lets you modify the pythonpath, it's preferable to modify the path and pythonpath.

iugrina commented 8 years ago

Thank you for the reply.

I've tried it now with conda (instructions from #199) and it still doesn't work. Folders data, latex and tests are not installed as a part of the package (if that was the intention) and running 01-get_sequences_and_metadata.md as an ipython notebook with AG_TESTING=True gives

study_accessions = agenv.get_study_accessions()
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-3-4ce98b7f14da> in <module>()
----> 1 study_accessions = agenv.get_study_accessions()

/home/iugrina/miniconda2/envs/americangut/lib/python2.7/site-packages/americangut/notebook_environment.pyc in get_study_accessions()
   2256     """
   2257     if ag.is_test_env():
-> 2258         _stage_test_accessions()
   2259         return _TEST_ACCESSIONS[:]
   2260     else:

/home/iugrina/miniconda2/envs/americangut/lib/python2.7/site-packages/americangut/notebook_environment.pyc in _stage_test_accessions()
   2318     sourced from EBI.
   2319     """
-> 2320     repo = get_repository_dir()
   2321     for acc in _TEST_ACCESSIONS:
   2322         src = os.path.join(repo, 'tests/data/%s' % acc)

/home/iugrina/miniconda2/envs/americangut/lib/python2.7/site-packages/americangut/results_utils.pyc in get_repository_dir()
     55 
     56     # get_path verifies the existance of these directories
---> 57     get_path(expected, 'data')
     58     get_path(expected, 'latex')
     59 

/home/iugrina/miniconda2/envs/americangut/lib/python2.7/site-packages/americangut/results_utils.pyc in get_path(d, f)
     46     """Check and get a path, or throw IOError"""
     47     path = os.path.join(d, f)
---> 48     check_file(path)
     49     return path
     50 

/home/iugrina/miniconda2/envs/americangut/lib/python2.7/site-packages/americangut/util.pyc in check_file(f, e)
    146     """Verify a file (or directory) exists"""
    147     if not os.path.exists(f):
--> 148         raise e("Cannot continue! The file %s does not exist!" % f)
    149 
    150 

IOError: Cannot continue! The file /home/iugrina/miniconda2/envs/americangut/lib/python2.7/site-packages/data does not exist!

Therefore, IMHO the problem isn't in conda vs pip. Since americangut is installed as a package, get_repository_dir will obviously miss the correct repo dir with the data/tests/latex folders. The only way I see get_repository_dir finding the correct repo dir is if it is sourced from American-Gut/americangut/results_utils.py (not from the package). However, this way 01-get_sequences_and_metadata.md won't know about it since the American-Gut repo isn't in the PYTHONPATH, and therefore it will import the package version.
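The failure can be reproduced with a small sketch. This is a reconstruction from the traceback above, not the actual code in results_utils.py: it assumes the repository dir is taken to be the parent of the directory that holds the module file, and the helper names are hypothetical.

```python
import os


def guess_repository_dir(module_file):
    """Return the directory one level above the package holding module_file."""
    package_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.dirname(package_dir)


def missing_repo_dirs(repo_dir, required=("data", "latex")):
    """List the required sub-directories that do not exist under repo_dir."""
    return [name for name in required
            if not os.path.exists(os.path.join(repo_dir, name))]
```

For a module imported from a clone (American-Gut/americangut/results_utils.py) the guess is the clone itself, where data and latex exist. For a module imported from site-packages/americangut/ the guess is site-packages, where neither folder exists, which matches the IOError above.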

I would like to help with improving this (making it more reproducible, working on different platforms, ...) but I need to know what was the intended way to run it. An example from scratch would help a lot, along with comments on the following question:

- Are data, latex and tests folders intended to be a part of the package or just a part of the repo?

wasade commented 8 years ago

Thanks, Ivo. Data and latex are intended to be part of the repo. I recommend looking at what is done via travis.yml. I admit, our internal uses just clone the repo so having setup.py is a bit confusing. However, we'd be excited to see install/deploy improve.

iugrina commented 8 years ago

Thanks. If it is intended to be used only as a repo then adjusting PYTHONPATH and PATH should be enough.
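A rough in-process equivalent of that adjustment, for use at the top of a notebook or script, could look like the sketch below. The function name use_repo_clone is hypothetical, and the scripts sub-directory name is an assumption that should be checked against the actual clone.

```python
import os
import sys


def use_repo_clone(repo_dir, scripts_subdir="scripts"):
    """Make a cloned repo importable and put its scripts on PATH.

    scripts_subdir is an assumed directory name, not confirmed by the repo.
    """
    repo_dir = os.path.abspath(repo_dir)
    # Same effect as adding the clone to PYTHONPATH for this process.
    if repo_dir not in sys.path:
        sys.path.insert(0, repo_dir)
    scripts_dir = os.path.join(repo_dir, scripts_subdir)
    # Drop empty PATH entries so no stray separator is left behind.
    entries = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if scripts_dir not in entries:
        os.environ["PATH"] = os.pathsep.join([scripts_dir] + entries)
    return repo_dir, scripts_dir
```

This has to run before americangut is first imported; if an installed copy was already imported in the session, Python keeps using that cached module.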

iugrina commented 8 years ago

Resolved with #211