fatiando / rockhound

NOTICE: This library is no longer being developed. Use Ensaio instead (https://www.fatiando.org/ensaio). -- Download geophysical models/datasets and load them in Python
BSD 3-Clause "New" or "Revised" License

McMurray mess #46

Closed. JustinGOSSES closed this pull request 2 years ago.

JustinGOSSES commented 5 years ago

name: preprocessed McMurray facies dataset
about: requesting the addition of this as a new dataset

This is a dataset with facies and well log curve data from the McMurray and Wabiskaw formations in Alberta, Canada. More information about the processed dataset's history can be found here, and information about the original dataset here.


Desired dataset/model:


This is to add code for a facies dataset from the McMurray formation in Alberta, Canada.

I'll do some more work on this pull request to double-check format and style, as the make format command wasn't able to find Black.

I also need to think about tests.

In requirements.txt I had to add tables as a dependency, since I use a zipped HDF5 (.h5) file and pandas needs tables to open that type of file.
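For reference, loading an HDF5 file with pandas looks roughly like this (a minimal sketch; the file name and key below are placeholder assumptions, not the actual ones in this pull request). pandas.read_hdf raises an ImportError if the tables/PyTables package is missing:

import pandas as pd

# pandas delegates HDF5 reading to PyTables, which is published as
# "tables" on PyPI. The file name and key here are hypothetical.
facies = pd.read_hdf("mcmurray_facies.h5", key="data")
print(facies.head())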

Fixes #

Reminders

Run make format and make check to make sure the code follows the style guide.

Do not merge this pull request yet. I'm mostly opening it for visibility and to avoid problems if the merge button gets pressed.

welcome[bot] commented 5 years ago

💖 Thanks for opening this pull request! 💖

Please make sure you read our Contributing Guide and abide by our Code of Conduct.

A few things to keep in mind:

JustinGOSSES commented 5 years ago

To load .h5 files into pandas, I need the tables package (as it is called on PyPI), also known as pytables (as it is called on conda). In your build pipeline, you install the requirements.txt packages with Miniconda. This creates a problem, as it tries to install tables via conda.

Normally, I think of requirements.txt as being for pip and environment.yml as being for conda. What would you suggest so that your build passes, but someone can also just install requirements.txt with pip and not have to always use conda? Or do you require people to use conda?
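To make the naming mismatch concrete, the same package is installed under two different names depending on the package manager:

pip install tables        # name on PyPI
conda install pytables    # name on conda / conda-forge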

Installing dependencies
===============================================
Capturing dependencies from requirements.txt
Capturing dependencies from requirements-dev.txt
Installing collected dependencies:
pooch>=0.5
xarray
pandas
rasterio
tables
matplotlib
cmocean
cartopy
pytest
pytest-cov
coverage
pylint
flake8
sphinx==1.8.5
sphinx_rtd_theme
sphinx-gallery
numpydoc
twine
codecov
Collecting package metadata: ...working... done
Solving environment: ...working... failed

PackagesNotFoundError: The following packages are not available from current channels:

  - tables

JustinGOSSES commented 5 years ago

I seem to be blocked by the fact that loading HDF5 files into pandas requires tables/PyTables as a dependency. It is called tables on PyPI and pytables on conda.

This is causing a problem because you install requirements.txt using both pip and conda in the build pipelines.

leouieda commented 5 years ago

@JustinGOSSES sorry for the delay. These differences between conda-forge and PyPI have caused other headaches in the past. We can't just use the environment.yml file because it sets the Python version and we want to test multiple versions on CI. We can't use pip on CI because of troublesome dependencies. So we're in a bit of a bind.

Currently:

This last one is gonna cause problems because I started reading in the requirements from requirements.txt to avoid duplication. Clearly this isn't going to work anymore.
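The pattern being described looks roughly like this (a sketch of the common approach; the actual setup.py may differ in detail). Since install_requires comes straight from requirements.txt, any conda-only package name listed there breaks a pip install, and vice versa:

from setuptools import setup

# Sketch: read install_requires from requirements.txt so the dependency
# list isn't duplicated. The name argument below is illustrative.
with open("requirements.txt") as f:
    install_requires = f.read().strip().split("\n")

setup(
    name="rockhound",
    install_requires=install_requires,
)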

I can see 2 ways forward:

  1. List pip dependencies explicitly in setup.py and conda dependencies in requirements.txt
  2. Not use HDF5 to store the data and avoid the extra dependency on PyTables

So my question is: do we really need to have the data in HDF5? Is it much smaller than an xz-compressed CSV? Is there another binary format that would avoid those extra dependencies?

JustinGOSSES commented 5 years ago

Thanks for the explanation. I'll try switching it to an xz-compressed CSV.
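For what it's worth, the xz route keeps loading within plain pandas, which infers the compression from the ".xz" suffix (a minimal sketch; the data and file name below are placeholders):

import pandas as pd

# Hypothetical placeholder data and file name, for illustration only.
facies = pd.DataFrame({"depth_m": [100.0, 100.5], "facies": ["sand", "shale"]})

# No PyTables needed: compression is inferred from the file extension.
facies.to_csv("mcmurray_facies.csv.xz", index=False)
loaded = pd.read_csv("mcmurray_facies.csv.xz")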

JustinGOSSES commented 5 years ago

Switched, and it passes the checks!

JustinGOSSES commented 5 years ago

Responded to suggestions that are not inline comments:

Edited docstrings for clarity for users not familiar with well logs.