fatiando / rockhound

NOTICE: This library is no longer being developed. Use Ensaio instead (https://www.fatiando.org/ensaio). -- Download geophysical models/datasets and load them in Python
BSD 3-Clause "New" or "Revised" License

McMurray mess #46

Closed. JustinGOSSES closed this pull request 2 years ago.

JustinGOSSES commented 5 years ago

name: preprocessed McMurray facies dataset
about: requesting the addition of this as a new dataset

This is a dataset with facies and well log curve data from the McMurray and Wabiskaw formations in Alberta, Canada. More information about the processed dataset's history can be found here, and information about the original dataset here.


Desired dataset/model:


This is to add code for a facies dataset from the McMurray formation in Alberta, Canada.

I'll do some more work on this pull request to double-check format and style, as the make format command wasn't able to find Black.

I also need to think about tests.

In requirements.txt I had to add tables as a dependency, since I use a zipped HDF5 (.h5) file and pandas needs tables to open that type of file.
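For reference, loading an HDF5 file with pandas looks roughly like this (a minimal sketch; the file name and key below are placeholder assumptions, not the actual ones in this pull request). pandas.read_hdf raises an ImportError if the tables/PyTables package is missing:

import pandas as pd

# pandas delegates HDF5 reading to PyTables, which is published as
# "tables" on PyPI. The file name and key here are hypothetical.
facies = pd.read_hdf("mcmurray_facies.h5", key="data")
print(facies.head())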

Fixes #

Reminders

Run make format and make check to make sure the code follows the style guide.

Do not merge this pull request yet. I'm mostly opening it for visibility and to avoid problems if the merge button gets pressed.

welcome[bot] commented 5 years ago

💖 Thanks for opening this pull request! 💖

Please make sure you read our Contributing Guide and abide by our Code of Conduct.

A few things to keep in mind:

JustinGOSSES commented 5 years ago

To load .h5 files into pandas, I need the tables package (as it is called on PyPI), also known as pytables (as it is called on conda). In your build pipeline, you install the requirements.txt packages with Miniconda. This creates a problem, as it tries to install tables via conda.

Normally, I think of requirements.txt as being for pip and environment.yml as being for conda. What would you suggest so that your build passes, but someone can also just install requirements.txt with pip and not have to always use conda? Or do you require people to use conda?
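To make the naming mismatch concrete, the same package is installed under two different names depending on the package manager:

pip install tables        # name on PyPI
conda install pytables    # name on conda / conda-forge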

Installing dependencies
===============================================
Capturing dependencies from requirements.txt
Capturing dependencies from requirements-dev.txt
Installing collected dependencies:
pooch>=0.5
xarray
pandas
rasterio
tables
matplotlib
cmocean
cartopy
pytest
pytest-cov
coverage
pylint
flake8
sphinx==1.8.5
sphinx_rtd_theme
sphinx-gallery
numpydoc
twine
codecov
Collecting package metadata: ...working... done
Solving environment: ...working... failed

PackagesNotFoundError: The following packages are not available from current channels:

  - tables

JustinGOSSES commented 5 years ago

I seem to be blocked by the fact that loading HDF5 files into pandas requires tables/PyTables as a dependency. It is called tables on PyPI and pytables on conda.

This is causing a problem because you install requirements.txt using both pip and conda in the build pipelines.

leouieda commented 5 years ago

@JustinGOSSES sorry for the delay. These differences between conda-forge and PyPI have caused other headaches in the past. We can't just use the environment.yml file because it sets the Python version and we want to test multiple versions on CI. We can't use pip on CI because of troublesome dependencies. So we're in a bit of a bind.

Currently:

This last one is gonna cause problems because I started reading in the requirements from requirements.txt to avoid duplication. Clearly this isn't going to work anymore.
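The pattern being described looks roughly like this (a sketch of the common approach; the actual setup.py may differ in detail). Since install_requires comes straight from requirements.txt, any conda-only package name listed there breaks a pip install, and vice versa:

from setuptools import setup

# Sketch: read install_requires from requirements.txt so the dependency
# list isn't duplicated. The name argument below is illustrative.
with open("requirements.txt") as f:
    install_requires = f.read().strip().split("\n")

setup(
    name="rockhound",
    install_requires=install_requires,
)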

I can see 2 ways forward:

  1. List pip dependencies explicitly in setup.py and conda dependencies in requirements.txt
  2. Not use HDF5 to store the data and avoid the extra dependency on PyTables

So my question is: do we really need to have the data in HDF5? Is it much smaller than an xz-compressed CSV? Is there another binary format that would avoid those extra dependencies?

JustinGOSSES commented 5 years ago

Thanks for the explanation. I'll try switching it to an xz-compressed CSV.
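For what it's worth, the xz route keeps loading within plain pandas, which infers the compression from the ".xz" suffix (a minimal sketch; the data and file name below are placeholders):

import pandas as pd

# Hypothetical placeholder data and file name, for illustration only.
facies = pd.DataFrame({"depth_m": [100.0, 100.5], "facies": ["sand", "shale"]})

# No PyTables needed: compression is inferred from the file extension.
facies.to_csv("mcmurray_facies.csv.xz", index=False)
loaded = pd.read_csv("mcmurray_facies.csv.xz")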

JustinGOSSES commented 5 years ago

Switched, and it passes the checks!

JustinGOSSES commented 5 years ago

Responded to suggestions that are not inline comments:

Edited docstrings for clarity for users not familiar with well logs.