fatiando / rockhound

NOTICE: This library is no longer being developed. Use Ensaio instead (https://www.fatiando.org/ensaio). -- Download geophysical models/datasets and load them in Python
BSD 3-Clause "New" or "Revised" License

Add Manville well log dataset #42

Closed JustinGOSSES closed 2 years ago

JustinGOSSES commented 5 years ago

Description of the desired feature Suggest adding the Mannville Group dataset of 2000+ well logs, tops, and facies labels from Alberta Geological Survey. It also has metadata and a report. This dataset has been used in multiple hackathons. https://ags.aer.ca/publications/SPE_006.html

Different people with different goals might want to load different parts of that dataset. Here are some examples:

1. Be able to load all data (excluding metadata and report) into a pandas dataframe, for whatever purposes people might have.
2. Be able to load data that serves as a starter dataset for predicting tops. Different people might have different required curves or different tops, but there should be a starter dataset for that class of problem. For example, Top McMurray as the top and required curves `['ILD', 'NPHI', 'GR', 'DPHI', 'DEPT']`.
3. Be able to load all wells with facies + specific well log curves into a pandas dataframe for facies prediction. Required well log curves might be `['ILD', 'NPHI', 'GR', 'DPHI', 'DEPT']`.
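As a minimal sketch, task-specific loading could boil down to selecting a required curve set per use case. The function and dictionary names below are hypothetical, not an existing Rockhound API:

```python
import pandas as pd

# Hypothetical required-curve sets per task (curve names follow the
# issue text; this is not an actual Rockhound interface).
REQUIRED_CURVES = {
    "all": None,  # no filtering: return every column
    "tops": ["ILD", "NPHI", "GR", "DPHI", "DEPT"],
    "facies": ["ILD", "NPHI", "GR", "DPHI", "DEPT", "Facies"],
}

def select_curves(df, task="all"):
    """Return only the well log curves needed for a given task."""
    curves = REQUIRED_CURVES[task]
    if curves is None:
        return df
    # Keep only the curves that are actually present in the data.
    present = [c for c in curves if c in df.columns]
    return df[present]

# Tiny stand-in for the merged Mannville dataframe.
sample = pd.DataFrame({
    "ILD": [1.0], "NPHI": [0.3], "GR": [75.0],
    "DPHI": [0.2], "DEPT": [500.0], "Facies": [3], "CALI": [8.5],
})
print(list(select_curves(sample, task="tops").columns))
```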

Are you willing to help implement and maintain this feature?

Yes. Some useful starting points:

I'd like to make the data loading for Predictatops integrated with Pooch & Rockhound.

welcome[bot] commented 5 years ago

👋 Thanks for opening your first issue here! Please make sure you filled out the template with as much detail as possible.

You might also want to take a look at our Contributing Guide and Code of Conduct.

santisoler commented 5 years ago

Hi @JustinGOSSES! Thanks for opening this issue!

I've no experience with this dataset, but any collection that would be used by other students or researchers is very welcome! Would you like to start writing the code to fetch and load the data and open a Pull Request? Once it's open, we can review it there.

Please read the Contributing Guide if you don't know how to open a PR, and to check what requirements your code has to meet.

leouieda commented 5 years ago

@JustinGOSSES since there seems to be a lot of data cleaning, it would be best to have a processor that stores a cleaned copy of the data locally after download (if the data volume is not huge and the cleaning is somewhat time consuming)
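A hedged sketch of what such a processor could look like with Pooch, which calls processors as `processor(fname, action, pooch)` after fetching a file (`action` is `"download"`, `"update"`, or `"fetch"` for a cache hit). The cleaning step here, dropping rows with missing values, is only a placeholder for the real merging/cleaning:

```python
import os
import pandas as pd

def clean_mannville(fname, action, pup):
    """Hypothetical Pooch-style processor: after a fresh download, write a
    cleaned copy next to the raw file and return its path instead, so the
    expensive cleaning only runs once."""
    cleaned = fname + ".cleaned.csv"
    if action in ("download", "update") or not os.path.exists(cleaned):
        df = pd.read_csv(fname)
        # Stand-in for the real cleaning: drop rows with missing values.
        df.dropna().to_csv(cleaned, index=False)
    return cleaned
```

On later calls Pooch passes `action="fetch"`, so the cached cleaned copy is returned without re-reading the raw file.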

JustinGOSSES commented 5 years ago

I imagine people would appreciate both ways: everything, and just the relevant material for a specific purpose.

Looking over the code, it looks like fetch_prem would be a good model to base things on, as it works with 1D data and pandas, which is the way this would go.

For letting users pick what they load, looks like bedmap.py and its various datasets in dict format is what I should follow as an example?

```python
# Load the ice thickness grid
bedmap = rh.fetch_bedmap2(datasets=["thickness"])
```
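For reference, a stripped-down sketch of that `datasets` dict pattern, with the download and loading steps omitted; the dataset and file names below are made up for illustration:

```python
# Hypothetical bedmap.py-style mapping from selectable dataset names
# to their files in the download registry.
DATASETS = {
    "logs": "mannville_logs.csv",
    "tops": "mannville_tops.csv",
    "facies": "mannville_facies.csv",
}

def fetch_mannville(datasets):
    """Validate the requested dataset names, as fetch_bedmap2 does,
    then return the files that would be fetched and loaded."""
    for name in datasets:
        if name not in DATASETS:
            raise ValueError(f"Invalid dataset name: {name!r}")
    return [DATASETS[name] for name in datasets]

print(fetch_mannville(["tops"]))
```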

JustinGOSSES commented 5 years ago

The code to convert the original data into a pandas dataframe will take a while to run, as there are 2000+ original files plus several files that must be merged. That might be awkward for the end-user?

What is your Rockhound philosophy?

  1. Would you prefer all the data be downloaded in original files AND then turned into a single dataframe even if that takes a while?

  2. Would you prefer the end-user download these two things in separate steps:

    • only a pre-processed HDF5 file that can be loaded into a pandas dataframe quickly
    • all the data in original file formats?

leouieda commented 5 years ago

> its various datasets in dict format is what I should follow as an example?

Yep, that is probably the best approach for this as well.

> What is your Rockhound philosophy?

We generally don't want to host any data ourselves, just download it from the original source. But you could publish a pre-processed version of the dataset in a more convenient format somewhere like figshare or Zenodo (license permitting) and we could fetch that instead.

Is the bottleneck the download or the merging/reading of the data?

JustinGOSSES commented 5 years ago

Bottleneck is the merging/reading.

Sounds like the best thing to do is to create a totally pre-processed, ready-to-go dataframe, save it as a .h5 file somewhere nice, then have Rockhound load that into a pandas dataframe, with clear metadata on how to get the original data somewhere obvious in Rockhound.
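As a sketch of that load step, assuming the pre-processed file is an HDF5 table readable by pandas (the file and key names are made up, and `to_hdf`/`read_hdf` require the PyTables package):

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in for the merged, cleaned well log dataframe.
df = pd.DataFrame({"DEPT": [500.0, 500.5], "GR": [75.0, 80.2]})

with tempfile.TemporaryDirectory() as cache:
    h5_path = os.path.join(cache, "mannville_wells.h5")
    df.to_hdf(h5_path, key="wells", mode="w")   # what the publishing step does
    loaded = pd.read_hdf(h5_path, key="wells")  # what Rockhound would do

print(loaded.shape)
```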

leouieda commented 5 years ago

That sounds like a plan. It would be best if we can minimize the download and read time here.

JustinGOSSES commented 5 years ago

FYI: I will pull the preprocessed datasets out of a new repo called McMurray-Wabiskaw-preprocessed-datasets. I may move them to something like Zenodo, as suggested, once they are stable. For now I want them someplace I can change easily without creating additional DOIs.

leouieda commented 5 years ago

:+1: