Legislation requires exploration companies operating in South Australia to submit reports and data on their exploration activities. The South Australian Government holds approximately 8000 digital exploration 'envelopes', or reports, on mineral and petroleum exploration and activities dating back to the 1950s.
These data are all freely available via the SARIG web portal. In addition to the raw reports, the Geological Survey of South Australia (GSSA) has indexed these reports and provides a CSV dataset that includes the envelope number, the tenement number associated with the report, a broad subject, and a short summary or abstract of the report (see the raw data file).
There are 5894 abstracts provided in this dataset. In these notebooks I present a way to use natural language processing (NLP) techniques to clean up the dataset and apply Latent Dirichlet Allocation (LDA) topic modelling to identify the main 'topics' discussed in the exploration report summaries. Once the main topics were identified, I used the associated tenement numbers and their spatial boundaries (available as geodatabase or shapefiles from SARIG) to display the spatial distribution of the topics across South Australia.
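To make the clean-then-model workflow concrete, here is a minimal, dependency-free sketch. It is not the code used in the notebooks (which may rely on a dedicated NLP library); it only illustrates the idea of tokenising abstracts and fitting LDA, here with a small collapsed Gibbs sampler. The function names `tokenize` and `fit_lda`, the stop-word list, the hyperparameters and the sample abstracts are all assumptions made for illustration.

```python
import random
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "and", "for", "was", "were", "with", "from", "after"}


def tokenize(text):
    """Lowercase, keep alphabetic runs of 3+ letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]{3,}", text.lower()) if t not in STOP_WORDS]


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Made-up abstracts, for illustration only.
abstracts = [
    "Drilling intersected copper and gold mineralisation in the basement.",
    "Copper gold drilling results from basement targets.",
    "Seismic survey acquired for petroleum well planning in the basin.",
    "Petroleum well drilled after seismic interpretation of the basin.",
]
docs = [tokenize(a) for a in abstracts]
doc_topic, topic_word = fit_lda(docs, n_topics=2)
for k, words in enumerate(topic_word):
    print(k, [w for w, _ in words.most_common(4)])
```

Each row of `doc_topic` counts how many tokens of an abstract were assigned to each topic, so the largest entry in a row gives that abstract's dominant topic. For a corpus of several thousand abstracts, a library implementation of LDA would be a more practical choice than this sampler.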
You can find links to the two blog posts that work through and discuss the results of these two notebooks below:
The results suggest the state's exploration record can be defined by 8 major topics:
The distribution of these topics across the state demonstrates which regions are prospective for different types of commodities, or where exploration has been concentrated because of other factors such as infrastructure or 'safe' brownfields exploration targets.
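As a rough illustration of how per-abstract topics could be summarised by tenement before mapping, the sketch below sums topic weights for each tenement number and keeps the highest-weighted topic. The function name `dominant_topic_by_tenement`, the tenement identifiers and the weights are hypothetical; this is one possible aggregation, not necessarily the one used in the notebooks.

```python
def dominant_topic_by_tenement(records):
    """Sum topic weights per tenement and return each tenement's top topic.

    `records` is an iterable of (tenement_id, topic_weights) pairs,
    one pair per abstract.
    """
    totals = {}
    for tenement_id, weights in records:
        running = totals.setdefault(tenement_id, [0.0] * len(weights))
        for k, w in enumerate(weights):
            running[k] += w
    return {
        tenement_id: max(range(len(running)), key=running.__getitem__)
        for tenement_id, running in totals.items()
    }


# Hypothetical tenement identifiers and topic weights, for illustration only.
records = [
    ("EL 0001", [0.8, 0.2]),
    ("EL 0001", [0.6, 0.4]),
    ("EL 0002", [0.1, 0.9]),
]
print(dominant_topic_by_tenement(records))  # {'EL 0001': 0, 'EL 0002': 1}
```

The resulting tenement-to-topic table can then be joined to the tenement boundary polygons on the tenement number to colour each tenement by its dominant topic.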
While these results may not be all that surprising, they demonstrate some of the potential information stored in unstructured, text-based company data and some of the ways to begin unlocking that knowledge.
Two conda environments are required: one for the NLP topic modelling and one for the geospatial data analysis.
conda env create --file NLP-env.yml
conda env create --file spatial-env.yml
├── README.md
├── requirements_nlp.txt
├── requirements_spatial.txt
├── notebooks
│ ├── Abstract LDA
│ │ ├── Figures
│ │ ├── Model
│ │ ├── Abstract_LDA_topic_analysis.ipynb
│ │ └── Abstract_LDA_topic_modelling.ipynb
│ ├── helper <--- helper functions
│ ├── create_env_dataset.ipynb <--- creating the dataset
│ └── text_preprocessing.ipynb <--- text preprocessing experiments
└── data
  ├── Processed <--- processed modeled topics
  └── Raw <--- input abstracts and metadata