RADutchie / SA-exploration-topic-modelling


Concept tagging #2

Open RichardScottOZ opened 3 years ago

RichardScottOZ commented 3 years ago

That model was trained on abstracts, so I should run over the dataset here to see how the keywords/topics compare.

RADutchie commented 3 years ago

Yes, it was. I have also run a similar model on the full OCR documents, but ran into issues with OCR transcription spelling mistakes, so I decided to just use the abstracts as a proxy.

RichardScottOZ commented 3 years ago

Yes, that has to be dealt with, unfortunately. It takes a lot of work to get the old ones into shape.

RichardScottOZ commented 3 years ago

Just pulling one out of your 'processed' as an example

Preliminary review of oil and gas possibilities of Saint Vincent Gulf graben and adjacent.,1961,Hard Copy Digital,Petroleum exploration;Geophysics,Earlier draft of report in Env 8. Differs by containing appendix of logs or previously drilled bores.,OEL00024,include geological log follow bore croydon bore hd yatala pethick bore oakland hd adelaide secn hd dublin secn hd grace secn inkerman balaklava coal bore hd inkerman secn bear ia govt minlaton stratigraphic bear hd ramsay secn minlaton township bore hd ramsay secn peninsula oil bore peezie swamp hd moorowie secn kingscote bore hd menzie
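For anyone wanting to load a record like this one, here is a minimal sketch using Python's standard `csv` module. The column names are hypothetical labels guessed from the seven comma-separated fields above (they are not taken from the repository), and the last field is shortened for brevity.

```python
import csv
import io

# Hypothetical column labels for the seven fields in the record above.
COLUMNS = ["title", "year", "format", "keywords", "abstract", "identifier", "processed_text"]

# The example record, with the final (processed text) field shortened for brevity.
record = (
    "Preliminary review of oil and gas possibilities of Saint Vincent Gulf graben and adjacent.,"
    "1961,Hard Copy Digital,Petroleum exploration;Geophysics,"
    "Earlier draft of report in Env 8. Differs by containing appendix of logs or previously drilled bores.,"
    "OEL00024,include geological log follow bore croydon bore hd yatala pethick bore"
)


def parse_record(line):
    """Split one CSV line into a dict and break the keyword field on ';'."""
    values = next(csv.reader(io.StringIO(line)))
    row = dict(zip(COLUMNS, values))
    row["keywords"] = [k.strip() for k in row["keywords"].split(";") if k.strip()]
    return row


row = parse_record(record)
print(row["identifier"], row["year"], row["keywords"])
```

Using `csv.reader` rather than `str.split(",")` means the sketch would still work for records whose fields are quoted because they contain commas.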

{ "code": 200, "interface_version": "2.0.0", "messages": [], "payload": { "features": [ { "Copy": "PROPN", "Copy Digital": "NOUN_CHUNK", "Digital": "PROPN", "Env": "PROPN", "Gulf": "PROPN", "Petroleum": "PROPN", "Petroleum exploration;Geophysics": "NOUN_CHUNK", "Saint": "PROPN", "Saint Vincent Gulf": "ENT", "Saint Vincent Gulf graben": "NOUN_CHUNK", "Vincent": "PROPN", "adelaide": "PROPN", "appendix": "PROPN", "balaklava": "PROPN", "bear": "PROPN", "bore": "PROPN", "coal": "PROPN", "croydon": "PROPN", "differ": "NOUN", "draft": "NOUN", "dublin": "PROPN", "dublin secn": "ENT", "early draft": "NOUN_CHUNK", "exploration;Geophysics": "PROPN", "follow": "NOUN", "gas": "NOUN", "geological log follow bore croydon bore": "NOUN_CHUNK", "govt": "PROPN", "govt minlaton": "ENT", "graben": "PROPN", "grace": "PROPN", "grace secn": "ENT", "hd": "PROPN", "hd dublin secn": "NOUN_CHUNK", "hd grace secn inkerman balaklava coal bore": "NOUN_CHUNK", "hd inkerman secn": "NOUN_CHUNK", "ia": "PROPN", "ia govt minlaton stratigraphic bear hd": "NOUN_CHUNK", "inkerman": "PROPN", "kingscote": "PROPN", "log": "PROPN", "logs": "PROPN", "menzie": "PROPN", "minlaton": "PROPN", "oakland": "PROPN", "oakland hd": "NOUN_CHUNK", "oil": "NOUN", "oil gas possibility": "NOUN_CHUNK", "peezie": "PROPN", "peninsula": "PROPN", "pethick": "PROPN", "possibility": "NOUN", "preliminary review": "NOUN_CHUNK", "previously drill bore": "NOUN_CHUNK", "ramsay": "PROPN", "report": "NOUN", "review": "NOUN", "secn": "PROPN", "secn minlaton township bore": "NOUN_CHUNK", "secn peninsula oil bear peezie swamp hd moorowie secn kingscote": "NOUN_CHUNK", "stratigraphic": "PROPN", "swamp": "NOUN", "swamp hd moorowie secn": "ENT", "township": "PROPN", "yatala": "PROPN", "yatala pethick bore": "ENT" } ], "probability_threshold": 0.5, "request_id": "0", "sti_keywords": [ [ { "keyword": "coal", "probability": 0.7326683402061462, "unstemmed": "COAL" }, { "keyword": "oil", "probability": 0.6701714396476746, "unstemmed": "OILS" }, { 
"keyword": "coal utilization", "probability": 0.5211608409881592, "unstemmed": "COAL UTILIZATION" } ] ], "topic_probabilities": [ [ { "keyword": "geoscience", "probability": 0.5591679072502963, "unstemmed": "geosciences" }, { "keyword": "engineering", "probability": 0.2614608363329102, "unstemmed": "engineering" }, { "keyword": "chemistry and material", "probability": 0.15407505604337968, "unstemmed": "chemistry and materials" }, { "keyword": "space science", "probability": 0.13549560874769614, "unstemmed": "space sciences" }, { "keyword": "physic", "probability": 0.06558151010030258, "unstemmed": "physics" }, { "keyword": "general", "probability": 0.06452260144730994, "unstemmed": "general" }, { "keyword": "astronautic", "probability": 0.05947382919879488, "unstemmed": "astronautics" }, { "keyword": "social and information science", "probability": 0.025058771773934216, "unstemmed": "social and information sciences" }, { "keyword": "mathematical and computer science", "probability": 0.021955482109084806, "unstemmed": "mathematical and computer sciences" }, { "keyword": "life science", "probability": 0.016848774023474587, "unstemmed": "life sciences" }, { "keyword": "aeronautic", "probability": 0.014245884250750528, "unstemmed": "aeronautics" } ] ], "topic_threshold": 1 }, "service_version": "unspecified", "status": "okay" }

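A response like the one above can be reduced to a short keyword list and a single most likely topic. The sketch below assumes only the field names visible in that response (`payload`, `sti_keywords`, `topic_probabilities`, `keyword`, `probability`); the `summarise` helper and its `min_probability` cut-off are hypothetical names introduced here, and the embedded JSON is an abbreviated copy of the response rather than the full output.

```python
import json

# Abbreviated copy of the response above: only the fields used below are kept.
response_text = """
{
  "code": 200,
  "payload": {
    "probability_threshold": 0.5,
    "sti_keywords": [[
      {"keyword": "coal", "probability": 0.7326683402061462, "unstemmed": "COAL"},
      {"keyword": "oil", "probability": 0.6701714396476746, "unstemmed": "OILS"},
      {"keyword": "coal utilization", "probability": 0.5211608409881592, "unstemmed": "COAL UTILIZATION"}
    ]],
    "topic_probabilities": [[
      {"keyword": "geoscience", "probability": 0.5591679072502963, "unstemmed": "geosciences"},
      {"keyword": "engineering", "probability": 0.2614608363329102, "unstemmed": "engineering"}
    ]]
  },
  "status": "okay"
}
"""


def summarise(response, min_probability=0.6):
    """Return (keywords at or above min_probability, most probable topic) for the first document."""
    payload = response["payload"]
    keywords = [
        (k["keyword"], round(k["probability"], 3))
        for k in payload["sti_keywords"][0]
        if k["probability"] >= min_probability
    ]
    top_topic = max(payload["topic_probabilities"][0], key=lambda t: t["probability"])
    return keywords, top_topic["keyword"]


response = json.loads(response_text)
keywords, topic = summarise(response)
print(keywords)
print(topic)
```

With a cut-off of 0.6 this keeps "coal" and "oil" and drops "coal utilization", and the most probable topic is "geoscience". The keyword lists are indexed with `[0]` because the response nests them one level per submitted document.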
RichardScottOZ commented 3 years ago

I have an internal version, but it looks like NASA's example EC2 machine is still up, too:

http://ec2-100-25-26-114.compute-1.amazonaws.com:5001/

RADutchie commented 3 years ago

I'd forgotten about this, thanks for the reminder!

RichardScottOZ commented 3 years ago

I'll run your 'processed version' through ours and upload it.

RADutchie commented 3 years ago

My initial idea in playing around with topic modelling was to automate the generation of keywords and topics, and to generate a summary of the exploration reports. But again, my biggest problem is the quality of the OCR; even for the newer digital documents there are a number of issues. I may resort to pulling the digital text straight out of the PDFs. This example of an API to do this is really cool.

RichardScottOZ commented 3 years ago

Something like this, based on your processed-normalised.

Concept-Tagging-KeyWords2.txt