NBMG-UNR / resourcespace-migration

Code and tasks related to migrating GBSSRL into a CMS
0 stars 1 forks source link

Mining District Files #1

Open emilyodean opened 4 years ago

emilyodean commented 4 years ago

Basic task: figure out how to batch upload the attached excel spreadsheet to ResourceSpace.

You'll start with the Excel spreadsheet, and should end with whatever format that ResourceSpace needs to ingest. Do this in Python - so your script should start by loading in the attached excel file and should end with spitting out a CSV (not excel) that is formatted so that it can be immediately loaded into ResourceSpace. mining_district_files_12202019.xlsx

Note that whenever you do upload code associated with a ticket like this one, you can tag the ticket number in your commit comment.

emilyodean commented 4 years ago

There are several tricky things about this, and it's going to require quite a lot of data cleaning.

Metadata notes -

  1. [district] should be a controlled list. See attached for allowed values. district_info_01212020.xlsx
  2. [county] should be presented as a multiple choice picklist.
  3. [commodity] needs to be heavily QA'ed, and should also be presented as a multiple choice picklist. My guess is that there are a lot of typos here.
  4. [physical_location] should also be a controlled value list. Find unique values from this list to see what the values should be. Make sure there aren't typos and that every entry fits into the controlled value list.

Metadata fields that are uploaded to ResourceSpace will need to match fields that already exist in the program. So, first you will need to go into the admin panel and create these fields. For now, do this on the trial website that you created just so we can get the formatting of the csv correct and work through the process. Here are the metadata mappings from the spreadsheet to the fields and field types that you will need to make in resourcespace - md_field_map.xlsx.

Part of this will also be testing to see how ResourceSpace associates digital objects that we upload later with the metadata indices that we upload. It looks like they require the original file name as a field in the CSV. So, a good test would be to upload a small test CSV with a file name, then upload that file after the fact and make sure that the object associates with the metadata from the CSV.

emilyodean commented 4 years ago

@lweeks20 - the first thing I'd like you to do tomorrow is work on reconciling the commodities in the mining_district_files master spreadsheet (attached in the first comment) with this master commodities list here - nv_commodities.xlsx.

We want to make sure that each individual commodity listed under the commodities column corresponds with a controlled value, which we will obtain from the master commodities list. Your task is to identify which commodities in the mining district files are typos or are actually real commodities that need to be added to the commodities list.

The deliverable for this should be two updated excel spreadsheets and a python file- an updated mining district file spreadsheet where every commodity listed matches a commodity in the commodities master spreadsheet, an updated commodities master spreadsheet that lists unique commodities in the state, and a python script that you used to get both of these from point A to point B.

I've written a short example Python script to go ahead and extract the commodities that are different across the two spreadsheets, you can find it here: https://github.com/NBMG-UNR/resourcespace-migration/blob/master/clean_md_commodities.py. Feel free to make a branch to edit that code, work on your remote branch, and then submit a pull request to merge back into the master file.

Take your time to understand the task at hand and understand github. I don't expect this to be done in a day.

emilyodean commented 4 years ago

@lweeks20 - the next part of cleaning this dataset is to make sure that all of the mining districts and counties are valid. The master list for that is attached in the second comment - the one called district_info. It'll be a very similar process to the commodities and very similar code (though definitely simpler).

Also change all counties and mining districts to Proper Case (aka title case or camel case). I think there's a pandas function that you can just run on a column for this (md_files.district.title() or something similar).

emilyodean commented 4 years ago

district= district.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

emilyodean commented 4 years ago

@lweeks20 please also attach the updated excel spreadsheets for this ticket - should have a new mining district files spreadsheet and a new commodities spreadsheet. Don't overwrite the old ones - just update with today's date. Thanks!

lweeks20 commented 4 years ago

Okay sounds good! Do you want .xlsx or .csv for this and the general geology? @emilyodean

emilyodean commented 4 years ago

@lweeks20 Print to excel from the python code, and then we can convert to csv later if needed. Thanks!

lweeks20 commented 4 years ago

nv_commidities_02112020.xlsx minning_district_files_02112020_UPDATED.xlsx They're finally done! @emilyodean

emilyodean commented 4 years ago

@lweeks20 I just ran some quick QA on these (using the same code that we used to compare the commodities in the master list to the commodities in the mining district files the first time) and there are still over 600 commodities with misspellings. Did you attach the correct files here? I'm attaching the list here of all the misspellings that are still in the mining_district_files spreadsheet that you attached. incorrect_commodities.xlsx

lweeks20 commented 4 years ago

@emilyodean Looking through this list it looks like there are a lot of correct spellings just wrong formatting. However, there are some minerals in the list which are correct in spelling and are also correct in the format which should be in the commodities list (for example lithium, molybdenum, and bentonite are on this list). I'll review the commodities list to make sure it is correct as well as correct the formatting of the majority of the values from the xlsx sheet you just posted.

lweeks20 commented 4 years ago

@emilyodean Also do you want me to figure this out now or mix it in with the file up loading for Rosebud?

emilyodean commented 4 years ago

@lweeks20 please prioritize this - I need these to be corrected in order to move on with some of my work. Thanks!

lweeks20 commented 4 years ago

@emilyodean Okay perfect I'm on it

emilyodean commented 4 years ago

@lweeks20 can you give me an update on this and upload your most recent code? Please leave a comment and upload code every work day. Thanks!

lweeks20 commented 4 years ago

Sorry about that! It's going pretty good so far, there are a lot of "misspellings" where either the formatting is wrong or it isn't included in the commodities list. I've also found that in the mining district xlsx there are mining claims which are in the commodities column. So I've made a running sheet of all these discrepancies from "incorrect_commodities". As I've gone on I've deleted misspellings I've corrected and highlighted ones that need more attention: Green= present in nv_commodities Red= unclear commodity Yellow= commodities which need to be added Blue= mining claim purple= need more information @emilyodean running incorrect commodities.docx

lweeks20 commented 4 years ago

@emilyodean Things are coming along smoothly I'm still focusing on the misspelling of the commodities within the district files, there are so many it takes my computer about a full minute to run the code. I'm having problems with things such as "zinc (?)" or "(copper" because the code is thinking that the parenthesis are not apart of the value that is to be changed. I tried putting (\?) to get it to run but that then doesn't correct what needs to be fixed.

emilyodean commented 4 years ago

@lweeks20 As discussed over email, please just use this spreadsheet to correct the commodity misspellings. In the 3rd column next to the misspelling, provide the correct spelling. I'll then turn this into code. Keep another spreadsheet as well for any items that need to be updated in the master commodities list. Please attach both spreadsheets to this ticket at the end of the day (even if incomplete). incorrect_commodities_04142020.xlsx

lweeks20 commented 4 years ago

incorrect_commodities_04142020.xlsx @emilyodean so here's the edited list. Green= present in master list Yellow= needs to be added to list Red= needs to be removed from mining district files

lweeks20 commented 4 years ago

If you want me to I can find all the miscellaneous info for the yellow commodities to be added to the master list I can do that too

emilyodean commented 4 years ago

@lweeks20 Thank you! Can you please make this into 4 separate spreadsheets, one for each color (white, green, yellow, red)? I need this to be machine readable to translate into code.

lweeks20 commented 4 years ago

@emilyodean would you like different sheets on the same .xlsx file or multiple .xlsx files

emilyodean commented 4 years ago

@lweeks20 multiple would be great. Also please do a little QA on all of these classifications - just glancing through I noticed that "gysum" is highlighted in green but should be "gypsum," and others with some extraneous semicolons should be updated (e.g. "diatomite;" should be "diatomite").

emilyodean commented 4 years ago

@lweeks20 also if you have an updated commodities master list and could attach that to this ticket that would be great.

lweeks20 commented 4 years ago

@emilyodean I do not have an updated list because I was having the same problem as the district list. Here are the separate .xlsx files Remove.xlsx Need to be added.xlsx misspellings.xlsx Correct spellings.xlsx

emilyodean commented 4 years ago

@lweeks20 great, thanks! I'll start working on this and you can continue working on the mining district file metadata entry.