MFEh2o / db

**Contains the main issue tracker for the MFE DB!** Functions for interacting with the MFE database, in script format. (See also MFEUtilities, which is an R package that includes many/most of the same functions).

METADATA problems #50

Closed kaijagahm closed 4 years ago

kaijagahm commented 4 years ago

I've noticed some missing metadata files, or rows in METADATA that have a metadataID but aren't fully filled in. Going to list them here.

Can't find listed document

  1. metadataID "blgMorphologyDOC.20181114": can't find the document "blgMorphologyDOC.20181114.docx". I did find a document called "blgMorphologyDOC.20180626.docx"--is this what it's supposed to be?

  2. The following files can't be found in Box: Rhodamine.CR.20130701.docx, Rhodamine.CR.201407.docx, Rhodamine.EL.20130613.docx, Rhodamine.EL.20140725.docx, Rhodamine.HB.20130627.docx, Rhodamine.HB.20140725.docx, Rhodamine.MO.20130627.docx, Rhodamine.MO.201407.docx, Rhodamine.WL.20130730.docx, Rhodamine.WL.20140728.docx.

  3. Can't find the file "ZoopCounts.20110517.docx". There are some other files called ZoopCounts with different dates, and there's one called ZoopSurv.Sample with this date.

Empty METADATA entry

Cases where the metadataID is included in METADATA, but there is no other information in the table except for an updateID: e.g. no description, no file name or path, etc.

  4. "dvm.survey.05122013": I assume this has to do with Patrick Kelly's DVM project (ID 16) which we are also missing information for. @joneslabND do you know anything about this?

  5. metadataID "fishScapesLimno.20190319"

  6. metadataID "Limno.pH.Sample.20190322": This metadataID is associated with project 35: pH sediment bottles. @Randinotte says to ask @brittnibertolet about this metadataID: maybe there are some files that are so recent that they haven't been put in Box yet.

  7. metadataIDs "Rhodamine.BA.Baseline", "Rhodamine.BO.Baseline", "Rhodamine.BR.Baseline", "Rhodamine.CB.Baseline", "Rhodamine.HB.20150608", "Rhodamine.MO.20150608", "Rhodamine.NG.20150609", "Rhodamine.NG.Baseline", and "Rhodamine.WA.Baseline" don't have any associated data in the table. Who was in charge of the Rhodamine project, and can we find associated metadata somewhere? @kaijagahm also check the Rhodamine table to see if there are any clues there.
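
As an illustration of the "empty entry" check described above, here is a minimal Python sketch. The column names (`description`, `fileName`, `filePath`) and the placeholder values are assumptions for the example, not the actual METADATA schema; only the metadataIDs come from this issue.

```python
# Hypothetical METADATA rows; real column names and values may differ.
metadata_rows = [
    {"metadataID": "dvm.survey.05122013", "description": None,
     "fileName": "", "filePath": None, "updateID": "placeholderUpdate"},
    {"metadataID": "DOC.20110601", "description": "placeholder description",
     "fileName": "DOC.20110601.docx", "filePath": "Box/MFE/Database/Metadata Files",
     "updateID": "placeholderUpdate"},
]


def find_empty_entries(rows, id_col="metadataID", ignore=("updateID",)):
    """Return the IDs of rows where every column other than the ID
    (and the ignored columns) is missing or blank."""
    empty = []
    for row in rows:
        others = [v for k, v in row.items() if k != id_col and k not in ignore]
        if all(v is None or str(v).strip() == "" for v in others):
            empty.append(row[id_col])
    return empty


empty_ids = find_empty_entries(metadata_rows)
print(empty_ids)
```

Ignoring `updateID` mirrors the description above: an entry counts as empty even when an updateID is present.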

Other

  8. metadataIDs "Invitrogen.SuperScriptIII.20100924", "MoBio.PS.DNA.20100924", "MoBio.PS.RNA.20100924", and "MoBio.PS.RNA+DNA.20100924" don't have associated files: descriptions say to see the manufacturer's instructions. Any reason we shouldn't upload those instructions to Box, for reference? [I could imagine one potential reason is that the instructions might change, have new versions, etc. and then we'd be on the hook to keep them up to date in Box].

  9. metadataID "Limno.Sample.20160505" has two different associated files: a field and a lab protocol. Is that the correct procedure--one metadataID, two different files? Or should there be two different metadataID's? If the former, then aren't there other cases where we could standardize back to one metadataID with two different files, to resolve the lab/field split?

  10. metadataID "DOC.20110601": the file DOC.20110601.docx doesn't come up when you search for it in Box, but if you dig into the Metadata Files folder, it is definitely there--on the 4th page. What's going on here? It should be searchable, no?

Easy fixes

These take first priority because all I need to do is add some text to METADATA--don't need extra info.

  11. metadataID "FishscapesSurvey.hotBassLap.20180607": the document name should be "FishscapesSurvey.hotBassLap.20180607.docx", not "FishscapesSurvey.hotBassLap.20180607.doc" (because it's .docx in Box). Note that the document name in Box also had a trailing space after the 7--I fixed that directly in Box.

  12. metadataID "MarkRecap.20120228" should link to the file "MarkRecap.20120228.docx", which is in Box under Metadata Files, but that data is missing from METADATA. Add it.

  13. zoop.horiz.tow.Sample.20120501.docx, zoop.tow.Sample.20120501.docx, ZoopProd.20120216.docx, and ZoopSurv.Sample.20110517.docx should have filepath "Box/MFE/Database/Metadata Files"
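
Mismatches like the .doc/.docx extension and the trailing space could be caught with a small comparison helper. This is only a sketch in Python: `suggest_fix` is a hypothetical function, and it assumes the Box file names have already been exported to a plain list.

```python
def stem(name):
    """Strip surrounding whitespace and a trailing .doc/.docx extension,
    then lowercase, so near-identical names compare equal."""
    name = name.strip()
    for ext in (".docx", ".doc"):
        if name.lower().endswith(ext):
            return name[: -len(ext)].strip().lower()
    return name.lower()


def suggest_fix(listed_name, box_names):
    """Return the Box file name that matches the name listed in METADATA
    once whitespace and the .doc/.docx difference are ignored, else None."""
    target = stem(listed_name)
    for name in box_names:
        if stem(name) == target:
            return name
    return None


box_names = ["FishscapesSurvey.hotBassLap.20180607.docx", "MarkRecap.20120228.docx"]
print(suggest_fix("FishscapesSurvey.hotBassLap.20180607.doc", box_names))
```

Only the two known document extensions are stripped, so the dots inside names such as zoop.horiz.tow.Sample.20120501 are left alone.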

kaijagahm commented 4 years ago

> Easy fixes
>
> These take first priority because all I need to do is add some text to METADATA--don't need extra info.
>
> 11. metadataID "FishscapesSurvey.hotBassLap.20180607": the document name should be "FishscapesSurvey.hotBassLap.20180607.docx", not "FishscapesSurvey.hotBassLap.20180607.doc" (because it's .docx in Box). Note that the document name in Box also had a trailing space after the 7--I fixed that directly in Box.
> 12. metadataID "MarkRecap.20120228" should link to the file "MarkRecap.20120228.docx", which is in Box under Metadata Files, but that data is missing from METADATA. Add it.
> 13. zoop.horiz.tow.Sample.20120501.docx, zoop.tow.Sample.20120501.docx, ZoopProd.20120216.docx, and ZoopSurv.Sample.20110517.docx should have filepath "Box/MFE/Database/Metadata Files"

Fixed 11-13.

ctsolomon commented 4 years ago

Re 1 - blgMorphology - Chelsea Bishop (or perhaps Alex Ross) will be the contact. We should make sure we know whether there were two versions of the metadata description (i.e. a 20180626 and a 20181114), and if so whether all data should get the more recent version or whether some data get the older version.

Re 2 - rhodamine - I would start with @joneslabND, although he might pass you to Jake Zwart.

Re 3 - zoop counts - Can you give me a quick summary or description of what data show up in database as being associated with "ZoopCounts.20110517.docx" document? If I know this and I look at the set of files you mention I may be able to sort this out, though I may have to pull in a couple other people.

Re 4 - Stuart is your person

Re 5 - fishScapesLimno - I think Alex led this sampling. You could ask Colin if he or anyone else who is still on the project helped out - if so, let's task them with this, if not we can bug Alex. I think the thing to do is to send whoever you task this to our main limno sampling metadata description, and maybe a couple other specialized project-specific ones if we have them, so the person has a model. Ask the person to write a metadata description that is either stand-alone or else one that refers to one of the other existing ones, but highlights different things (if any) that we did for FishScapes limno sampling.

Re 7 - rhodamine - again I would start with Stuart.

Re 8 - Stuart

Re 9 - field and lab limno files - I believe our intended procedure for this is to have two separate metadataIDs, one for field sampling and one for lab workup. Each ID would have its own associated file. Is this the way some of the early limno sampling (~2011) is set up? @joneslabND might also have opinions here.

Re 10 - Box searchability - I don't know anything about this. One would hope it would be searchable. There's not some minor character difference in the file name or something?

joneslabND commented 4 years ago

re 2 & 7 - rhodamine - please reach out to Jake Zwart. I will e-introduce you now.

re 3 & 4 - zoop counting and DVM - please reach out to Patrick Kelly

re 6 - yes please talk to Brittni Bertolet

re 8 - SuperScript documents - https://www.thermofisher.com/document-connect/document-connect.html?url=https%3A%2F%2Fassets.thermofisher.com%2FTFS-Assets%2FLSG%2Fmanuals%2FsuperscriptIII_man.pdf&title=U3VwZXJTY3JpcHQgSUlJIFJldmVyc2UgVHJhbnNjcmlwdGFzZQ==

PS.DNA documents - https://www.qiagen.com/us/resources/resourcedetail?id=5c00f8e4-c9f5-4544-94fa-653a5b2a6373&lang=en

PS.RNA documents - https://www.qiagen.com/us/resources/resourcedetail?id=cc44f2e0-52fd-4ed4-93ce-89692dbbfdb1&lang=en

PS.RNA+DNA documents - can't find this one, but you could try and google a bit more; this was an "add on" kit that allowed extraction of DNA along with the RNA extraction (PS.RNA) kit linked to above

re 9 - yes, we originally (and I thought always) had separate metadata and documents for field vs. lab limno work

re 10 - I'm not too worried about searchability, but this behavior is odd

kaijagahm commented 4 years ago

Updates, so I can keep track of these--

1: Emailed Chelsea.

2 and 7: Will email Jake Zwart after @joneslabND's intro.

3 and 4: Emailed Patrick.

5: Resolved! I found the file--not sure how I'd missed it before. Added information to METADATA.

6: Will ask Brittni in our meeting on Friday.

8: Adding Stuart's documents to Box.

9: Got it. Seems like the thing to do is to change these metadataID's to differentiate sampling vs. lab. I'm not familiar enough with how the sampling works to be immediately clear on how I should determine which samples/rows get which metadataID, but I will start by doing some digging in the database to see if I can figure it out. If I can't, I'll raise it at the Tuesday meeting.

10: No slight character differences. I think this is just weird behavior by Box--not a huge deal but a little concerning. Should keep this in mind if ever searching for files in the future: just because they don't come up in search doesn't mean they're not there.

kaijagahm commented 4 years ago

Found another problem:

  14. MinnowTrapSurvey.20120228 metadataID doesn't have any associated data [edit: no associated data in the METADATA table, e.g. no associated file or description]. Note that the file MarkRecap.20120228.docx, which is associated with MarkRecap.20120228 metadataID, has the same date and does have a section on minnow trapping. Is this also the doc that MinnowTrapSurvey.20120228 should refer to, or is there a different one?

It looks like this metadataID is associated with projectID 3, for which @joneslabND and @ctsolomon are project leads. Can you point me to a metadata file for this one?

kaijagahm commented 4 years ago

Current status:

1: Emailed Chelsea again to confirm that it was only the metadata writeup that was updated, not the protocol itself. Will check in on Tuesday about this to figure out how to handle it.

2 and 7: emailed Jake; waiting to hear back about Rhodamine files.

3 and 4: emailed Patrick; waiting to hear back.

6: asked Brittni; waiting for her to send me this metadata file.

8: added the manufacturer instruction documents to Box. Googled and could not find instructions for the RNA + DNA one either. Should we pursue this or does it not matter much?

9: Let's talk at the Tuesday meeting.

14: need contact person or info.

kaijagahm commented 4 years ago

Re 2 and 7: Jake responded. Need to check the Jones Lab computer for a lot of these files. He did give me a description to fill in for the ones that are only missing that. Filled in those descriptions.

kaijagahm commented 4 years ago

2 and 7: passed the email chain with Jake along to Randi. She'll talk to some people about searching for Rhodamine files in the UNDERC computers. Meanwhile, Stuart will check the Jones lab computers.

1: forwarding to Chris to make sure I'm doing the right thing w/r/t what Chelsea said about the different metadata versions.

3 and 4: still waiting on Patrick

6: still waiting on Brittni

8: further updated with a description from Stuart. Don't need to pursue further--the company no longer makes the DNA+RNA product in question.

9: Seems like in this case, and in several others where the metadataID has two rows and two associated files, those files could be condensed into one because it's a field protocol and a very basic lab protocol like filtration (which would always be combined with the field protocol). I'm going to search through both METADATA and Box to try to find all instances of those files, and condense them, checking with Stuart and Chris before doing the condensing to make sure I'm doing it right.
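
Finding every metadataID that has more than one row in METADATA amounts to a simple count. A minimal Python sketch, assuming the table has been read into (metadataID, file name) pairs; the file names for Limno.Sample.20160505 are placeholders, since the real ones aren't listed in this thread:

```python
from collections import Counter

# (metadataID, fileName) pairs; file names in angle brackets are placeholders.
rows = [
    ("Limno.Sample.20160505", "<field protocol file>"),
    ("Limno.Sample.20160505", "<lab protocol file>"),
    ("DOC.20110601", "DOC.20110601.docx"),
]


def multi_row_ids(rows):
    """Return, in sorted order, the metadataIDs that appear in more than one row."""
    counts = Counter(metadata_id for metadata_id, _ in rows)
    return sorted(mid for mid, n in counts.items() if n > 1)


print(multi_row_ids(rows))
```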

14: @ctsolomon @joneslabND I forgot to bring this up at the meeting because it was added a little later. Can you provide any clarity on the MinnowTrapSurvey.20120228 metadataID? Quoted below:

> 14. MinnowTrapSurvey.20120228 metadataID doesn't have any associated data. Note that the file MarkRecap.20120228.docx, which is associated with MarkRecap.20120228 metadataID, has the same date and does have a section on minnow trapping. Is this also the doc that MinnowTrapSurvey.20120228 should refer to, or is there a different one?
>
> It looks like this metadataID is associated with projectID 3, for which @joneslabND and @ctsolomon are project leads. Can you point me to a metadata file for this one?

kaijagahm commented 4 years ago

  9. Because this is a long and complicated one, I created this google doc, which links to and briefly describes each of the files related to the duplicated metadataID's. For each duplicated metadataID, I propose what I think we should do about it. @ctsolomon @joneslabND @Randinotte can you take a look at this and give me the go-ahead on those proposals, or make any changes?

ctsolomon commented 4 years ago

14: I suspect that "MinnowTrapSurvey.20120228" was created for Nikki Craig's fish diet survey in 2012. Based on the creation date (end of Feb) it would have been as she was planning the work. The actual metadata description for that work (DietSurvey.20120624, I believe - I found this in my archive of her dissertation files, in the folder for the chapter she did on fish feeding, and it also shows up with other metadata files in Box) says that fish were collected both by minnow trap and by seine. It might be that she initially planned in Feb to do the sampling just by minnow trap and then added seine during the field season - so that DietSurvey... would be the only correct metadata description for the work that actually happened. It wouldn't surprise me if she then just didn't think to delete the original MinnowTrap... metadataID from the database. This seems consistent with your finding that there is no data in the database associated with that metadataID and with my memory that I checked that all of Nikki's data had been put into database before she graduated.

The MarkRecap... metadataID that you reference is about our mark-recapture methods on Long Lake, which should be associated with projectID 3 as you say.

ctsolomon commented 4 years ago

9 - I commented on the google doc. I think you hit the nail on the head on all of these that I could decide on. There are a few that Stuart is better positioned to decide on.

kaijagahm commented 4 years ago

> 9 - I commented on the google doc. I think you hit the nail on the head on all of these that I could decide on. There are a few that Stuart is better positioned to decide on.

Thanks, Chris! @joneslabND, when you get a chance, can you take a look at the google doc referenced above and see if my plans sound okay?

@ctsolomon regarding 14 MinnowTrapSurvey.20120228 (Nikki Craig): I'm sorry, I should have been more clear. When I said there was "no data" affiliated with that metadataID, I meant that there was no additional data in the METADATA table. It looks like there is data in FISH_SAMPLES for that metadataID--I see 5172 samples. So it looks like she did end up using that metadataID after all. Here's a text file of those rows from FISH_SAMPLES, if you want to look at them: MinnowTrapSurvey.20120228.FISH_SAMPLES.txt

Given this info, do you think I should link "MinnowTrapSurvey.20120228" to the DietSurvey doc, or do we have reason to believe that there would be a different metadata file somewhere?

kaijagahm commented 4 years ago

At our meeting, we also talked about working the other way: finding files in Box and checking to see if they're in METADATA. Randi already opened another issue about this: #13. I'm going to move over there to document this part of the problem.
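
Checking in both directions is a two-way set difference. A rough Python sketch, assuming both lists of file names are available as plain strings; `Example.Unlisted.20200101.docx` is a made-up name used only to show the second direction:

```python
def compare_listings(listed_in_metadata, found_in_box):
    """Compare file names listed in METADATA against file names found in Box."""
    listed, found = set(listed_in_metadata), set(found_in_box)
    return {
        "missing_from_box": sorted(listed - found),
        "missing_from_metadata": sorted(found - listed),
    }


listed = ["ZoopCounts.20110517.docx", "MarkRecap.20120228.docx"]
in_box = ["MarkRecap.20120228.docx", "Example.Unlisted.20200101.docx"]  # second name is hypothetical
print(compare_listings(listed, in_box))
```

Given the odd Box search behavior noted earlier in this thread, the Box list would presumably need to come from browsing or exporting the folder contents rather than from search results.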

ctsolomon commented 4 years ago

Re 14 - let's talk about it on Tues

joneslabND commented 4 years ago

I agree we should talk on Tuesday about 14. Based on number of observations, etc. I think this might include minnow trapping on Long Lake.

joneslabND commented 4 years ago

@kaijagahm I went through the google doc and commented

kaijagahm commented 4 years ago

> 11. metadataID "FishscapesSurvey.hotBassLap.20180607": the document name should be "FishscapesSurvey.hotBassLap.20180607.docx", not "FishscapesSurvey.hotBassLap.20180607.doc" (because it's .docx in Box). Note that the document name in Box also had a trailing space after the 7--I fixed that directly in Box.

  15. After adding "FishscapesSurvey.hotBassLap20180607" to METADATA (see issue #13), I realized we now have two different metadataID's in the database for this, one with a period between Lap and 2018, and one without. So we have FishscapesSurvey.hotBassLap.20180607 and FishscapesSurvey.hotBassLap20180607. As far as I can tell, they're both associated with the same document, whose file name does have the period. There is data in the database associated with both of these metadataID's--seems that they weren't standardized properly. There are also two versions of the sampleID's--sampleID's correctly correspond to the listed metadataID, so there are some sampleID's with a period and some without.

kaijagahm commented 4 years ago

> I agree we should talk on Tuesday about 14. Based on number of observations, etc. I think this might include minnow trapping on Long Lake.

Stuart created this doc. To do: "Could you please look in the database for deployment times (difference between dateTimeSet and dateTimeSample) for minnow traps (MT) in each year that they were deployed in Long Lake?" draftMTmetadata.docx

kaijagahm commented 4 years ago

Update: 1 Changed metadataID's appropriately for Chelsea's blg morphology project (1114 to 0626). Note: changed 0627, 0628, 0629, and 0630 in the SAMPLES_FISH_SAMPLES_sampleID_problems_gh51_gh53.R script, not in this one.

2 and 7 @Randinotte @joneslabND any updates on finding the rhodamine files? Is there anyone else who I should ask who might have access to different/more computers?

3 and 4 Still waiting to hear back from Patrick. He emailed me today that he hasn't forgotten and will hopefully get back to me by the end of the week.

6 Followed up with Brittni today.

9 Fixed the multi-row metadataID's.

14 @joneslabND unfortunately, the minnow trap deployment times don't seem to be straightforward. Here's a text file that shows deployment times for each year. The column n is just the number of rows in FISH_SAMPLES that had that deployment time--this may or may not be a useful metric, since it will vary with the number of fish caught. As you can see, deployment time varied a lot within years. You'll also notice that there are a bunch where the deployTime is negative, which shouldn't be happening. I'm going to open a new GH issue to look into this: I don't know what's going on.
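
For reference, the deployment-time summary described in 14 (dateTimeSample minus dateTimeSet, counted per year, with negative values flagged) can be sketched in Python as below. The timestamp format and the three example rows are assumptions made up for illustration; they are not values from FISH_SAMPLES.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

# Made-up (dateTimeSet, dateTimeSample) pairs for illustration only.
samples = [
    ("2012-06-01 09:00:00", "2012-06-02 09:00:00"),
    ("2012-06-01 09:00:00", "2012-06-02 09:00:00"),
    ("2013-07-10 14:00:00", "2013-07-10 08:00:00"),  # sampled before it was set
]


def deploy_hours(set_str, sample_str):
    """Deployment time in hours: dateTimeSample minus dateTimeSet."""
    delta = datetime.strptime(sample_str, FMT) - datetime.strptime(set_str, FMT)
    return delta.total_seconds() / 3600


def summarize(samples):
    """Count rows per (year set, deployment hours) combination."""
    counts = Counter()
    for set_str, sample_str in samples:
        year = datetime.strptime(set_str, FMT).year
        counts[(year, deploy_hours(set_str, sample_str))] += 1
    return counts


summary = summarize(samples)
negatives = {key: n for key, n in summary.items() if key[1] < 0}
print(negatives)
```

As with the column n in the text file, the counts here are per row, so they scale with the number of fish caught rather than the number of traps.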

15 Fixed the hotBassLap metadataID's.
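
To catch other metadataIDs that differ only by punctuation (as the two hotBassLap IDs did), one option is to group IDs by a normalized key. This is a sketch under that assumption, in Python, and is not the script actually used for the fix:

```python
import re
from collections import defaultdict


def near_duplicates(ids):
    """Group IDs that become identical once case and all
    non-alphanumeric characters are ignored."""
    groups = defaultdict(set)
    for metadata_id in ids:
        key = re.sub(r"[^0-9a-z]", "", metadata_id.lower())
        groups[key].add(metadata_id)
    return [sorted(group) for group in groups.values() if len(group) > 1]


ids = [
    "FishscapesSurvey.hotBassLap.20180607",
    "FishscapesSurvey.hotBassLap20180607",
    "MarkRecap.20120228",
]
print(near_duplicates(ids))
```

The same grouping could be applied to sampleIDs, since those carried the same period/no-period split.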

joneslabND commented 4 years ago

There were no Rhodamine files on the computers in South Bend. Not sure if the computers at UNDERC got checked yet...

Thanks for creating the minnow trap deploy time issue!

kaijagahm commented 4 years ago

Still open: 2 and 7 Rhodamine files. I emailed Joey Vanderwall two weeks ago and got no response. Followed up today. Edit: RESOLVED! Got the files we needed.

3 and 4 Followed up with Patrick again today. Edit: he says he'll have them by the end of the week

6 Followed up with Brittni again today.

14 Waiting for Stuart to check physical MT data sheets: see #60.

kaijagahm commented 4 years ago

From #13:

There are a few metadataID's listed in METADATA that don't have any associated data in the database. Is this a problem? They are: GC5890.CH4.CO2.20110601, GC5890.CH4.CO2.20120618, pCO2.20110901, pCO2.20130318, Piezometer.SlugTests.ChemSamples.20180523, POC.20110601, and zoobenthos.Sample.20110519.

@ctsolomon @joneslabND do you have an immediate feeling on this? I guess it's a database philosophy question. Does it bother us to have metadataID's in the METADATA table when there's no data associated with them? Note that most if not all of these do have an associated file in Box.

My sense is that I'm not eager to delete them, in case they link to e.g. data that hasn't been entered yet, or something, but maybe this warrants further investigation.
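
A query for metadataIDs with no associated data is essentially an anti-join between METADATA and each data table. The Python sketch below uses an in-memory SQLite database with a single data table; the table layout is simplified and the row values are placeholders, so treat it as an illustration of the query shape rather than of the real schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE METADATA (metadataID TEXT);
CREATE TABLE FISH_SAMPLES (sampleID TEXT, metadataID TEXT);
INSERT INTO METADATA VALUES ('MinnowTrapSurvey.20120228'), ('POC.20110601');
INSERT INTO FISH_SAMPLES VALUES ('placeholderSample', 'MinnowTrapSurvey.20120228');
""")

# Keep the METADATA rows that find no match in the data table.
unused = [row[0] for row in con.execute("""
SELECT m.metadataID
FROM METADATA m
LEFT JOIN FISH_SAMPLES f ON f.metadataID = m.metadataID
WHERE f.metadataID IS NULL
ORDER BY m.metadataID
""")]
print(unused)
```

In practice the check would have to cover every table that carries a metadataID column, not just one, before calling an ID unused.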

UPDATE FROM MEETING 8/11: Err on the side of not deleting things. GC data is probably somewhere, just not in the database yet. For POC and zoobenthos: may have written metadata and then very quickly decided to change the protocol, before collecting data. In that case, it's nice to have a record of the thought process that went into the metadata, so can leave as is. For the zoobenthos metadataID: can check the long R script that Chris sent a while back--he did some poking around and may have figured that out. Only remove that one if it's an actual duplicate.

At some point, we want to check what the sampleID is for DOC measurements in the DOC table that came from a Piezometer (site == e.g. WL_P1). The metadataID would be for DOC analysis, but the sampleID should have a different metadataID--could be limno sample, but maybe should be "Piezometer.SlugTests.ChemSamples.20180523".

kaijagahm commented 4 years ago

Brittni got the file squared away. Still waiting to hear from Patrick--will bug him again tomorrow if I haven't heard.

Added Stuart's draft MT metadata file to Box; we can update it when we know more about the deploy times.

Added a placeholder doc for Patrick's dvm metadata to Box; we can likewise update it when I get the information from him.

Have added the updateID "metadataFix.2020" to UPDATE_METADATA in meta_3.4.0.R

kaijagahm commented 4 years ago

Received Patrick's metadata files. Only waiting on the MT issue, and we've decided to move forward on FigShare before finalizing that.

kaijagahm commented 4 years ago

Pushed all changes to db version 3.4.0 (20200820). Moved remaining problems over to #71. Closing this.

kaijagahm commented 3 years ago

I'm pretty sure this document, which I found on my Google Drive, pertains to this issue. No action should be required, but I wanted to put the document somewhere so it doesn't disappear into the ether when I leave. METADATA multiple docs for one metadataID.pdf