VTUL / vtechworks

DSpace at Virginia Tech
http://vtechworks.lib.vt.edu

URGENT: Files associated with the wrong records #159

Closed amandafrench closed 8 years ago

amandafrench commented 8 years ago

Numerous (or all) records in the College of Science and at least a few records in the College of Engineering on production have the wrong files associated with the records. Click on the PDF (or other bitstream) for any of the following to see the issue:

Note that each of the handle numbers in the URLs above does match the bitstream handle, but it does NOT match the handle.net URL listed in the item metadata record. See screenshot: note URL, note bottom left hover URL for bitstream, and note handle listed in record.

[Screenshot: 2016-01-06 17:01:19]

Possibly due to whatever SQL command @keithgee issued the other day?? If so, this is why I trust @kayiwa et al when they want to prevent us from issuing SQL commands to production. :) Might also be due to problems with DSpace batch metadata export and import: @mello99 was doing that for the College of Science and the College of Engineering last month, and so far I haven't been able to reproduce the issue in any other community.

keithgee commented 8 years ago

I've been looking at this, and I believe that the handles are correct and the identifier.uri metadata is incorrect. I believe that it doesn't make sense to try to restore the database. The last updates that I see to these items were on 12/02 and 12/03 by @mello99, and on 12/09 by @amandafrench. That's long enough ago that we'd undo a lot of work if we restored the database to that point (if a backup is even available), and I'm not sure that it would correct the problem. Here are two ideas I'm pursuing for correcting this:

  1. Use the batch metadata editing tool to correct the identifier.uri field to the setting of the handle. One concern about this is that it may have been a batch metadata editing bug that caused the problem.
  2. Write some code to connect to the database and update the identifier.uri metadata field to match the value contained in the handle database for each item. The concern here is that I haven't done this for a few years so it may take a little time, and the testing is important.
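As a rough illustration of the check behind idea 2, the sketch below only identifies items whose identifier.uri does not match their handle; it updates nothing. The record layout (`id`, `handle`, `identifier_uris`) and the sample values are made up for the example and are not the DSpace database schema.

```python
HANDLE_PREFIX = "http://hdl.handle.net/"


def find_uri_mismatches(items):
    """Return (item_id, expected_uri, recorded_uris) for every item whose
    recorded identifier.uri values are not exactly its own handle URL."""
    mismatches = []
    for item in items:
        expected = HANDLE_PREFIX + item["handle"]
        recorded = item.get("identifier_uris", [])
        # Flag missing, extra, or wrong URIs alike.
        if recorded != [expected]:
            mismatches.append((item["id"], expected, recorded))
    return mismatches


# Hypothetical records, for illustration only.
sample_items = [
    {"id": 1, "handle": "10919/0001",
     "identifier_uris": ["http://hdl.handle.net/10919/0001"]},
    {"id": 2, "handle": "10919/0002",
     "identifier_uris": ["http://hdl.handle.net/10919/0009"]},
]

for item_id, expected, recorded in find_uri_mismatches(sample_items):
    print(item_id, expected, recorded)
```

Running a read-only report like this on dev first would show how many items are affected before anything is changed.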

I think that we need to test either of these ideas on dev, so @amandafrench I'm sorry but this may not be an immediate fix.

Did either of you use the batch metadata editing tool? If so, any artifacts you saved may help resolve the problem.

keithgee commented 8 years ago

@amandafrench your title says that items have the wrong files associated with them. The last example in particular makes me think that it's not just the identifier.uri that's wrong, but the metadata is all scrambled up. So, in that last example, note that the file in the article matches the abstract metadata; it's the title metadata that's wrong.

amandafrench commented 8 years ago

I think both of us used the batch metadata editing tool, yes. But here's a wrinkle: I located the unedited metadata export file on my hard drive from the College of Science from 12/9/15. Everything looks just fine in the spreadsheet: it has the correct handle listed in dc.identifier.uri for the accompanying metadata. So, for instance, http://hdl.handle.net/10919/47889 is associated with "Spectral density and magnetic susceptibility for the asymmetric degenerate Anderson model" by Zhang and Lee. (this one looks correct at 7a.m. on Jan. 7, at least via a cursory glance on item view summary -KG) But importing the correct metadata in that csv file using DSpace Import Metadata still doesn't correct the data in the collection. I've put the file, titled 10919-5553.csv, in the "Metadata" folder in the VTechWorks Team Google folder.

keithgee commented 8 years ago

The suggestion above about writing a script to update the identifier.uri with the correct handle isn't a good idea anymore, I think. I misunderstood the problem, even with your title. I think that all items with the problem need to be identified, and that all of the metadata should be reviewed and corrected.

DSpace has a tool to update items - this may be useful as well. @amandafrench thanks for the spreadsheet.

amandafrench commented 8 years ago

And by the way @mello99 has lots of versions of those spreadsheets in that same folder.

keithgee commented 8 years ago

@amandafrench why do you say that 47889 is the correct handle for "Spectral density and magnetic susceptibility for the asymmetric degenerate Anderson model"?

keithgee commented 8 years ago

EDIT: wrong. I believe that 5937 is the correct handle for that item.

keithgee commented 8 years ago

Perhaps not. I'm quite confused at the moment.

amandafrench commented 8 years ago

I am too. :) That might have been a poor example: the record at http://vtechworks.lib.vt.edu/handle/10919/5937 lists both of those handles in the dc.identifier.uri field.

amandafrench commented 8 years ago

So, okay, here's what I mean by saying the spreadsheets look correct. http://hdl.handle.net/10919/48016, according to the spreadsheet, should go to an article called "Santini, P., et al., "The evolution of the dust and gas content in galaxies," A&A 562, A30 (2014). DOI: 10.1051/0004-6361/201322835." Naturally it doesn't at the moment, but if you do go to that handle the file is of that article. Also, if you go to the DOI at 10.1051/0004-6361/201322835, that's the Santini evolution of dust article.

KG Edit: This one looks normal again to me at 7 A.M. January 7

amandafrench commented 8 years ago

So essentially the handles are pointing to the correct bitstreams but the wrong metadata records.

amandafrench commented 8 years ago

Okay, I'm going home now and letting my poor cold cats in.

AF Edit 1/8/16: Cats were indeed cold, but otherwise fine. They refused to go outside the next morning, though.

mello99 commented 8 years ago

Okay @amandafrench , @alawvt , and @keithgee , after looking through the spreadsheets I think I found where I completely, utterly, and totally screwed up. I moved the handles to the "dc.identifier.uri[en_US]" field instead of the "dc.identifier.uri" field (no encoding), which is probably the source of the error. Also, DSpace has not fully reindexed the old spreadsheet that was uploaded this afternoon, even though Francis says that he triggered a reindex (I can tell because the content types haven't changed back to what they used to be; there were far more "article" content types). I think this is another case of us having to wait a few days (or hopefully one day) for the changes to show up.
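For what it's worth, a quick way to spot this kind of column mix-up before uploading is to scan the spreadsheet header for fields that appear both with and without a language tag. This is only a sketch: the header line below is an invented example, and the helper names are hypothetical.

```python
import csv
import io
import re

# Matches column names such as "dc.identifier.uri[en_US]".
LANG_SUFFIX = re.compile(r"^(?P<field>.+)\[(?P<lang>[^\]]*)\]$")


def split_column(name):
    """Split a column name into (field, language); language is None if absent."""
    match = LANG_SUFFIX.match(name.strip())
    if match:
        return match.group("field"), match.group("lang")
    return name.strip(), None


def language_variants(header):
    """Return the fields that occur in more than one column of the header,
    mapped to the language tags of those columns (None = no tag)."""
    variants = {}
    for column in header:
        field, lang = split_column(column)
        variants.setdefault(field, []).append(lang)
    return {f: langs for f, langs in variants.items() if len(langs) > 1}


# Invented header line, for illustration only.
sample_csv = "id,collection,dc.identifier.uri,dc.identifier.uri[en_US],dc.title[en_US]\n"
header = next(csv.reader(io.StringIO(sample_csv)))
print(language_variants(header))
```

Any field reported here has its values spread over two columns, which is worth a second look before importing.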

I can't apologize enough for the hassle, I feel awful. I will be much more careful next time and won't bulk-upload anything except collections containing fewer than 100 items. Mea culpa, mea culpa, mea maxima culpa :(.

keithgee commented 8 years ago

No worries, @mello99. We'll find a way to fix everything soon. Just make a note of where the spreadsheet is that has the most correct information so that we can work on it tomorrow. No need to spend any more time on it tonight!

Thanks for letting us know what happened! That's awesome and will let us get things back to normal.

amandafrench commented 8 years ago

Yes, no worries, @mello99 -- much better to be a hard worker who makes mistakes sometimes than be someone who never does anything! This could easily have been (and still might prove to have been) me. I committed one or two howlers with the THATCamp user database in my time, believe me.

One thing, though: it's the entire metadata record that seems to be incorrect in relation to the files, not just the handle listed in the "dc.identifier.uri" field -- are we sure it's not the item "id" fields that got messed up? But yes: it makes sense that reindexing would have to happen first before we see corrections.

By the way, I did a lot of spot-checking just now in the Special Collections community, since we did a lot of batch metadata editing there too, and those all seem fine.

keithgee commented 8 years ago

@amandafrench, I see what you're saying about it being not just identifier.uri. It's inconsistent now. Sometimes some of the metadata is correct - authors, for example. Sometimes tonight when I've looked, the abstract is correct and the title is wrong. Other times it is the opposite. I'm not sure if it was this way when I started looking at things or if it's changed. We'll fix this all tomorrow. Please, please, nobody try to fix this in production until we test on development. I think it may be changing things, and that makes it very hard to figure out what happened and to make a good plan to fix it. It's 9:47 PM right now and I see what looks like a batch upload as recently as 9:34 PM. Please, nobody do anything else on production to try and fix this right now. When the data is inconsistent, ordinary things can become unpredictable; plus, when things continually change, it makes it very hard both to figure out the problem and to plan a fix. Wait for me, trust me. :)

keithgee commented 8 years ago

@mello99 I still see updates happening but I don't know how to contact you.

keithgee commented 8 years ago

And sorry - I should have made it clear earlier that it might be best not to change a lot on the system right now until it's back to normal, including work on other communities and collections.

keithgee commented 8 years ago

And @amandafrench , @mello99 has also placed a very large PDF file in that metadata directory that shows which metadata was added/deleted on the 12/03 batch change. Very useful to have, especially if it matches the current state of the system.

mello99 commented 8 years ago

Sorry @keithgee , I was uploading batches of 50 items apiece because DSpace will index those automatically (we don't have to wait a day, or 2 days, etc.). I got up to 500 out of the 991, so 500 items in the College of Science collection are correct, but I will stop uploading now (let me know if you want me to restart). Thanks so much for your help, and thank you, @amandafrench , for your spot-checking and your compassion - I really appreciate it.
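Splitting the export by hand is tedious, so here is a minimal sketch of how a spreadsheet could be cut into 50-row pieces that each keep the header row. It works on CSV text in memory; the function name and the idea of scripting this step are assumptions, not how the uploads above were actually prepared.

```python
import csv
import io


def split_into_batches(csv_text, batch_size=50):
    """Split CSV text into a list of CSV texts, each holding the original
    header plus at most batch_size data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    batches = []
    for start in range(0, len(data), batch_size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(data[start:start + batch_size])
        batches.append(out.getvalue())
    return batches


# Invented data: 120 rows with a made-up id and title.
example = "id,dc.title\n" + "".join(f"{n},Title {n}\n" for n in range(1, 121))
for number, batch in enumerate(split_into_batches(example), start=1):
    print(number, batch.count("\n") - 1, "rows")
```

Each returned string can be saved as its own file and uploaded separately, so a failed batch only affects its own rows.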

keithgee commented 8 years ago

@mello99 is it fixing them? If you still see this call me at xxx-xxx-xxxx

mello99 commented 8 years ago

Yes, it's fixing them. Still want me to call you?

keithgee commented 8 years ago

yes

mello99 commented 8 years ago

Here's one of the spreadsheets - 10919-5553_35584_34167.csv.zip

keithgee commented 8 years ago

@mello99 demonstrated that her upload of 50 items at a time from the spreadsheet that she originally downloaded on December 2 is fixing things. Included so far is the item Amanda mentioned at http://vtechworks.lib.vt.edu/handle/10919/5937.

She will handle corrections for this problem. Please, nobody else change anything on production while she is working on it, especially me!

mello99 commented 8 years ago

I went back and uploaded all of the items, 50 per spreadsheet. I'm still having trouble with the first 50 items, which include many articles from the Dept. of Chemistry; for some reason DSpace isn't processing the changes. I'll try uploading a smaller batch (less than 50) and see what happens. The other content, beyond the Dept. of Chemistry articles, looks consistent now.

mello99 commented 8 years ago

Okay, now some of the uploads aren't taking, for some reason. I think that most of the files are okay except for several in the Dept. of Chemistry and Dept. of Physics collections, but I'm exhausted and need to get some sleep. I don't specifically know what's going on; I just know that I don't have the energy to upload each of the ~990 files one-by-one in one evening. It would be super if I could just upload the main spreadsheet and let DSpace process it, but I have no idea whether DSpace actually WILL process it (or how long it will take DSpace to process it), sigh. I'll get back to this in the morning after the Atmire meeting.

mello99 commented 8 years ago

This item is a mess - http://hdl.handle.net/10919/24398. I've reuploaded the metadata for the metadata (very meta, I know) and the metadata for the file, and yet there's still this discrepancy. Hopefully all will be well when Tomcat restarts (fingers crossed). At least Dr. Merola's articles are correctly synchronized.

KG Edit: This one looks good as of 7AM January 7

keithgee commented 8 years ago

Sleep is definitely important! I was wondering, too, just how many uploads would be required to fix all of the items. That's a lot of work!

If this still seems to be working, maybe we can split this work up - if you trust us - and/or try slightly larger uploads. I like this method because it seems successful so far. As of 6 A.M., both Solr and the VTechWorks xmlui have been restarted. @mello99, the item that you mentioned at http://hdl.handle.net/10919/24398 looks fine to me at the moment, but I might be missing something or it may have finally updated.

Looks good! Maybe VTechWorks finally caught up?

alawvt commented 8 years ago

I tried changing the handles of a couple of items in my local VM from the dc.identifier.uri field to the dc.identifier.uri[en_US] field, both manually and with the metadata upload tool. This did not mess up the handle or metadata, so I don't think that was the cause of this issue. I do recall that @mello99 reported that the metadata upload tool seemed to hang sometimes on large spreadsheets. I wonder if one or more of them were processed incorrectly.

keithgee commented 8 years ago

@alawvt, no, I don't think that caused the issue, either. When we looked at the oldest CSV file that we had for these items, the "id" column (the leftmost column) was incorrect for several items. (EDIT KG: Or, if the id column was correct, at least some of the other metadata didn't match the metadata for that id.) That explains why the metadata and the items (with the corresponding bitstreams) were mismatched. What's still a mystery to me is why the IDs were incorrect in the CSV file. Possibilities I can think of are a DSpace bug, an Excel bug, or a human mistake.

In any case, I think we can monitor closely for this problem in the future by carefully reviewing the list of changes that DSpace says it will make when the CSV file is uploaded. Melissa has been meticulous about saving this list of changes. Also, when making changes with the batch metadata editing tool, it's possible to delete most of the columns in the CSV file if they aren't part of the metadata changes. Example: delete any abstract and title columns from the spreadsheet if we're only working on changing author names. That might make it easier to look at the spreadsheet, and also faster for DSpace to process the changes. Thanks for your work today in identifying mismatched items.
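The column-trimming suggestion could be scripted roughly as follows, so that nobody has to delete columns by hand in a spreadsheet program. This is a sketch under assumptions: it keeps only the "id" column plus the columns named in `wanted`, and the sample rows are invented.

```python
import csv
import io


def keep_columns(csv_text, wanted):
    """Return CSV text containing only the 'id' column and the wanted columns,
    in their original order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if "id" not in (reader.fieldnames or []):
        raise ValueError("expected an 'id' column")
    fields = ["id"] + [c for c in reader.fieldnames if c in wanted and c != "id"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Columns not listed in `fields` are dropped silently.
        writer.writerow(row)
    return out.getvalue()


# Invented rows: only the author column is being edited.
full = (
    "id,dc.title,dc.contributor.author,dc.description.abstract\n"
    "1,Title one,Author one,Abstract one\n"
    "2,Title two,Author two,Abstract two\n"
)
print(keep_columns(full, {"dc.contributor.author"}))
```

With the title and abstract columns gone, a shifted row can no longer overwrite them, and the change list to review stays short.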

mello99 commented 8 years ago

So sorry for the previous comment; I accidentally responded to Keith's comment from 2 nights ago, lol. Thanks, @keithgee and @alawvt for your warm words and for continuing to work on this. @keithgee , please try to enjoy your weekend! Thank you!

mello99 commented 8 years ago

Okay, so I've been looking through the 12/02/2015 College of Science spreadsheet for mismatches between files and metadata, as well as any other issues that might crop up. Of the 990 items in the College of Science collection, I've gone through (one-by-one) 740.

49 items, thus far, are flat-out mismatches (see the last column, column I, on the spreadsheet for the "Notes" field). 7 items were duplicates that had no associated files (these are the items referenced in #161 ); I went ahead and deleted these items per Keith's instructions. 1 item doesn't have an associated file (http://hdl.handle.net/10919/25051), but I'm not sure why. Another item was withdrawn from the repository (http://hdl.handle.net/10919/52750), but again, I'm not sure why (or when).

I'm going to try to finish this up tomorrow, and I will move quickly. I won't be in until 11 am today because I've been up late. Thanks for everything.

10919-5553_Mismatches_copy.csv.zip

keithgee commented 8 years ago

I'm using this information to restore metadata, starting with the third row in your table (you had questions about the first two), to get a feel for how long this strategy will take. I'm doing this based on the item id, because "identifier.uri" is metadata and not reliable in our scenario. Remind me at the stand-up, @mello99, to address your other questions either then or afterward.
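To give a feel for the id-based approach, here is a simplified sketch of how a restore plan could be drawn up from an older, trusted snapshot. It is not the code I am running; the dictionaries keyed by item id, the field names, and the sample values are all hypothetical.

```python
def plan_restore(snapshot, current, item_ids):
    """For each listed item id, return the fields whose snapshot value differs
    from the current value, mapped to the snapshot value to restore."""
    plan = {}
    for item_id in item_ids:
        if item_id not in snapshot:
            continue  # no trusted copy to restore from
        old = snapshot[item_id]
        new = current.get(item_id, {})
        changed = {field: value for field, value in old.items()
                   if new.get(field) != value}
        if changed:
            plan[item_id] = changed
    return plan


# Hypothetical metadata keyed by item id.
snapshot = {
    10: {"dc.title": "Title A", "dc.description.abstract": "Abstract A"},
    11: {"dc.title": "Title B", "dc.description.abstract": "Abstract B"},
}
current = {
    10: {"dc.title": "Title B", "dc.description.abstract": "Abstract A"},
    11: {"dc.title": "Title B", "dc.description.abstract": "Abstract B"},
}
print(plan_restore(snapshot, current, [10, 11, 12]))
```

Matching on the item id avoids trusting any metadata value, including identifier.uri, to identify the record.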

alawvt commented 8 years ago

Our procedure, "Editing metadata with Batch Metadata Editing tool," https://redmine.lib.vt.edu/projects/vtw_content_acquire_ingest/wiki/Editing_metadata_with_Batch_Metadata_Editing_tool#section-5, includes downloading and saving the metadata for a community or collection before a metadata change and saving the spreadsheet of changes. I included a warning about DSpace's occasional failure to confirm an upload and an optional step to check the metadata after upload. Please feel free to edit the procedure.

keithgee commented 8 years ago

@mello99 metadata for the items you identified as problematic in your spreadsheet attachment (except for the first two rows) was restored according to the copy of the database I have from November 11. Most are correct now but I think that just a few had incorrect or less than great metadata before that date.

Let me know if these corrections seem fine, and also let me know the remaining items that should be restored. Thanks!

mello99 commented 8 years ago

Hi @keithgee , thanks so much for doing the database operation and correcting the bulk of the problematic records. Of the 49 problematic records, 45 have been corrected. Of those 45 records, 3 have minor metadata issues like misspellings that I can easily correct myself.

4 records, however, still have incorrect or missing files. The wrong files were probably uploaded a while back (and, regarding the item with the missing file, someone probably forgot to upload it altogether). I've located all of the necessary files, though, so I can easily upload the correct files myself. Below is the updated College of Science spreadsheet that lists which records have been fixed and which records are outstanding.

Thanks so much for everything, @keithgee , and @amandafrench , I'll start working on the College of Engineering collection shortly.

2016_01_12_10919-5553_Mismatches.csv.zip

keithgee commented 8 years ago

@mello99, I just looked at the 4 items you listed as incorrect in the spreadsheet. This time I went all the way back to our June 16 2015 dump of the database, and the metadata and the files listed are the same as they are now. So, whatever is wrong with them isn't caused by our recent issue!

I'm not sure if you want to adjust the metadata to match the files or the files to match the metadata, but if these aren't the way they were when they were first deposited - and they might be - it seems much more likely that the metadata was accidentally changed in the items, rather than the files.

Looking in the logs, the Cairns collection in particular looks to have been deposited by Kimberli - you could check with her to see what the original intent was or to see if she has the deposit materials, which aren't in the usual place on the server. Thanks for bearing with this process!

alawvt commented 8 years ago

The record of the Cairns batch load is in gateway1 VTechWorks/SAF_batchloads/2015_SAF_Batchload_Progress.xls, sheet 2013-2014. The batch is on gateway1 VTechWorks/SAF_batchloads/20131219_10919_24381/.

mello99 commented 8 years ago

Okay, @alawvt and @keithgee , I've uploaded the original Cairns spreadsheet to GitHub (see below).

One of the outstanding 4 items, "Eco-Ethics and Sustainability Ethics", http://hdl.handle.net/10919/25051, does not have a file because the filename was incorrectly entered in the original CSV (the file was named "cairns_ecoethics_sustainability_ethicsecoethics_sustainability_ethics2.pdf" instead of "ecoethics_sustainability_ethics2.pdf"). I will upload the correct file today, which is currently available on Gateway 1, then update you when I'm done.

Another problematic item, http://hdl.handle.net/10919/25016, was also incorrectly listed on the original spreadsheet. The metadata, not the file, is duplicative and is the main culprit here. I think that whoever completed the original spreadsheet accidentally entered the same metadata twice for 2 different records. I will also fix this today, then update you.

The third problematic item, "Eco-ethics and the Biosphere", http://hdl.handle.net/10919/25015, is another case of incorrect metadata having been entered on the original Cairns spreadsheet (attached below). The filename associated with this record is the same as what's currently in VTW, but whoever created the metadata entered the wrong information. I will fix this today and then update you.

The last outstanding record, "Targeting folded RNA: A branched peptide boronic acid that binds to large surface area of HIV-1 RRE RNA", http://hdl.handle.net/10919/51701, is a WOS upload that appears to have had the wrong file uploaded with it (probably by me, since this was done in April 2015). I doubt that this was a batch upload since many of the WOS collections were quite small. Since this is an isolated incident, I have no reason to believe that it's related to the earlier mix-up. I will fix this today, and then update you.

Manuel is currently checking the College of Engineering for issues, and I will likely join him since there are over 2,000 records. I hope this information helps, and take care.

20131219_10919_24381.csv.zip

mello99 commented 8 years ago

Fixed the 4 outstanding files: http://hdl.handle.net/10919/25015, http://hdl.handle.net/10919/25016, http://hdl.handle.net/10919/51701, and http://hdl.handle.net/10919/25051.

keithgee commented 8 years ago

Great work finding and fixing that, @mello99 ! Now it's better than before! ;)

mello99 commented 8 years ago

Okay, @amandafrench, @alawvt, and @keithgee, based on Manuel's preliminary search of 100 College of Engineering items (out of 2,375), the College of Engineering community is a mess too (see the Google Sheets tracking document). Here's a link to the College of Engineering spreadsheets on Google Drive - the folder contains the original and imported spreadsheets. My initial thought is that some of the rows got mixed around, like the College of Science spreadsheet did, and that we're going to have to re-import metadata from Keith's early November database. Thanks all.

mello99 commented 8 years ago

Hi @amandafrench and @keithgee , attached is a spreadsheet that lists the items in the College of Engineering that were uploaded after November 1, 2015. There are only 7 items altogether; hopefully I didn't overlook anything. Let me know if you need any additional information. Thanks, talk to you soon.

COE_NOV_1.csv.zip

keithgee commented 8 years ago

Thanks, @mello99. I'll work on updating my method for restoring metadata to the old version so that it can work with an entire community or collection at once (except for the items you listed), instead of a single item at a time. If that works out, I'll also need to index VTechWorks, which will be done sometime in an evening or early morning. I feel like we might be able to wrap it up this week!

mello99 commented 8 years ago

Hi @keithgee , thanks for looking this over. In accordance with Amanda's instructions during our standup conversation, I'm going to say that the COE community is ready to be rolled back. Thanks for your hard work, and take care.

keithgee commented 8 years ago

I'm working on this now. I've restored metadata for the College of Engineering to the November 11 state. I'll start re-indexing the Solr search/browse/item counts soon. In the course of working on this, I see that I have wiped out metadata for item 74700 (https://vtechworks.lib.vt.edu/handle/10919/64499). This item was uploaded to the College of Engineering today, and thus wasn't accounted for in the spreadsheet of items uploaded since November 1 that was provided to me this morning. I made a backup of the database before I deleted metadata for the community, so I can recover metadata for this item from the backup if necessary. @mello99, if this was uploaded by our team and you have a copy of the metadata, it's easier for me if you restore it from your copy. Let me know. The same goes for any items that were uploaded to the College of Engineering today and weren't accounted for on the spreadsheet.

Thanks!

mello99 commented 8 years ago

Ok, the only issue that I've encountered with the metadata thus far is that the following item - http://hdl.handle.net/10919/64499 - has been mapped to the Event Capture collection, but it doesn't appear in the recent list of items uploaded to the Event Capture collection. This is the Schnabel Engineering Lecture item that I uploaded today and whose metadata was erased (it's since been re-done). I personally don't have a problem with the collection mishap since I gave Liz the item's handle, so anyone who receives the handle will be taken directly to that item, but Liz McVoy might be curious as to why the item doesn't appear in Event Capture.

keithgee commented 8 years ago

@mello99 it is re-indexing now. I've discovered a mistake I made. I accidentally typed '74839' instead of '74389' in one place in a list of the 7 item ids to preserve. I should have used copy/paste. So let me know if I need to restore metadata for the item at http://hdl.handle.net/10919/64211, or if you can handle it. Sorry!
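A small guard like the sketch below would have caught the transposed digits before anything was deleted: it compares the hand-typed preserve list against the ids taken from the spreadsheet. Only 74389 and 74839 come from this thread; the other ids and the function name are made up for the example.

```python
def check_preserve_list(preserve_ids, known_ids):
    """Return (unknown, missing): ids in the preserve list that the spreadsheet
    does not contain, and spreadsheet ids absent from the preserve list."""
    preserve = set(preserve_ids)
    known = set(known_ids)
    unknown = sorted(preserve - known)
    missing = sorted(known - preserve)
    return unknown, missing


# Ids from the spreadsheet (70001 and 70002 are placeholders).
spreadsheet_ids = ["74389", "70001", "70002"]
# Hand-typed list containing the transposition described above.
typed_ids = ["74839", "70001", "70002"]

unknown, missing = check_preserve_list(typed_ids, spreadsheet_ids)
if unknown or missing:
    print("Do not proceed. Unknown:", unknown, "Missing:", missing)
```

Refusing to run the destructive step unless both lists come back empty is cheap insurance against this kind of typo.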

keithgee commented 8 years ago

Thanks for checking. I am hoping that the re-indexing will also make the item appear in Event Capture. It may take most of the night for re-indexing to finish. Also, I opened a separate issue #169 for the first item on Manuel's preliminary search - it looks to be a different, display-related problem.