eXtensibleCatalog / Drupal-Toolkit

The eXtensible Catalog Drupal Toolkit

Date Normalization #165

Open patrickzurek opened 7 years ago

patrickzurek commented 7 years ago

JIRA issue created by: rcook Originally opened: 2012-02-17 05:41 PM

Issue body: (nt)

patrickzurek commented 7 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2012-02-17 05:43 PM

Comment body:

Future possible work. Unclear whether this would be work for MST service clean up or handling in Drupal.

-----Original Message----- From: Bowen, Jennifer Sent: Friday, February 17, 2012 12:06 PM To: Brand, John; Cook, Randall; Kiraly, Peter Cc: Lindahl, David Subject: RE: Data errors in T0

I looked at the file that Peter attached and it looks like we have a whole lot of data that is mis-tagged. It would be nice if we could clean up some of this - Peter, is there any easy way to view the record numbers for these? Unfortunately there isn't a way in Voyager to limit to just the 260 $c so the only ones I could track down with just the text string are those that are really distinctive strings. It looks like it is a mix between local data errors and problems in records that we purchased from elsewhere.

Even if we cleaned up all of the ones that are incorrect, though, we would still have all of the Roman numerals, the ones that start "Printed by..." etc. since the 260 $c is a free text field. I was mapping that to the XC Schema instead of the coded value since I think that libraries will want to see that data. I have also thought that perhaps we could map the coded dates instead - it is on my list of possible future enhancements. Whatever we do, we would need to do some significant date normalization and enhancement, whether we take the date from where we are taking it now, or from the coded field instead.

So, let's note this as a possible future enhancement - additional date normalization.

Jennifer

-----Original Message----- From: Brand, John Sent: Thursday, February 16, 2012 9:19 AM To: Cook, Randall; Kiraly, Peter Cc: Lindahl, David; Bowen, Jennifer; Wesley, MT; Arbelo, Ralph; 'Delis, Christopher'; 'BGant@uillinois.edu' Subject: RE: Data errors in T0

For the manifestation below, id="oai:mst.rochester.edu:MetadataServicesToolkit/marctoxctransformation/17684564", it has:

```
<dcterms:issued>c1997.</dcterms:issued>
```

Because, in the bib: oai:mst.rochester.edu:MetadataServicesToolkit/marcnormalization/8822356

```
<marc:subfield code="a">New York :</marc:subfield>
<marc:subfield code="b">Longman,</marc:subfield>
<marc:subfield code="c">c1997.</marc:subfield>
```

The marctoxctransformation service simply does this:

```
// Create an dcterms:issued based on the 260 $c values
transformInto = processFieldBasic(transformMe, transformInto, 260, 'c', "issued", AggregateXCRecord.DCTERMS_NAMESPACE, null, FrbrLevel.MANIFESTATION);
```

Does this answer the question?

-----Original Message-----
From: Cook, Randall
Sent: Thursday, February 16, 2012 8:32 AM
To: Kiraly, Peter
Cc: Brand, John; Lindahl, David; Bowen, Jennifer; Wesley, MT; Arbelo, Ralph; Delis, Christopher; BGant@uillinois.edu
Subject: RE: Data errors in T0

Thanks, John needs to look at the attachment on your first note then.

-----Original Message-----
From: Péter Király [mailto:kirunews@gmail.com]
Sent: Thursday, February 16, 2012 8:31 AM
To: Cook, Randall
Cc: Brand, John; Lindahl, David; Bowen, Jennifer; Wesley, MT; Arbelo, Ralph; Delis, Christopher; BGant@uillinois.edu
Subject: Re: Data errors in T0

Sorry, you are right, all the listed bibid problems are in DT. The relevant issue is with the dcterms:issued field.

Péter

2012/2/16 Cook, Randall rcook@library.rochester.edu:

> I think you are confusing two different types. See snippet below.
> There is one bib number, the type NRU of 1535122.
>
> id="oai:mst.rochester.edu:MetadataServicesToolkit/marctoxctransformation/17684564" type="manifestation"
>
> ```
> 95051667
> 33947225
> 1535122
> ```
>
> -----Original Message-----
> From: Péter Király [mailto:kirunews@gmail.com]
> Sent: Thursday, February 16, 2012 7:35 AM
> To: Brand, John; Lindahl, David; Bowen, Jennifer
> Cc: Cook, Randall; Wesley, MT; Arbelo, Ralph; Delis, Christopher; BGant@uillinois.edu
> Subject: Data errors in T0

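Since the transformation copies the free-text 260 $c straight into dcterms:issued, strings like "c1997." reach Drupal unparsed. A minimal sketch of the kind of year extraction this thread is asking for; the class and method names are hypothetical, not part of the Toolkit:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not MST code): pulls the first plausible
// four-digit year out of a free-text 260 $c value such as "c1997."
public final class IssuedYearExtractor {
    // A year between 1000 and 2999; the "c" copyright marker and any
    // trailing punctuation are simply ignored by the match.
    private static final Pattern YEAR = Pattern.compile("[12][0-9]{3}");

    public static Optional<Integer> firstYear(String subfieldC) {
        if (subfieldC == null) return Optional.empty();
        Matcher m = YEAR.matcher(subfieldC);
        return m.find() ? Optional.of(Integer.parseInt(m.group()))
                        : Optional.empty();
    }
}
```

A value with no digits at all, such as "Printed by the author", would simply produce no date rather than a garbage field.
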
patrickzurek commented 7 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2012-02-20 07:13 PM

Comment body:

Some of the problem dates, listed below, are Roman numerals; these will need future work.

• M. DC. LXXXII.
• M.DC.LXXXII.
• M.DC.LXXXIII.
• M. DC. LXXXIV.
• M.DC.LXXXIV.]
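Once the dots, spaces, and brackets are stripped, these are ordinary Roman numeral years (M.DC.LXXXII. is 1682). A sketch of a parser for them, assuming we reject any string containing non-Roman characters; the class name is hypothetical and this is not existing Toolkit code:

```java
// Hypothetical helper (not MST code): parses Roman numeral years as
// they appear in 260 $c, e.g. "M. DC. LXXXII." or "M.DC.LXXXIV.]".
public final class RomanYear {
    private static int value(char c) {
        switch (c) {
            case 'M': return 1000;
            case 'D': return 500;
            case 'C': return 100;
            case 'L': return 50;
            case 'X': return 10;
            case 'V': return 5;
            case 'I': return 1;
            default:  return -1; // not a Roman digit
        }
    }

    // Returns the year, or -1 if the string contains anything other
    // than Roman digits once punctuation, spaces, and brackets go.
    public static int parse(String raw) {
        String s = raw.toUpperCase().replaceAll("[.,\\s\\[\\]]", "");
        if (s.isEmpty()) return -1;
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            int v = value(s.charAt(i));
            if (v < 0) return -1;
            // Subtractive notation: smaller digit before larger (IV, IX, ...)
            if (i + 1 < s.length() && value(s.charAt(i + 1)) > v) {
                total -= v;
            } else {
                total += v;
            }
        }
        return total;
    }
}
```
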

patrickzurek commented 7 years ago

JIRA Comment by user: jbowen JIRA Timestamp: 2012-02-28 09:39 PM

Comment body:

I've been looking into some of the weird things that show up on this list, and forgot that there is a normalization step that pulls date ranges out of the 008 field - that's where the "uuuu" fields are coming from! Peter, are those date ranges (when they work correctly) useful to you? This would end up being a field that has a range separated by a hyphen, e.g. "1910-1955".

I think we could get rid of a lot of the garbage by redefining the Normalization step to substitute spaces when it encounters "uuuu" - it is already doing that when it encounters "9999". And, also to not create a field when there are no numeric values at all in the field - this would prevent both "uuuu-" and "-uuuu". However, we would still get cases like "1994-" and "-1944" - i.e. where the beginning or ending date of a range is unknown, it would still create the field, which is what we want. How does this sound? We could assign this to Chris with the other norm changes.

Peter, are you already doing something to replace "u" characters with spaces when you encounter them within a year, e.g. "195u-"? The normalization step doesn't deal with this either, although it could…

Jennifer
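The rules Jennifer proposes above could be sketched roughly as follows. This is a hypothetical helper under assumed names, not the actual Normalization service:

```java
// Sketch (not MST code) of the proposed rule: treat "uuuu" like "9999"
// (unknown), and emit no date field at all when neither end of the
// 008 range contains a digit.
public final class DateRangeNormalizer {
    private static String cleanEnd(String end) {
        if (end == null) return "";
        String e = end.trim();
        if (e.equals("uuuu") || e.equals("9999")) return "";
        return e;
    }

    // Returns e.g. "1910-1955", "1994-", "-1944", or null when the
    // whole range is unknown ("uuuu-uuuu", "uuuu-", "-uuuu", ...).
    public static String normalize(String begin, String end) {
        String b = cleanEnd(begin);
        String e = cleanEnd(end);
        boolean hasDigit = (b + e).chars().anyMatch(Character::isDigit);
        if (!hasDigit) return null;
        return b + "-" + e;
    }
}
```

Open-ended ranges like "1994-" and "-1944" still come through, which is the behavior the comment asks to keep.
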

patrickzurek commented 7 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2012-02-29 12:51 PM

Comment body:

I really would like this conversation that may lead to assigned work to be tracked in an issue instead of threads in email that can get lost.

From: Péter Király [mailto:kirunews@gmail.com] Sent: Tuesday, February 28, 2012 5:09 PM To: Bowen, Jennifer Cc: Cook, Randall Subject: Re: Data errors in T0

Right now the date handling function is really simple. I do not handle the "-1943" or "1885-" types of dates correctly; they simply become "1943" and "1885". Since there is a lot of variation even in the normalized dates, I try to extract the numbers and apply the assumption that there are dates of the 19?? type, but none like ??76. So I create 1900 from 19?? and 1990 from 199? (the question mark can be a real question mark, 'u', hyphen, period, etc.)
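Péter's heuristic as described, where unknown trailing positions in a year read as zero, might look like this (hypothetical helper under assumed names, not the actual code):

```java
// Sketch (not MST code) of the described heuristic: placeholder
// characters ('u', '?', '-', '.') inside a year token read as zero,
// on the stated assumption that dates look like 19??, never ??76.
public final class FuzzyYear {
    public static int resolve(String yearToken) {
        // Replace each placeholder with '0', then read the result.
        String y = yearToken.replaceAll("[u?.\\-]", "0");
        return Integer.parseInt(y);
    }
}
```

So "19uu" resolves to 1900 and "199?" to 1990, exactly the behavior the comment describes.
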

Once we had a chat about filling in the internal values of date ranges, so that 1992-1995 would become 1992, 1993, 1994, and 1995. There is a similar kind of solution in the CDL work you reported. I guess we should first analyze the different kinds of existing date formats and create a precise specification, because this topic could become very complex unless we keep it very simple and basic.
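The range-filling idea mentioned above could be sketched as follows (hypothetical helper, not the actual code, assuming a well-formed "begin-end" input):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not MST code): expand a year range like "1992-1995" into
// every year it contains, so each year is individually searchable.
public final class YearRangeExpander {
    public static List<Integer> expand(String range) {
        String[] parts = range.split("-");
        int begin = Integer.parseInt(parts[0].trim());
        int end = Integer.parseInt(parts[1].trim());
        List<Integer> years = new ArrayList<>();
        for (int y = begin; y <= end; y++) {
            years.add(y);
        }
        return years;
    }
}
```
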

To estimate the problem, we can start with some typical date types:

For the problematic points, I guess it would be fruitful to attach some use cases, describing different search situations.

So the conclusion:

Péter

patrickzurek commented 7 years ago

JIRA Comment by user: jbowen JIRA Timestamp: 2012-02-29 02:09 PM

Comment body:

Peter, thanks for this explanation - you've laid out the situation very clearly. At some point when we are really ready to tackle this issue, we should pull all of this information about what happens now, and what we would get from the CDL algorithms, into a single document. The question I have now is: is it doing us any good to have the beginning and ending dates of a date range as part of our date facets? Or conversely, is having those beginning and ending dates causing any problems? If we aren't seeing a benefit, or it is causing confusing results, then we could simply turn off the normalization step that creates those dates, and we would be left with the date info from the MARC 260 $c only. The 260 $c may repeat that range of dates, or provide a more complex string of full text that includes dates (which you are already dealing with successfully), or it might not include anything at all. So we might miss out on some dates that way.

If we decide to KEEP mapping the beginning and ending range dates from the 008, then we need to get rid of those "uuuu-" and "-uuuu" strings, so will need to make a small change to the Normalization step.

I suspect that the best plan of attack is this, for the short term:

  1. Continue mapping the beginning and ending dates for the ranges (with the step turned on), and monitor the date facets, do usability testing, etc., to investigate whether there is some downside to what we are doing now.
  2. Make the change to the Norm step to change the "uuuu" to blanks, so that we don't create dc:issued fields for those at all.
  3. Continue with the Voyager date cleanup of other stuff in Peter's log file.
  4. Move any other investigation out to the future.

What do you think?

patrickzurek commented 7 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2012-05-01 05:28 PM

Comment body:

Jennifer, checking that you have the uuuu steps documented elsewhere; then I am going to move this out of Undecided and into Deferred (if there is even work we decide to do here).