eXtensibleCatalog / Drupal-Toolkit

The eXtensible Catalog Drupal Toolkit

Date Normalization #165

Open patrickzurek opened 7 years ago

patrickzurek commented 7 years ago

JIRA issue created by: rcook Originally opened: 2012-02-17 05:41 PM

Issue body: (nt)

patrickzurek commented 7 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2012-02-17 05:43 PM

Comment body:

Future possible work. Unclear whether this would be work for MST service clean up or handling in Drupal.

-----Original Message----- From: Bowen, Jennifer Sent: Friday, February 17, 2012 12:06 PM To: Brand, John; Cook, Randall; Kiraly, Peter Cc: Lindahl, David Subject: RE: Data errors in T0

I looked at the file that Peter attached and it looks like we have a whole lot of data that is mis-tagged. It would be nice if we could clean up some of this - Peter, is there any easy way to view the record numbers for these? Unfortunately there isn't a way in Voyager to limit to just the 260 $c so the only ones I could track down with just the text string are those that are really distinctive strings. It looks like it is a mix between local data errors and problems in records that we purchased from elsewhere.

Even if we cleaned up all of the ones that are incorrect, though, we would still have all of the Roman numerals, the ones that start "Printed by..." etc. since the 260 $c is a free text field. I was mapping that to the XC Schema instead of the coded value since I think that libraries will want to see that data. I have also thought that perhaps we could map the coded dates instead - it is on my list of possible future enhancements. Whatever we do, we would need to do some significant date normalization and enhancement, whether we take the date from where we are taking it now, or from the coded field instead.

So, let's note this as a possible future enhancement - additional date normalization.

Jennifer

-----Original Message----- From: Brand, John Sent: Thursday, February 16, 2012 9:19 AM To: Cook, Randall; Kiraly, Peter Cc: Lindahl, David; Bowen, Jennifer; Wesley, MT; Arbelo, Ralph; 'Delis, Christopher'; 'BGant@uillinois.edu' Subject: RE: Data errors in T0

For the manifestation below, id="oai:mst.rochester.edu:MetadataServicesToolkit/marctoxctransformation/17684564", it has:

```
<dcterms:issued>c1997.</dcterms:issued>
```

Because, in the bib: oai:mst.rochester.edu:MetadataServicesToolkit/marcnormalization/8822356

```
<marc:subfield code="a">New York :</marc:subfield>
<marc:subfield code="b">Longman,</marc:subfield>
<marc:subfield code="c">c1997.</marc:subfield>
```

The marctoxctransformation service simply does this:

```
// Create an dcterms:issued based on the 260 $c values
transformInto = processFieldBasic(transformMe, transformInto, 260, 'c', "issued", AggregateXCRecord.DCTERMS_NAMESPACE, null, FrbrLevel.MANIFESTATION);
```

Does this answer the question?

-----Original Message-----
From: Cook, Randall
Sent: Thursday, February 16, 2012 8:32 AM
To: Kiraly, Peter
Cc: Brand, John; Lindahl, David; Bowen, Jennifer; Wesley, MT; Arbelo, Ralph; Delis, Christopher; BGant@uillinois.edu
Subject: RE: Data errors in T0

Thanks, John needs to look at the attachment on your first note then.

-----Original Message-----
From: Péter Király [mailto:kirunews@gmail.com]
Sent: Thursday, February 16, 2012 8:31 AM
To: Cook, Randall
Cc: Brand, John; Lindahl, David; Bowen, Jennifer; Wesley, MT; Arbelo, Ralph; Delis, Christopher; BGant@uillinois.edu
Subject: Re: Data errors in T0

Sorry, you are right, all the listed bibid problems are in DT. The relevant issue is with the dcterms:issued field.

Péter

2012/2/16 Cook, Randall rcook@library.rochester.edu:

> I think you are confusing two different types. See snippet below.
> There is one bib number, the type NRU of 1535122.
>
> id="oai:mst.rochester.edu:MetadataServicesToolkit/marctoxctransformation/17684564" type="manifestation"
>
> ```
> 95051667
> 33947225
> 1535122
> ```
>
> -----Original Message-----
> From: Péter Király [mailto:kirunews@gmail.com]
> Sent: Thursday, February 16, 2012 7:35 AM
> To: Brand, John; Lindahl, David; Bowen, Jennifer
> Cc: Cook, Randall; Wesley, MT; Arbelo, Ralph; Delis, Christopher; BGant@uillinois.edu
> Subject: Data errors in T0

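Since the transformation copies the free-text 260 $c straight into dcterms:issued, strings like "c1997." reach Drupal unparsed. A minimal sketch of the kind of year extraction this thread is asking for; the class and method names are hypothetical, not part of the Toolkit:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not MST code): pulls the first plausible
// four-digit year out of a free-text 260 $c value such as "c1997."
public final class IssuedYearExtractor {
    // A year between 1000 and 2999; the "c" copyright marker and any
    // trailing punctuation are simply ignored by the match.
    private static final Pattern YEAR = Pattern.compile("[12][0-9]{3}");

    public static Optional<Integer> firstYear(String subfieldC) {
        if (subfieldC == null) return Optional.empty();
        Matcher m = YEAR.matcher(subfieldC);
        return m.find() ? Optional.of(Integer.parseInt(m.group()))
                        : Optional.empty();
    }
}
```

A value with no digits at all, such as "Printed by the author", would simply produce no date rather than a garbage field.
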
patrickzurek commented 7 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2012-02-20 07:13 PM

Comment body:

Some of the problem dates, listed below, are Roman numerals; these will need future work.

• M. DC. LXXXII.
• M.DC.LXXXII.
• M.DC.LXXXIII.
• M. DC. LXXXIV.
• M.DC.LXXXIV.]
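Once the dots, spaces, and brackets are stripped, these are ordinary Roman numeral years (M.DC.LXXXII. is 1682). A sketch of a parser for them, assuming we reject any string containing non-Roman characters; the class name is hypothetical and this is not existing Toolkit code:

```java
// Hypothetical helper (not MST code): parses Roman numeral years as
// they appear in 260 $c, e.g. "M. DC. LXXXII." or "M.DC.LXXXIV.]".
public final class RomanYear {
    private static int value(char c) {
        switch (c) {
            case 'M': return 1000;
            case 'D': return 500;
            case 'C': return 100;
            case 'L': return 50;
            case 'X': return 10;
            case 'V': return 5;
            case 'I': return 1;
            default:  return -1; // not a Roman digit
        }
    }

    // Returns the year, or -1 if the string contains anything other
    // than Roman digits once punctuation, spaces, and brackets go.
    public static int parse(String raw) {
        String s = raw.toUpperCase().replaceAll("[.,\\s\\[\\]]", "");
        if (s.isEmpty()) return -1;
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            int v = value(s.charAt(i));
            if (v < 0) return -1;
            // Subtractive notation: smaller digit before larger (IV, IX, ...)
            if (i + 1 < s.length() && value(s.charAt(i + 1)) > v) {
                total -= v;
            } else {
                total += v;
            }
        }
        return total;
    }
}
```
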

patrickzurek commented 7 years ago

JIRA Comment by user: jbowen JIRA Timestamp: 2012-02-28 09:39 PM

Comment body:

I've been looking into some of the weird things that show up on this list, and forgot that there is a normalization step that pulls date ranges out of the 008 field - that's where the "uuuu" fields are coming from! Peter, are those date ranges (when they work correctly) useful to you? This would end up being a field that has a range separated by a hyphen, e.g. "1910-1955".

I think we could get rid of a lot of the garbage by redefining the Normalization step to substitute spaces when it encounters "uuuu" - it is already doing that when it encounters "9999". And, also to not create a field when there are no numeric values at all in the field - this would prevent both "uuuu-" and "-uuuu". However, we would still get cases like "1994-" and "-1944" - i.e. where the beginning or ending date of a range is unknown, it would still create the field, which is what we want. How does this sound? We could assign this to Chris with the other norm changes.

Peter, are you already doing something to replace "u" characters with spaces when you encounter them within a year, e.g. "195u-"? The normalization step doesn't deal with this either, although it could…

Jennifer
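The rules Jennifer proposes above could be sketched roughly as follows. This is a hypothetical helper under assumed names, not the actual Normalization service:

```java
// Sketch (not MST code) of the proposed rule: treat "uuuu" like "9999"
// (unknown), and emit no date field at all when neither end of the
// 008 range contains a digit.
public final class DateRangeNormalizer {
    private static String cleanEnd(String end) {
        if (end == null) return "";
        String e = end.trim();
        if (e.equals("uuuu") || e.equals("9999")) return "";
        return e;
    }

    // Returns e.g. "1910-1955", "1994-", "-1944", or null when the
    // whole range is unknown ("uuuu-uuuu", "uuuu-", "-uuuu", ...).
    public static String normalize(String begin, String end) {
        String b = cleanEnd(begin);
        String e = cleanEnd(end);
        boolean hasDigit = (b + e).chars().anyMatch(Character::isDigit);
        if (!hasDigit) return null;
        return b + "-" + e;
    }
}
```

Open-ended ranges like "1994-" and "-1944" still come through, which is the behavior the comment asks to keep.
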

patrickzurek commented 7 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2012-02-29 12:51 PM

Comment body:

I really would like this conversation that may lead to assigned work to be tracked in an issue instead of threads in email that can get lost.

From: Péter Király [mailto:kirunews@gmail.com] Sent: Tuesday, February 28, 2012 5:09 PM To: Bowen, Jennifer Cc: Cook, Randall Subject: Re: Data errors in T0

Right now the date handling function is really simple. I do not handle the "-1943" or "1885-" types of dates correctly; they simply become "1943" and "1885". Since there is a lot of variation even in the normalized dates, I try to extract the numbers and apply the assumption that there are dates of the 19?? type, but none like ??76. So I create 1900 from 19?? and 1990 from 199? (the question mark can be a real question mark, 'u', hyphen, period, etc.)
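Péter's heuristic as described, where unknown trailing positions in a year read as zero, might look like this (hypothetical helper under assumed names, not the actual code):

```java
// Sketch (not MST code) of the described heuristic: placeholder
// characters ('u', '?', '-', '.') inside a year token read as zero,
// on the stated assumption that dates look like 19??, never ??76.
public final class FuzzyYear {
    public static int resolve(String yearToken) {
        // Replace each placeholder with '0', then read the result.
        String y = yearToken.replaceAll("[u?.\\-]", "0");
        return Integer.parseInt(y);
    }
}
```

So "19uu" resolves to 1900 and "199?" to 1990, exactly the behavior the comment describes.
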

Once we had a chat about filling in the internal values of date ranges, so that 1992-1995 would become 1992, 1993, 1994, and 1995. There is a similar kind of solution in the CDL work you reported. I guess we should first analyze the different kinds of existing date formats and create a precise specification, because this topic could become very complex unless we keep it very simple and basic.
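The range-filling idea mentioned above could be sketched as follows (hypothetical helper, not the actual code, assuming a well-formed "begin-end" input):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not MST code): expand a year range like "1992-1995" into
// every year it contains, so each year is individually searchable.
public final class YearRangeExpander {
    public static List<Integer> expand(String range) {
        String[] parts = range.split("-");
        int begin = Integer.parseInt(parts[0].trim());
        int end = Integer.parseInt(parts[1].trim());
        List<Integer> years = new ArrayList<>();
        for (int y = begin; y <= end; y++) {
            years.add(y);
        }
        return years;
    }
}
```
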

To estimate the problem, we can start with some typical date types:

For the problematic points, I guess it would be fruitful to attach some use cases, describing different search situations.

So the conclusion:

Péter

patrickzurek commented 7 years ago

JIRA Comment by user: jbowen JIRA Timestamp: 2012-02-29 02:09 PM

Comment body:

Peter, thanks for this explanation - you've laid out the situation very clearly. At some point when we are really ready to tackle this issue, we should pull all of this information about what happens now, and what we would get from the CDL algorithms, into a single document. The question I have now is: is it doing us any good to have the beginning and ending dates of a date range as part of our date facets? Or conversely, is having those beginning and ending dates causing any problems? If we aren't seeing a benefit, or it is causing confusing results, then we could simply turn off the normalization step that creates those dates, and we would be left with the date info from the MARC 260 $c only. The 260 $c may repeat that range of dates, or provide a more complex string of full text that includes dates (which you are already dealing with successfully), or it might not include anything at all. So we might miss out on some dates that way.

If we decide to KEEP mapping the beginning and ending range dates from the 008, then we need to get rid of those "uuuu-" and "-uuuu" strings, so will need to make a small change to the Normalization step.

I suspect that the best plan of attack is this, for the short term:

  1. Continue mapping the beginning and ending dates for the ranges (with the step turned on), and monitor the date facets, do usability testing, etc., to investigate whether there is some downside to what we are doing now.
  2. Make the change to the Norm step to change the "uuuu" to blanks, so that we don't create dc:issued fields for those at all.
  3. Continue with the Voyager date cleanup of other stuff in Peter's log file.
  4. Move any other investigation out to the future.

What do you think?

patrickzurek commented 7 years ago

JIRA Comment by user: rcook JIRA Timestamp: 2012-05-01 05:28 PM

Comment body:

Jennifer, checking that you have the uuuu steps documented elsewhere; then I am going to move this out of Undecided and into Deferred (if there is even work we decide to do here).