Lookup between two data extracts

EstherArens commented 8 years ago

I need to insert a title identifier into an extract of order lines from a library system. An extract of order line data gives me an order ID as unique identifier, and for each order ID an additional extract can give me the title key. What is the best way to do this?

ostephens commented 8 years ago

From what we talked about in the session I understand the data you start with looks something like:

Order ID: 123456
Order number: PO03453
Order date: 2015-11-01

and

Order ID: 123456
Title ID: 98765

Is that about right?

EstherArens commented 8 years ago

Thanks for picking this up again, Owen. Really appreciated ☺

The other file (keys2insert.txt) has one line per ‘record’, e.g. 3540638172 |1|DWL|9899|PO-95|

The common identifier in both is the order ID (here “PO-95”). Using it, I need to take the values of the first two columns of keys2insert.txt and insert them into the matching record in orders2migrate.txt, preceded by .CATALOG#. |a and .ORDERLINE_KEY. |a respectively.

Caveat: Not all orders (records in the first file) still have title keys & order lines (rows in the second file) – these are mostly historic orders that could never be supplied and for which the title has since been removed.

Although I’ve used OpenRefine today (already useful to just look what is in a data set ☺), I’ve not tried it for this – any hints gratefully received. Thanks in advance, Esther

ostephens commented 8 years ago

@EstherArens probably a few approaches, but the first one that springs to mind:

1. Import both files into OpenRefine
   - For orders2migrate.txt do this as a 'line based file' - keep all information in a single cell
   - For keys2insert.txt bring in as delimited file using | as delimiter
2. In the orders2migrate project, use 'Add new column based on this column' to add a new column that contains just the 'data' part of each line - e.g. you might do this with something like value.match(/\.[A-Z_]+\.\s*(\|[a-z0-9])(.*)/)[1]
3. Filter the rows showing to only those with ".ORDR_ID." in first col. The second col should contain the order ID in each row
4. Do a 'cross' using the data in the new col (order ID) with the keys2insert project to retrieve the two columns
5. Add this onto the cell starting .ORDR_ID. with a separator not used elsewhere in these rows (i.e. not pipe - I use ~ often) with the necessary labels:
   - e.g. value+"~.CATALOG#. |a"+cells["Content"].cross("keys2insert txt","Column 5")[0].cells["Column 1"].value+"~.ORDERLINE_KEY. |a"+cells["Content"].cross("keys2insert txt","Column 5")[0].cells["Column 2"].value
6. Use 'split multi-valued cells' in the first column - using the separator you've used in the above expression

At this point, you should have a line based file with the two new data pieces added. Remove the additional column you added, then export as some kind of delimited file (no delimiter will be used as only one col)
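
For comparison, the same lookup can be sketched outside OpenRefine in Python. This is only a rough sketch under assumptions: the layout of orders2migrate.txt is not shown in this thread, so the ".TAG. |aVALUE" line shape is inferred from the regular expression above, and the function names build_key_lookup and insert_keys are made up for illustration. Orders with no matching row in keys2insert.txt are left unchanged, which covers the historic orders mentioned earlier.

```python
import re

# Assumed line shape in orders2migrate.txt: ".TAG. |aVALUE".
# Inferred from the regex in the steps above; the real extract is not shown here.
TAG_LINE = re.compile(r"^\.([A-Z_#]+)\.\s*\|[a-z0-9](.*)$")


def build_key_lookup(key_lines):
    """Map order ID (5th pipe-delimited field) to (title key, order line key)."""
    lookup = {}
    for line in key_lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5 or not fields[4]:
            continue  # skip blank or malformed rows
        lookup[fields[4]] = (fields[0], fields[1])
    return lookup


def insert_keys(order_lines, lookup):
    """Copy the order lines, adding two labelled lines after each matched .ORDR_ID. line."""
    out = []
    for line in order_lines:
        out.append(line)
        m = TAG_LINE.match(line)
        if m and m.group(1) == "ORDR_ID":
            keys = lookup.get(m.group(2).strip())
            if keys is None:
                continue  # no title key for this order: leave the record as it is
            out.append(".CATALOG#. |a" + keys[0])
            out.append(".ORDERLINE_KEY. |a" + keys[1])
    return out


if __name__ == "__main__":
    keys = ["3540638172 |1|DWL|9899|PO-95|"]          # sample row quoted in this thread
    orders = [".ORDR_ID. |aPO-95", ".ORDR_ID. |aPO-96"]  # hypothetical order lines
    for row in insert_keys(orders, build_key_lookup(keys)):
        print(row)
```

Building the dictionary once and looking up each order ID plays the same role as the 'cross' in step 4: the key file is read a single time, and unmatched orders simply fall through.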

Does that make sense?
