Combine CSV files, remove duplicates

libchris commented 8 years ago

Have two CSV files, never borrowed and never browsed. want to create a file that shows those books that are never borrowed or browsed. So ... Presumably I can combine using functions shown in wk2 ... Then use open refine to find the duplicates. Then a little unsure how I then remove one of the duplilcates, so all titles are shown, but one of duplicates is removed

ostephens commented 8 years ago

@libchris I think you've got the right approach here - but first a question - is there a common value in both files (e.g. a system number, or even just the title?)

libchris commented 8 years ago

If true duplicates should have same accession number which is recorded in the files

libchris commented 8 years ago

Safer than title, as potentially two unique books could have same title

ostephens commented 8 years ago

Thanks @libchris.

You could take the approach you describe - combine the two files outside OpenRefine, import the combined file, sort by the Accession number column and (this is important for later) 'Reorder permanently' (an option that you access from the 'Sort' drop down menu which will display once you have applied a sort).

You could do a 'Facet by Duplicate' on the Accession number column - this will give you all the lines in the file that are repeated in the file. You then have a few options:

If the Facet by Duplicate gives you what you want, you are done and no more work to do!
If the rows are complete duplicates another option at this point is to export the duplicate rows, and then use the 'uniq' command in a Unix command line (see lesson 2 https://github.com/LibraryCarpentry/week-two-library-carpentry/blob/07d26b14dd12c03cfd71aa20bc582c5a3fe6d1d6/lesson-materials/LibCarp-lesson-two-handout.md)
If the rows contain complementary data (i.e. there is data in one row, that isn't in the other) you've got more work to do:

To combine data from multiple rows which have the same accession number into a single row you need to convert your rows into 'records' (which allows you to have multiple rows linked as a 'record'). The way the 'records' mode works is slightly odd, but basically it relies on the first column having the 'key' to the record in it, with the first row of the record containing the key value, and the next X rows (that belong to the same record) being blank. This is probably easier to see in practice than through words:

In the Duplicates facet view, filter down to duplicates and 'Star' (or Flag) all rows (under the All menu), then remove the facet.
Move the Accession number column so it is the first column in the project
On the Accession column dropdown find 'Edit cells->Blank down' - this will remove any duplicate accession numbers and bring together the rows into a single 'record'
Switch to Record view
Then, on each of the other columns use the 'Edit cells->Join multi-valued cells' option to merge together the data that appears across rows in a single record
When you'd done this for all columns, all the data from your 'records' should be only in the top row of the record.
Switch back to Row mode, do a 'Facet by blank' on the Accession Number col, filter down to the blank ones, and remove these rows (under the 'All' menu at the head of the first col)
Finally do a 'Facet by Star' - this will display the rows that you previously found by the 'Duplicates' facet

This is clearly a bit complicated, but it is fairly mechanical and once you get used to the records/rows modes in OpenRefine relatively straightforward - there is a good introduction to this at http://kb.refinepro.com/2012/03/difference-between-record-and-row.html

LibraryCarpentry / week-four-library-carpentry--DEPRECATED

Combine CSV files, remove duplicates #11