LibraryCarpentry / week-four-library-carpentry--DEPRECATED

Week Four lesson
http://librarycarpentry.github.io/city-november-2015/
8 stars 6 forks source link

Combine CSV files, remove duplicates #11

Open libchris opened 8 years ago

libchris commented 8 years ago

Have two CSV files, never borrowed and never browsed. want to create a file that shows those books that are never borrowed or browsed. So ... Presumably I can combine using functions shown in wk2 ... Then use open refine to find the duplicates. Then a little unsure how I then remove one of the duplilcates, so all titles are shown, but one of duplicates is removed

ostephens commented 8 years ago

@libchris I think you've got the right approach here - but first a question - is there a common value in both files (e.g. a system number, or even just the title?)

libchris commented 8 years ago

If true duplicates should have same accession number which is recorded in the files

libchris commented 8 years ago

Safer than title, as potentially two unique books could have same title

ostephens commented 8 years ago

Thanks @libchris.

You could take the approach you describe - combine the two files outside OpenRefine, import the combined file, sort by the Accession number column and (this is important for later) 'Reorder permanently' (an option that you access from the 'Sort' drop down menu which will display once you have applied a sort).

You could do a 'Facet by Duplicate' on the Accession number column - this will give you all the lines in the file that are repeated in the file. You then have a few options:

To combine data from multiple rows which have the same accession number into a single row you need to convert your rows into 'records' (which allows you to have multiple rows linked as a 'record'). The way the 'records' mode works is slightly odd, but basically it relies on the first column having the 'key' to the record in it, with the first row of the record containing the key value, and the next X rows (that belong to the same record) being blank. This is probably easier to see in practice than through words:

This is clearly a bit complicated, but it is fairly mechanical and once you get used to the records/rows modes in OpenRefine relatively straightforward - there is a good introduction to this at http://kb.refinepro.com/2012/03/difference-between-record-and-row.html