LibraryCarpentry / lc-open-refine

Library Carpentry: OpenRefine
https://librarycarpentry.org/lc-open-refine/
Other

Improve wording describing what happens when you switch from Row to Record view in OpenRefine #329

Open ostephens opened 9 months ago

ostephens commented 9 months ago

How could the content be improved?

The wording under the images illustrating the difference between row and record layout doesn't currently make complete sense (as I read it). It says:

Note in the images above the difference between: Rows with the same Title appear below each shared title, interrupted the numbered sequence in the third column from the left. Shared titles have the same shading, which may be very difficult to distinguish visually, so look for each star and flag in the leftmost columns, which indicates a new row, that is an item with a different author.

I think this needs re-writing as I can't currently understand what it means. I think it needs to be more clearly linked to the description of what a Row is vs what a Record is in OpenRefine so that this is much clearer overall

Which part of the content does your suggestion apply to?

https://librarycarpentry.org/lc-open-refine/03-working-with-data.html#rows-and-records

jas58 commented 9 months ago

Sure, I'm a bit confused, too. The opening paragraph reads: "OpenRefine has two modes of viewing data: ‘Rows’ and ‘Records’. At the moment we are in Rows mode, where each row represents a single record in the data set - in this case, an article. In Records mode, OpenRefine can link together multiple rows as belonging to the same Record. Rows will be assigned to Records based on the values in the first column. "

The second sentence tells me a row is a record. (!)

Then, a record is a record to which many rows may be assigned.

Perhaps pull from the documentation: "A row is a series of cells, related horizontally."
[ ] Then, "When a cell has many values (multiple authors, e.g.), we need to split them. This creates a record, that is, multiple related rows."
[ ] Or then, "In OpenRefine, we can switch to "Records" view to split those overstuffed cells' values into separate cells of unique information, while the shared information remains constant."

Sharing the Show : Actor : Roles example might be wise, or else a FRBR-style version (Work : Version : Manifestation), such as Othello : film/play/TV series : directors or date.

The key seems to be that Records view allows some additional subgrouping (filtering?), but only a little, so splitting the cells to get Tidy Data would be better practice. There's also a risk in Records mode of deleting data when removing an empty cell's whole row (that is, a row in both the visual and the OpenRefine sense?).

Does part of this go under Transformation (later)?

ostephens commented 9 months ago

The second sentence tells me a row is a record. (!)

I definitely see what you mean! But ... essentially this is true. To be clear: if we have a data set formatted like:

| Article | Author 1 | Author 2 |
| --- | --- | --- |
| The Fisher Thermodynamics of Quasi-Probabilities | Flavia Pennini | Angelo Plastino |
| Aflatoxin Contamination of the Milk Supply | Naveed Aslam | Peter C. Wynn |

Then each row represents a single article metadata record - these are two separate articles being described, one row each. The downside is we have the author information split across multiple columns, and if we encounter an article with 3 (or 4 or 5 etc.) authors, we'll need to add a new column for every additional author.

However, if we lay out the same data like this:

| Article | Authors |
| --- | --- |
| The Fisher Thermodynamics of Quasi-Probabilities | Flavia Pennini |
| | Angelo Plastino |
| Aflatoxin Contamination of the Milk Supply | Naveed Aslam |
| | Peter C. Wynn |

Now each article metadata record takes up multiple rows. It's 2 rows each here, but if we had an article with more authors we'd just add the extra rows for that particular article's metadata, and it keeps all the author data in a single column.
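To make the two layouts concrete, here is a minimal Python sketch of the reshaping described above. It is only an illustration and not something OpenRefine runs; the function name `wide_to_multirow` and its parameters are hypothetical, and the data is the two-article example from this comment.

```python
# Wide layout: one row per article, one column per author.
wide_rows = [
    {"Article": "The Fisher Thermodynamics of Quasi-Probabilities",
     "Author 1": "Flavia Pennini", "Author 2": "Angelo Plastino"},
    {"Article": "Aflatoxin Contamination of the Milk Supply",
     "Author 1": "Naveed Aslam", "Author 2": "Peter C. Wynn"},
]


def wide_to_multirow(rows, key="Article", author_prefix="Author "):
    """Hypothetical helper: turn 'Author N' columns into extra rows.

    The article title is kept only on the first row of each group and
    left blank on the following rows, as in the second table above.
    """
    out = []
    for row in rows:
        # Collect every non-empty value from the 'Author N' columns.
        authors = [v for k, v in row.items()
                   if k.startswith(author_prefix) and v]
        if not authors:
            # Keep an article with no authors as a single row.
            out.append({key: row[key], "Authors": ""})
            continue
        for i, author in enumerate(authors):
            out.append({key: row[key] if i == 0 else "", "Authors": author})
    return out


multirow = wide_to_multirow(wide_rows)
for r in multirow:
    print(repr(r["Article"]), "|", r["Authors"])
```

An article with a third or fourth author would simply produce a third or fourth row, with no new column needed.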

However, in a spreadsheet (and in OpenRefine Rows mode), while the second format makes sense to our eyes (maybe), the software has no idea that the two (or more) rows are connected. An operation like a sort on the author column would therefore reorder the rows with no regard for keeping together each group of rows that represents a single article metadata record.
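A small Python sketch of that problem, assuming each row is just a `(title, author)` pair and treating every row independently, the way a plain spreadsheet sort would:

```python
# Multi-row layout: a blank title means "same article as the row above".
rows = [
    ("The Fisher Thermodynamics of Quasi-Probabilities", "Flavia Pennini"),
    ("", "Angelo Plastino"),
    ("Aflatoxin Contamination of the Milk Supply", "Naveed Aslam"),
    ("", "Peter C. Wynn"),
]

# Sort on the author column, with no knowledge of which rows belong together.
sorted_rows = sorted(rows, key=lambda row: row[1])

for title, author in sorted_rows:
    print(repr(title), "|", author)
```

After the sort, Angelo Plastino's row is first and has no title above it, so the link to its article is lost.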

This is where OpenRefine Records mode comes in. When you switch to Records mode, OpenRefine interprets these multiple rows as still being part of the same single record, and so keeps them together at all times. This way you get the advantages of the simpler layout, with all the author data in a single column, without losing the ability to keep all the data for a single article together.
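The grouping idea can be sketched in Python as follows. This is an illustration of the concept only, not OpenRefine's actual implementation: the helper `group_into_records` is hypothetical, and it assumes a record starts at every row with a non-blank first column. Sorting by title here is just an example of an operation applied to whole records.

```python
rows = [
    ("The Fisher Thermodynamics of Quasi-Probabilities", "Flavia Pennini"),
    ("", "Angelo Plastino"),
    ("Aflatoxin Contamination of the Milk Supply", "Naveed Aslam"),
    ("", "Peter C. Wynn"),
]


def group_into_records(rows):
    """Hypothetical helper: start a new record at each non-blank title."""
    records = []
    for title, author in rows:
        if title or not records:
            records.append([])            # a non-blank first column opens a record
        records[-1].append((title, author))  # blank titles join the open record
    return records


records = group_into_records(rows)

# Sort whole records (here by the title on each record's first row),
# then flatten back to rows: every group of rows stays together.
sorted_records = sorted(records, key=lambda record: record[0][0])
flattened = [row for record in sorted_records for row in record]

for title, author in flattened:
    print(repr(title), "|", author)
```

Because the unit being moved is the record rather than the row, each author row still sits directly under its own article after the sort.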

NB it's not just sort that's affected. We can manipulate the Record in a variety of ways, but sort is a simple example of why it's important that OpenRefine treats the group of rows as a single record.

Does that make sense?

jas58 commented 9 months ago

Your example helps me put on computer lenses instead of human goggles, which is useful to help me remember I don't need to understand it, but the computer does (harking back to tidy data). Should we format that into the paragraph?



ostephens commented 9 months ago

@jas58 I'll make a PR based on what I've written above - I think I can probably still make some improvements! Once I've got a PR ready then I'll ask you to review and you can check it both makes sense and is helpful!

jas58 commented 8 months ago

With your closed PR, does that mean this issue is also closed? @ostephens If not, which element should I edit into the final checkbox? This seems related to issue 264 about rows expanding?

ostephens commented 7 months ago

Discussed in call on 9th Feb. Use a Tidy Data inspired example to show how e.g. a single title with multiple authors would work as tidy data (repeating the title where necessary), and then how OpenRefine can use empty cells in the title column to group the rows as a record, essentially replicating the tidy data approach (somewhat) without repeating the title.
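As a rough sketch of the relationship between the two layouts, the Python below converts a title column between the tidy form (title repeated on every row) and the record form (title only on the first row of each group). The functions `fill_down` and `blank_down` are hypothetical helpers, loosely modelled on the idea behind OpenRefine's fill down and blank down operations rather than on their actual code.

```python
def fill_down(values):
    """Replace each blank cell with the nearest non-blank value above it."""
    filled, last = [], ""
    for value in values:
        if value != "":
            last = value
        filled.append(last)
    return filled


def blank_down(values):
    """Blank out each cell that repeats the value in the row directly above."""
    blanked, previous = [], None
    for value in values:
        blanked.append("" if value == previous else value)
        previous = value
    return blanked


# Record-style title column: blanks mean "same title as above".
record_titles = [
    "The Fisher Thermodynamics of Quasi-Probabilities", "",
    "Aflatoxin Contamination of the Milk Supply", "",
]

tidy_titles = fill_down(record_titles)   # title repeated on every row
round_trip = blank_down(tidy_titles)     # back to the record-style column

print(tidy_titles)
print(round_trip)
```

The round trip shows that the blank cells carry the same information as the repeated titles, as long as the rows stay in order.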

Potentially an exercise using blank down/fill down could be added, but there is a concern that this will overload the learners early in the session.

jas58 commented 7 months ago

Starting notes to open later: In rows mode, each row is computed independently. So when we sort by column each row is sorted independently. Think of MARC or bib records

Whereas in Records mode, sometimes it does not look different.

Need to create a library-specific tidy data demo graphic (well-sorted and jumbled tidy data of a MARC record, multi-author or multi-subject): Ti, Au, but subj or no? Have empty fields. Make 2 or 3 tables; label each table view Records vs Rows mode.

how to say record, row in data vs OpenRefine " the word line?" "horizontal group?"

And the nice part is, you haven't ruined the original

instructor note: if you catch yourself saying row when you mean record, please stop and restart the whole sentence, because a quick switch is gobsmackingly confusing to the new learner

How to format a table in markdown: https://carpentries.github.io/sandpaper-docs/episodes.html#tables