NationalLibraryOfNorway / metametrics

Evaluation metrics for automatic metadata extraction

ISBN and ISSN types/tags (printed vs. electronic) #1

Open osma opened 11 months ago

osma commented 11 months ago

Currently ISBNs (and later in the document also ISSNs) are defined as a set of (ID, tag) pairs: https://github.com/NationalLibraryOfNorway/metametrics/blob/e5034285d3ceb4baae5de8ca425b35319599e71b/categories.md?plain=1#L13

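I'm wondering if it would be simpler to think of these instead as distinct fields. So a publication could have these four fields:

  * p-ISBN (printed ISBN)
  * e-ISBN (electronic ISBN)
  * p-ISSN (printed ISSN)
  * e-ISSN (electronic ISSN)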

Each of these would just be a single value; comparisons and error categories would then be much simpler to define.
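To make the "distinct fields" idea concrete, here is a minimal Python sketch. The class name `Identifiers`, the field names and the helper `field_matches` are all hypothetical (nothing in the repository defines them), and the ISBN values are placeholders.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Identifiers:
    """One optional value per identifier type (hypothetical field names)."""
    p_isbn: Optional[str] = None
    e_isbn: Optional[str] = None
    p_issn: Optional[str] = None
    e_issn: Optional[str] = None


def field_matches(result: Identifiers, dataset: Identifiers) -> dict[str, bool]:
    """Compare each field independently with plain equality."""
    return {
        f.name: getattr(result, f.name) == getattr(dataset, f.name)
        for f in fields(Identifiers)
    }


# Placeholder values: the e-ISBN is extracted correctly, the p-ISBN is not.
dataset = Identifiers(p_isbn="2222", e_isbn="1111")
result = Identifiers(p_isbn="3333", e_isbn="1111")
matches = field_matches(result, dataset)
```

With single-valued fields, an empty field on both sides simply compares equal, so no set logic is needed.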

There are some exceptional publications like this one that have more than one of the same type of identifier (this was jointly published by two universities that both gave it printed and electronic ISBNs, so it has four ISBNs!), but that's so rare that I'm not sure it's worth taking it into account.

pierrebeauguitte commented 11 months ago

That's a good point. Considering sets is only useful for fields of variable length (in the case of Meteor, that's only "authors").

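That being said, it might still make sense to consider both fields (e- and p- versions) jointly, for example:

in dataset: e-ISBN = 1111, p-ISBN = 2222

  * result (1): e-ISBN = 2222, p-ISBN = 1111
  * result (2): e-ISBN = 2222, p-ISBN = 3333
  * result (3): e-ISBN = 1111, p-ISBN = 3333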

Intuitively, I would consider result (1) better than result (2), in the sense that both IDs are correct and only the tags are wrong. From a pragmatic viewpoint (I'm thinking of our semi-automated cataloguing practice, which uses results as suggestions), it would be much faster to correct a tag, or swap values, than to enter a corrected ISBN. Also, from a retrieval perspective, it might be better to have the right value in the wrong field than to index a wrong value (a search on ISBN=3333 should not match).

And which is better between result (1) and result (3)?

In any case, a pair would be a more appropriate data structure than a set.
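As a rough sketch of what a pair-based comparison could look like in Python: the names `IdPair` and `compare_pair` are hypothetical, and the values 1111, 2222 and 3333 are the placeholder ISBNs used in this thread. The function reports both how many values sit in the right field and how many are correct regardless of field.

```python
from typing import NamedTuple, Optional


class IdPair(NamedTuple):
    """Electronic and printed identifier of one kind (hypothetical name)."""
    electronic: Optional[str]
    printed: Optional[str]


def compare_pair(result: IdPair, dataset: IdPair) -> tuple[int, int]:
    """Return (values in the right field, values found in either field)."""
    known = {v for v in dataset if v is not None}
    in_place = sum(
        1 for r, d in zip(result, dataset) if r is not None and r == d
    )
    anywhere = sum(1 for r in result if r is not None and r in known)
    return in_place, anywhere


dataset = IdPair(electronic="1111", printed="2222")
swapped = compare_pair(IdPair("2222", "1111"), dataset)    # result (1)
one_wrong = compare_pair(IdPair("2222", "3333"), dataset)  # result (2)
one_right = compare_pair(IdPair("1111", "3333"), dataset)  # result (3)
```

Keeping the two counts separate avoids deciding up front whether a swapped pair should rank above or below a pair with one correct and one wrong value.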

osma commented 11 months ago

I see what you mean. For my current analysis, I have defined these categories for the e-ISSN field:

  1. Correct answer: Identical e-ISSN value in both result and dataset
  2. Correct answer: Empty result, publication has no e-ISSN according to dataset
  3. Correct(?) answer: Result is the correct p-ISSN, but according to the dataset, the publication has no e-ISSN, so it's the best we've got
  4. Semi-correct answer: Result is the correct p-ISSN, but the publication does have an e-ISSN too which would have been a better result
  5. Wrong answer: Empty result even if the publication has an e-ISSN in the dataset
  6. Wrong answer: Result is an ISSN even if the publication does not have any ISSNs according to the dataset
  7. Wrong answer: Result is an ISSN that is neither the correct e-ISSN nor p-ISSN according to the dataset
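The seven categories above can be written as a small decision function. This is only a sketch of the logic as listed, not the code used for the analysis; the function name `categorize_e_issn` and its parameters are hypothetical, and `None` stands for an empty field.

```python
from typing import Optional


def categorize_e_issn(
    result: Optional[str],
    dataset_e: Optional[str],
    dataset_p: Optional[str],
) -> int:
    """Return the category number (1-7) for an extracted e-ISSN."""
    if result is None:
        # Empty result: right only if the dataset has no e-ISSN either.
        return 2 if dataset_e is None else 5
    if dataset_e is None and dataset_p is None:
        # An ISSN was returned although the dataset has none at all.
        return 6
    if result == dataset_e:
        return 1
    if result == dataset_p:
        # Correct p-ISSN in the e-ISSN field: category 3 if no e-ISSN
        # exists in the dataset, category 4 if one does.
        return 3 if dataset_e is None else 4
    # Some other ISSN that matches neither dataset value.
    return 7
```

For example, `categorize_e_issn("2222", "1111", "2222")` falls into category 4, because the result is the p-ISSN while the dataset also has an e-ISSN.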

Here categories 1-2 are clearly right while 5-7 are clearly wrong. Categories 3 and 4 are somewhat debatable, but I think 3 is at least not worse than 4. The situation in category 3 happens quite often: organizations forget to apply for an electronic ISSN for their series and just keep using their old p-ISSN even for online publications.

We can define the same kind of categories for p-ISSN, just swapping between the e-ISSN and p-ISSN types in the definitions. And the same logic can be used for ISBN types as well.

Looking at your example:

in dataset: e-ISBN = 1111, p-ISBN = 2222

result (1): e-ISBN = 2222, p-ISBN = 1111

  * e-ISBN: category 4 (semi-correct)
  * p-ISBN: category 4 (semi-correct)

result (2): e-ISBN = 2222, p-ISBN = 3333

  * e-ISBN: category 4 (semi-correct)
  * p-ISBN: category 7 (wrong)

result (3): e-ISBN = 1111, p-ISBN = 3333

  * e-ISBN: category 1 (correct)
  * p-ISBN: category 7 (wrong)

It's not possible to say whether result (1) is better than result (3); it depends on what kind of scores are assigned to the semi-correct categories. Trying to come up with a single-number metric is always a simplification of reality, so I think it's good to be able to define and look at qualitative categories as well, and maybe compare their proportional shares.
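A small Python sketch shows how the ranking flips with the score chosen for the semi-correct category. The weights are assumptions made purely for illustration (in particular, giving category 3 a full score is a choice, not something settled here), and `score` and `category_shares` are hypothetical helper names.

```python
from collections import Counter

# Categories per field for the three results: (e-ISBN, p-ISBN).
RESULTS = {
    "result (1)": (4, 4),
    "result (2)": (4, 7),
    "result (3)": (1, 7),
}


def score(categories, semi_correct_weight):
    """Sum per-field scores; only the weight of category 4 is varied."""
    # Assumed weights: categories 1-3 count fully, 5-7 count as zero.
    weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: semi_correct_weight,
               5: 0.0, 6: 0.0, 7: 0.0}
    return sum(weights[c] for c in categories)


def category_shares(all_categories):
    """Proportional share of each category over a list of category numbers."""
    counts = Counter(all_categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in sorted(counts.items())}


for w in (0.25, 0.5, 0.75):
    print(w, {name: score(cats, w) for name, cats in RESULTS.items()})

shares = category_shares([c for cats in RESULTS.values() for c in cats])
```

With a weight of 0.25 result (3) scores higher than result (1), at 0.5 they tie, and at 0.75 result (1) comes out ahead, while the category shares stay the same whichever weight is picked.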