NatLibFi / FinGreyLit

Data set of Finnish grey literature, containing curated Dublin Core style metadata and links to original PDF publications

Reorganize document categories (five categories based on type) #5

Closed (osma closed this 4 months ago)

osma commented 4 months ago

This PR reorganizes the document categories. Previously, the documents were split into four categories: thes (theses), docthes (doctoral theses), serial (serial publications) and mono (monographs). In the new categorization, there are five categories: thes and docthes (as before), report, book and article.

The old categorization was unclear, and some documents were not easy to fit into a single category. For example, a monograph could also be published in a series. The three new categories (report, book and article) are all based on COAR resource types; while these are not perfect either, at least we don't have to define those classes ourselves. For example, the book category contains documents with the COAR resource types book and book part, while article contains documents with COAR resource types such as journal article, newspaper article, research article, conference paper and blog post.
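To illustrate the idea, the mapping from COAR resource types to categories could be sketched as a simple lookup. This is not the code used in the conversion notebooks; the names `COAR_TO_CATEGORY` and `categorize` are hypothetical, and only the resource types named above are included, since the types behind report, thes and docthes are not listed in this description.

```python
# Hypothetical lookup table: COAR resource type label -> new category.
# Only the types mentioned in this PR description are included; the
# mappings for report, thes and docthes are not spelled out there.
COAR_TO_CATEGORY = {
    "book": "book",
    "book part": "book",
    "journal article": "article",
    "newspaper article": "article",
    "research article": "article",
    "conference paper": "article",
    "blog post": "article",
}


def categorize(coar_type):
    """Return the category for a COAR type label, or None if it is not listed."""
    return COAR_TO_CATEGORY.get(coar_type.strip().lower())
```

Returning `None` for an unlisted type keeps unknown values visible instead of silently assigning them to a default category.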

The categorization is implemented in a new Google Sheets document which is used as the source for reading the original metadata in the conversion notebooks.

The number of documents has been increased from approximately 700 to approximately 800, because it is now possible to include some documents that did not fit the old classification.

The PR contains some additional changes:

  1. A new notebook for verifying that the metadata values are actually mentioned in the document text; currently used for ISSNs and publisher names
  2. dc.publisher metadata has been thoroughly reviewed and many changes were made, for example changing the publisher name to the form (and language) that is actually used in the publication itself. Many "implicit" publishers (those not directly mentioned in the text, e.g. the company that is known to publish a certain journal) have been marked in the Google Sheets document using [square brackets], and these are excluded when creating the JSONL files under /metadata. This makes the publisher information correspond more closely to the "ground truth" that is actually available in the document.
  3. Changes to the JSON structure: already in the files under /metadata, the actual metadata (dc.* fields) is placed under the ground_truth sub-object. Previously this was only done in later stages.
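The three changes above can be sketched roughly as follows. This is a minimal illustration and not the notebook code: the row layout (a dict mapping column names to lists of values), the helper names, and the rule that an implicit publisher is a value wrapped entirely in square brackets are all assumptions made for the example.

```python
import re

# Assumption: an "implicit" publisher is a value wrapped entirely in [brackets].
IMPLICIT = re.compile(r"\[.*\]")


def explicit_publishers(values):
    """Drop publisher values that are marked as implicit with square brackets."""
    return [v for v in values if not IMPLICIT.fullmatch(v.strip())]


def to_record(row):
    """Build one record with the dc.* fields nested under ground_truth.

    `row` is a hypothetical dict of column name -> list of string values.
    """
    ground_truth = {}
    for key, values in row.items():
        if not key.startswith("dc."):
            continue
        if key == "dc.publisher":
            values = explicit_publishers(values)
        if values:  # skip fields that end up empty
            ground_truth[key] = values
    # Non-dc.* columns stay at the top level of the record.
    record = {k: v for k, v in row.items() if not k.startswith("dc.")}
    record["ground_truth"] = ground_truth
    return record


def missing_from_text(values, text):
    """Return the values that do not occur in the document text.

    Matching is case-insensitive and ignores differences in whitespace.
    """
    normalized = " ".join(text.split()).lower()
    return [v for v in values if " ".join(v.split()).lower() not in normalized]
```

Each record returned by `to_record` could then be written as one line of a JSONL file with `json.dumps`, and `missing_from_text` could be run on ISSN and publisher values to list those that never appear in the extracted text.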

As a consequence of the many changes to the data set, the previous evaluation results are no longer valid and have therefore been deleted. New evaluations have been performed using the baseline-null method, Meteor, and Axolotl fine-tuning for the previously best performing Nous-Hermes-2-Mistral-7B-DPO model. For these methods, the evaluation metrics didn't seem to change very much; the largest differences can be seen for the dc.publisher field, whose values changed a lot.