This PR reorganizes the document categories. Previously, the documents were split into four categories: thes (theses), docthes (doctoral theses), serial (serial publications) and mono (monographs). In the new categorization, there are five categories: thes and docthes (as before), report, book and article.
The old categorization was unclear, and some documents were not easy to fit into a single category. For example, a monograph could also be published in a series. The three new categories are all based on COAR resource types; while these are not perfect either, at least we don't have to define the classes ourselves. For example, the book category contains documents with the COAR resource types book and book part, while article contains documents with COAR resource types such as journal article, newspaper article, research article, conference paper and blog post.
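As an illustration of how COAR resource types map to the new categories, here is a minimal sketch. The dictionary and the `categorize` helper are hypothetical names, not the code used in the conversion notebooks, and only the resource types named above are listed; the types that feed the report category are not enumerated in this description, so they are left out.

```python
from typing import Optional

# Hypothetical lookup table; it contains only the COAR resource types
# mentioned in this PR description, not the full mapping.
COAR_TYPE_TO_CATEGORY = {
    "book": "book",
    "book part": "book",
    "journal article": "article",
    "newspaper article": "article",
    "research article": "article",
    "conference paper": "article",
    "blog post": "article",
}


def categorize(coar_type: str) -> Optional[str]:
    """Return the document category for a COAR resource type label.

    Returns None for labels that are not in the table above.
    """
    return COAR_TYPE_TO_CATEGORY.get(coar_type.strip().lower())
```

Because the category is derived from the resource type, a document no longer needs a hand-assigned class of its own.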
The categorization is implemented in a new Google Sheets document which is used as the source for reading the original metadata in the conversion notebooks.
The number of documents has increased from approximately 700 to approximately 800, since it is now possible to include some documents that did not fit the old classification.
The PR contains some additional changes:
A new notebook for verifying that the metadata values are actually mentioned in the document text; it is currently used for ISSNs and publisher names.
dc.publisher metadata has been thoroughly reviewed and many changes were made, for example changing the publisher name to the form (and language) that is actually used in the publication itself. Many "implicit" publishers (those not directly mentioned in the text, e.g. the company that is known to publish a certain journal) have been marked in the Google Sheets document using [square brackets], and these are excluded when creating the JSONL files under /metadata. This makes the publisher information correspond more closely to the "ground truth" that is actually available in the document.
Changes to the JSON structure: the actual metadata (dc.* fields) is now placed under the ground_truth sub-object already in the files under /metadata. Previously, this was only done in later stages.
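The three changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook code: the function names, the `id` key and the `dc.title` example field are hypothetical, and it assumes that an implicit publisher is a value wrapped entirely in square brackets and that publishers are stored as a list of strings.

```python
import json
import re

# Assumption: an implicit publisher is a value wrapped entirely in [square brackets].
IMPLICIT = re.compile(r"^\[.*\]$")


def explicit_publishers(values):
    """Keep only publisher names that are not marked as implicit."""
    cleaned = (v.strip() for v in values)
    return [v for v in cleaned if v and not IMPLICIT.match(v)]


def to_jsonl_line(doc_id, fields):
    """Serialize one record with the dc.* fields nested under ground_truth."""
    ground_truth = dict(fields)
    publishers = explicit_publishers(ground_truth.get("dc.publisher", []))
    if publishers:
        ground_truth["dc.publisher"] = publishers
    else:
        # Drop the field entirely when only implicit publishers were given.
        ground_truth.pop("dc.publisher", None)
    # "id" is a hypothetical key name used only for this sketch.
    return json.dumps({"id": doc_id, "ground_truth": ground_truth}, ensure_ascii=False)


def is_mentioned(value, text):
    """Case-insensitive, whitespace-normalized substring check."""
    def norm(s):
        return " ".join(s.split()).casefold()
    return norm(value) in norm(text)


line = to_jsonl_line(
    "doc1",
    {"dc.title": "Example title", "dc.publisher": ["[Implicit Oy]", "Example University"]},
)
record = json.loads(line)
```

In this sketch, `record["ground_truth"]["dc.publisher"]` contains only `"Example University"`, and `is_mentioned` shows the kind of check the verification notebook could run for an ISSN or publisher name against the extracted document text. A real check would probably need more normalization, for example for hyphenation and line breaks.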
As a consequence of the many changes to the data set, the previous evaluation results are no longer valid and have therefore been deleted. New evaluations have been performed using the baseline-null method, Meteor, and Axolotl fine-tuning for the previously best-performing Nous-Hermes-2-Mistral-7B-DPO model. For these methods, the evaluation metrics did not change very much; the largest differences are in the dc.publisher field, whose values changed a lot.