This PR changes the metadata representation. The previous style was based on Dublin Core style fields, which are commonly used in the DSpace systems from which the documents were harvested. However, those systems use many ad-hoc fields and extensions, so most fields aren't really standard Dublin Core. This felt needlessly complicated and was sometimes hard to work with. This PR therefore switches to a more domain-specific, human-readable style that doesn't refer to DC (except by sharing some terminology). Here is an example of new-style metadata:
On the one hand, this PR reduces the number of represented metadata fields from 40+ to 10. On the other hand, that is still more than the 7 Meteor-supported fields that were previously used with Meteor itself and in the LLM fine-tuning experiments. The three new ones are `p-isbn` and `p-issn`, i.e. standard identifiers for the printed version of a publication, and `type_coar`, the document type according to the COAR Resource Types classification.
In the future, the schema should be expanded to cover more fields that are still available in the Google Sheets but not currently represented in the JSONL files. However, the new metadata needs to be thoroughly checked so that it matches what is actually stated in the PDF publications, so that we have a genuine "ground truth" for those.
This PR also changes the evaluation code in `eval.py`. The new code is more field-specific and better structured, and it includes generic comparison functions such as `_compare_simple_string` and `_compare_set` that can be used for several fields. Some evaluation scores changed slightly as a result; for example, ISBNs and publishers are now compared as sets rather than by their first values only. Overall, the scores remain roughly the same as before: the NousHermes-Mistral-7B-DPO model scores around 0.9 on average and Meteor around 0.67.
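To illustrate why comparing as sets can shift scores, here is a minimal sketch. It is not the code from `eval.py`: the function names, signatures, normalization, and 0/1 scoring below are assumptions made for illustration, and the identifier strings are placeholders rather than real ISBNs.

```python
from typing import Optional, Sequence


def compare_simple_string(predicted: Optional[str], expected: Optional[str]) -> float:
    """Hypothetical single-value comparison: exact match after trimming and case-folding."""
    if predicted is None or expected is None:
        # Two missing values count as a match; one missing value does not.
        return 1.0 if predicted == expected else 0.0
    return 1.0 if predicted.strip().casefold() == expected.strip().casefold() else 0.0


def compare_first_value(predicted: Optional[Sequence[str]],
                        expected: Optional[Sequence[str]]) -> float:
    """Hypothetical first-value-only comparison: only the first entry of each list counts."""
    first_predicted = predicted[0] if predicted else None
    first_expected = expected[0] if expected else None
    return 1.0 if first_predicted == first_expected else 0.0


def compare_set(predicted: Optional[Sequence[str]],
                expected: Optional[Sequence[str]]) -> float:
    """Hypothetical set comparison: order and duplicates are ignored."""
    return 1.0 if set(predicted or []) == set(expected or []) else 0.0


# Placeholder values: same two identifiers, listed in a different order.
predicted_isbns = ["isbn-A", "isbn-B"]
expected_isbns = ["isbn-B", "isbn-A"]

first_value_score = compare_first_value(predicted_isbns, expected_isbns)
set_score = compare_set(predicted_isbns, expected_isbns)
print(first_value_score, set_score)
```

In this sketch the first-value comparison penalizes a prediction that merely lists the same values in a different order, while the set comparison does not. Conversely, a set comparison also penalizes a prediction that gets the first value right but misses a second one, so scores can move in either direction.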