INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

Make metadata section of .blf.yaml configuration file more intuitive #385

Closed: timjzee closed this issue 1 year ago

timjzee commented 1 year ago

In .blf.yaml configuration files, listing multiple individual metadata fields is currently counterintuitive. For example, using the Brown test corpus:

metadata:
  # documentPath is //TEI
  containerPath: .

  fields:
  # document id in the teiHeader
  - name: docId
    valuePath: .//idno

  # <text> tag has a decls attribute we want to index as category
  - name: category
    valuePath: .//@decls

Using this syntax, I only end up with the correct docId metadata in the indexmetadata.yaml file. If I switch the order, I only get the category metadata.

To get both docId and category, I have to do this:

metadata:
  # documentPath is //TEI
  - containerPath: .

    fields:
    # document id in the teiHeader
    - name: docId
      valuePath: .//idno

  - containerPath: .

    fields:
    # <text> tag has a decls attribute we want to index as category
    - name: category
      valuePath: .//@decls

At least to me this was confusing, because the documentation shows that a single containerPath can hold both an individual metadata field and forEachPath fields.
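For comparison, here is a rough sketch of the combined form I mean, with one named field and one forEachPath entry under a single containerPath. The `meta` element and its `name` attribute are hypothetical and do not come from the Brown corpus; the `forEachPath` / `namePath` / `valuePath` keys are what I understand the documentation to describe, so treat the exact spelling as an assumption.

```yaml
metadata:
  # documentPath is //TEI
  containerPath: .

  fields:
  # a single, explicitly named field
  - name: docId
    valuePath: .//idno

  # one field per matching element (hypothetical <meta name="..."> elements);
  # the field name is taken from namePath, the value from valuePath
  - forEachPath: .//meta
    namePath: ./@name
    valuePath: .
```

Given that this form is accepted, I expected two named fields under one containerPath to work the same way.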

jan-niestadt commented 1 year ago

This is a bug; both your examples should do the exact same thing. I think I've found and fixed the issue, see 7ff4ad2170c5c1d49e02ce7f5769a5db3ef6d437; could you retry with the dev branch?

timjzee commented 1 year ago

I checked using the dev branch. Works fine now, thanks for the quick fix!

jan-niestadt commented 1 year ago

No problem, thanks for reporting!