clamsproject / aapb-annotations

Repository to store manual annotation dataset developed for CLAMS-AAPB collaboration

repository structure in discussion #2

Closed · keighrim closed this issue 1 year ago

keighrim commented 2 years ago

An example structure we discussed last week:

.
├── scaffold.py
├── a.guidelines
│   ├── ner.md
│   └── slate.md
├── b.uploads
│   ├── other_proj_mmddyy
│   │   ├── README
│   │   ├── annotations
│   │   │   └── dump_annotatorY.ann
│   │   └── process.py
│   └── some_project_mmddyy
│       ├── README
│       ├── annotations
│       │   └── dump_annotatorX.json
│       └── process.py
└── z.golds
    ├── credits
    │   └── some_project_mmddyy
    │       ├── cpb-XXX-001.csv
    │       ├── cpb-XXX-002.csv
    │       ├── cpb-XXX-003.csv
    │       ├── cpb-XXX-004.csv
    │       └── cpb-XXX-005.csv
    ├── ner
    │   └── other_proj_mmddyy
    │       ├── cpb-YYY-000001.conllu.tsv
    │       ├── cpb-YYY-000002.conllu.tsv
    │       ├── cpb-YYY-000003.conllu.tsv
    │       ├── cpb-YYY-000004.conllu.tsv
    │       └── cpb-YYY-000005.conllu.tsv
    └── slates
        └── some_project_mmddyy
            ├── cpb-XXX-001.csv
            ├── cpb-XXX-002.csv
            ├── cpb-XXX-003.csv
            ├── cpb-XXX-004.csv
            └── cpb-XXX-005.csv

Notes in this example;

keighrim commented 2 years ago

Recent commits (up to c410d9e4f10a8de24dfb9cc98e56ca10ff912cac) pushed some "sample" files in accordance with the proposed structure.

keighrim commented 2 years ago

Initially, we thought we could automate the invocation of the process.py files using some CI tool (e.g. GitHub Actions) triggered by pushes to annotations directories. However, we later found that the timing of running process.py needs manual supervision, mostly because of the (pseudo) version control of annotation data. Namely, we want to run process.py and generate a set of gold data only when the annotation is "done", at least for a single round or phase.

  1. We don't want to publish gold data for every small update in the annotation process.
  2. Also, by limiting the frequency of gold data publication, we can use the commit histories of subproject-specific subdirectories in the golds directory as "version control" for that specific annotation subproject.

To make it easy to "anchor" the subdirectory-level commit histories, @marcverhagen suggested using a single plain text file (the proposed name is state) for each subproject, which can also be used as a change log of that subproject's gold data.
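For illustration only, a minimal sketch of how process.py could append a change-log entry to such a file whenever a gold set is regenerated (the file name state and the entry format here are assumptions, nothing is decided):

# sketch: append a change-log entry to the subproject's plain-text `state` file
# (file name and entry format are assumptions, not decided)
import datetime
import pathlib

def log_gold_update(subproject_dir: str, message: str) -> None:
    state_file = pathlib.Path(subproject_dir) / "state"
    timestamp = datetime.date.today().isoformat()
    with state_file.open("a") as state:
        state.write(f"{timestamp}\t{message}\n")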

keighrim commented 1 year ago

Revisiting this after a year, I'd like to suggest some changes to our previous decisions:

  1. top-level directories organized by project names
  2. project names are formatted as YYMMDD-some_name-(theme1-theme2-...-)data_batch_name
    1. for future implementation of a Brandeis-hosted data lake (or more like a pond, actually, given the size), I believe we need to give a "name" to a specific set of GUIDs of AAPB assets we use for an annotation.
    2. the date format MUST be YYMMDD to make it easily sortable, and it should come from the date of initiation of a project, not the start date of actual annotation (we should provide a scaffold.sh script to enforce these points; see the sketch after this list)
    3. we'll come back to what themes are, but they should be completely optional (though recommended for annotation project managers to use)
  3. under a project, we put ~uploads~ annotations(d) (renamed), golds(d), process.py, guidelines.md, README.md
    1. annotations - just a new name of uploads
      1. what we haven't thought about so far is duplicate annotations, IAA, and adjudication; this needs more discussion.
    2. process.py - good old data converter from raw annotations to gold annotations. It could include (automatic) adjudication as well in the future.
    3. README.md - should specify;
      1. project goals and overview
      2. manager names
      3. annotator names (optional, if annotators are forced to use git to upload data, their names should be automatically tracked by git)
      4. NO guideline author names (this must NOT be manually written in the readme, but automatically tracked by git history)
      5. anything else?
    4. guidelines.md - this being under project directory means that we can end up with duplicate copies (with identical content) of guidelines.md files across multiple, say, NER annotation projects. But it will give us flexibility to make small tweaks based on the characteristics of target source data. (e.g. maybe we need to add additional labels for NE for a specific subset of AAPB data)
    5. golds - when a project produces two or more "types" of gold data (e.g. slates and bars), there should be a "thematic" organization under the golds directory. Otherwise no subdirectories are needed.
  4. we, as a team, DO NOT create or maintain a registry of annotation "themes", either in the evaluation repo, the annotation repo, or the SDK. Otherwise, I think we'll end up creating another CLAMS "eval" vocab, and that will probably delay the project for weeks, if not months or years.
    1. Hence an annotation project can freely (within reason) come up with names for those themes (used in the long project-name top directory or in the golds/ subdirectories).
    2. Instead, app developers have the responsibility to search the annotation repo to find possible evaluation data and specify it (or them) in the app directory. This should be doable, as we won't have thousands of annotation datasets in this repo (not to mention that we don't have the resources to create thousands of annotation datasets).
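To make point 2.2 above concrete, here's a rough sketch of what such a scaffolding script could do (written in Python rather than shell, and every name in it is an assumption for illustration):

# sketch of a scaffolding helper for point 2.2 (names and layout are assumptions)
import datetime
import pathlib

def scaffold_project(name: str, batch_name: str, themes=()) -> pathlib.Path:
    # the date prefix comes from the project initiation date, forced into YYMMDD
    # so that project directories sort chronologically
    prefix = datetime.date.today().strftime("%y%m%d")
    project_dir = pathlib.Path("-".join([prefix, name, *themes, batch_name]))
    for subdir in ("annotations", "golds"):
        (project_dir / subdir).mkdir(parents=True, exist_ok=True)
    for fname in ("process.py", "guidelines.md", "README.md"):
        (project_dir / fname).touch()
    return project_dir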

Want to hear what people think (esp. @kelleyl). Let us know.

keighrim commented 1 year ago

The description files in #16 seem to justify projects being first-class citizens in this repo, as that information can be merged into the README.md file proposed in the above comment. If we want to make projects and golds equally first-class citizens, we might want to consider having process.py generate those description.{md,txt} files from README.md under the project.
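A rough sketch of what that generation step could look like (the paths and the "first paragraph of the README" heuristic are assumptions, only meant to illustrate the direction):

# sketch: have process.py derive a gold-level description file from the project README
# (paths and the first-paragraph heuristic are assumptions for illustration)
import pathlib

def write_description(project_dir: str, gold_dir: str) -> None:
    readme = pathlib.Path(project_dir) / "README.md"
    first_paragraph = readme.read_text().strip().split("\n\n")[0]
    (pathlib.Path(gold_dir) / "description.md").write_text(first_paragraph + "\n")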

One more thing to consider is the tie between this repo and the evaluation repo. This repo not only serves as a public data repository but should also serve as a source of gold data for in-house evaluators. That is, evaluators need to know

  1. where the gold files are (to calculate some evaluation metrics)
  2. what/where the raw source A/V assets under the gold annotation are (to obtain or request "system outputs" to evaluate)

The first is easy: it is a GitHub URL of a subdirectory under this repository. But I'd like to pass the second piece of information along with the first. To encode the second with the first, we can

  1. make golds second-class citizens under projects, so that the "batch" information of the project can be programmatically retrieved.
  2. make files under a "gold" directory named after the source A/V asset. For example, for AAPB annotations, we can use AAPB GUIDs as filenames (as exemplified in the very first tree figure).

marcverhagen commented 1 year ago

Some thoughts on this, trying to integrate the above.

The top-level has just annotations and golds. The annotations are organized around projects and the golds around annotation types.

.
├── annotations
│   ├── projectA-themeX-themeY
│   └── projectB-themeP-themeQ
└── golds
    ├── credits
    ├── ner
    └── slate

Projects have names that should be unique (in fact, I would like the first part of the name, without the themes, to be unique) and optional themes, but no start date; if we want that recorded, we put it as a metadata property inside the project. Each project consists of metadata, a readme file, a process.py script, a guidelines directory, and a list of batches.

annotations/
├── projectA-themeX-themeY
│   ├── 230510-batchname
│   │   ├── annotations
│   │   ├── metadata.yml
│   │   ├── readme.md
│   │   └── sources.txt
│   ├── 230530-batchname
│   │   ├── annotations
│   │   ├── metadata.yml
│   │   ├── readme.md
│   │   └── sources.txt
│   ├── guidelines
│   │   ├── projectA-guidelines-v1.md
│   │   └── projectA-guidelines-v2.md
│   ├── metadata.yml
│   ├── process.py
│   └── readme.md
└── projectB-themeP-themeQ
    └── readme.md

The metadata specify what annotation types are included in the project and anything else that seems relevant. Guidelines live at the project level, and each batch can refer to a specific version of the guidelines. The example has versioned guidelines explicitly in the directory tree; alternatively, batches could refer to a git commit. Large changes in guidelines, especially if new categories are introduced, should not be allowed within a project; a new project should be created instead.

Batches have their own metadata and readme file (division of labor between those two TBD), a sources file to specify the source files, and a directory with annotations. Batch names include a starting date and something useful like "newshour-2001-2001-25". Metadata includes annotator names and characterizations, guideline version, dates, and anything else we deem relevant. The readme could be a file that is presented to users browsing annotations, but that could also be in the metadata. Names of annotation files should reflect the GUIDs if possible; if that is the case then no sources.txt file is needed. If there is only one annotation file for a batch then sources.txt is certainly needed.
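As an illustration only (the metadata keys and the GUID pattern below are assumptions, not something we agreed on), a batch check along these lines could enforce the sources.txt rule:

# sketch: sanity-check a batch directory under the proposed layout
# (metadata keys and the GUID pattern are assumptions for illustration)
import pathlib
import re
import yaml  # PyYAML

GUID_PATTERN = re.compile(r"cpb-aacip-[\w-]+")

def check_batch(batch_dir: str) -> None:
    batch = pathlib.Path(batch_dir)
    metadata = yaml.safe_load((batch / "metadata.yml").read_text())
    for key in ("annotators", "guidelines_version"):
        assert key in metadata, f"missing metadata key: {key}"
    annotation_files = list((batch / "annotations").glob("*"))
    # if the annotation file names do not reflect GUIDs, a sources.txt is required
    if not all(GUID_PATTERN.match(f.stem) for f in annotation_files):
        assert (batch / "sources.txt").exists(), "sources.txt is required for this batch"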

The golds are organized around annotation types (which we need to think about because that could open a can of worms), but within those types they mirror projects and batches. Annotations live in a directory inside each batch; these are NOT the same as the annotations inside the batches of projects under the annotations top level. Metadata, readme, and sources could just be copied from the projects, possibly by process.py.

golds/
├── credits
│   ├── projectA
│   │   ├── 230510-batchname
│   │   │   ├── annotations
│   │   │   ├── metadata.yml
│   │   │   ├── readme.md
│   │   │   └── sources.txt
│   │   └── 230530-batchname
│   │       ├── annotations
│   │       ├── metadata.yml
│   │       ├── readme.md
│   │       └── sources.txt
│   └── projectB
├── ner
└── slate

Annotations and everything else in these directories are generated by the process.py files from the projects. The metadata.yml, readme.md, and sources.txt files are probably just copies, in which case we may not even want to include them, because we can trace those down from the name of the project and the name of the batch.

marcverhagen commented 1 year ago

> The golds are organized around annotation types (which we need to think about because that could open a can of worms)

Yes, there are issues with the above. It is rather unclear what to do with a project that mixes a bunch of annotations. In the past we have used an example where annotations in a project could be timeframes for a bunch of things like credits, bars, and slates. With those we assumed they could be distributed over topical areas in the golds directory, and that seems intuitive enough. But what about more complicated cases, for example a full slate annotation, which has or potentially has:

These are conceptually linked, and it is not obvious how we would distribute them over subdirectories of golds. Do we want to put entities somewhere in the ner subdir, separated from the information that the entity is in a slate? And what about the bounding boxes? Or the text?

It is worthwhile to consider Keigh's suggestion to have the projects be the main organizing principle.

marcverhagen commented 1 year ago

If we use projects as the top-organizing principle the repository could look like:

.
└── projects
    ├── projectA-themeX-themeY
    │   ├── 230510-batchname
    │   │   ├── annotations
    │   │   ├── golds
    │   │   │   ├── credit_frames
    │   │   │   └── named_entities
    │   │   ├── metadata.yml
    │   │   ├── readme.md
    │   │   └── sources.txt
    │   ├── 230530-batchname
    │   │   ├── annotations
    │   │   ├── golds
    │   │   │   └── bars_and_tone
    │   │   ├── metadata.yml
    │   │   ├── readme.md
    │   │   └── sources.txt
    │   ├── guidelines
    │   │   ├── projectA-guidelines-v1.md
    │   │   └── projectA-guidelines-v2.md
    │   ├── metadata.yml
    │   ├── process.py
    │   └── readme.md
    └── projectB-themeP-themeQ
        ├── metadata.yml
        └── readme.md

I am using a projects directory so that we have space to put something else at the top level without it being drowned out by all the projects. The golds are created from each batch and are local to the batch. We may still want some substructure in the golds subdir, and we probably need something about this in the metadata.

I do have some difficulty in seeing how to map pipeline evaluation to projects, but that would be an issue with the previous proposal as well.

keighrim commented 1 year ago

I'm actually against

  1. putting a gold set under a batch (should be the other way around, putting batches under a gold)
  2. adding any substructure to golds directories.

because the complex, nested structure of golds directories will make it hard for the evaluation invoker to collect the necessary information to start an evaluation pipeline.

In essence, the invoker needs to know the locations of gold files and GUIDs of the source assets (along with the evaluation script and pipeline config).

# part of evaluation invoker
from pathlib import Path
from typing import List

def run_eval(evaluator, pipeline_config, guids: List[str], golds: Path) -> Html:
    preds = run_pipeline(pipeline_config, guids)
    return evaluator(golds, preds)

If we want to perform the evaluation on a batch-by-batch basis, evaluators still only need to know the same information but by batch.

# part of evaluation invoker

def run_eval(evaluator, pipeline_config, guids_by_batches: List[List[str]], golds_by_batches: List[Path]) -> Html:
    preds_by_batches = [run_pipeline(pipeline_config, guids) for guids in guids_by_batches]
    return evaluator(golds_by_batches, preds_by_batches)

We discussed naming gold files using the AAPB-GUIDs, so in that case, the invoker doesn't even need to know the GUIDs.

# part of evaluation invoker

def run_eval(evaluator, pipeline_config, golds_by_batches: List[Path]) -> Html:
    guids_by_batches = []
    for batch in golds_by_batches:
        guids_in_batch = []
        for gold_fname in batch.glob("*"):
            guids_in_batch.append(infer_guid(gold_fname))
        guids_by_batches.append(guids_in_batch)
    ...
    preds_by_batches = [run_pipeline(pipeline_config, guids) for guids in guids_by_batches]
    return evaluator(golds_by_batches, preds_by_batches)
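infer_guid above is left undefined; assuming gold files are simply named <AAPB-GUID>.<ext> as in the trees above, a minimal sketch could be:

# minimal sketch of the undefined infer_guid helper above, assuming gold files
# are named `<AAPB-GUID>.<ext>` (e.g. cpb-aacip-xxx-a.ann, cpb-YYY-000001.conllu.tsv)
from pathlib import Path

def infer_guid(gold_fname: Path) -> str:
    # drop all extensions (e.g. ".conllu.tsv") and keep the GUID part
    return gold_fname.name.split(".")[0]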

With the addition of the ~golden retriever~ 🦮 (such as https://github.com/clamsproject/consumer-evaluation/pull/7),

# part of evaluation invoker

def retrieve_golds(gold_batch_url):
    for file_url in parse_html_and_find_href_file_objs(gold_batch_url):
        download(file_url, local_gold_tmpdir)
    return local_gold_tmpdir

def retrieve_all_gold_batches(gold_url):
    return [retrieve_golds(batch_url) for batch_url in parse_html_and_find_href_file_objs(gold_url)]

def run_eval(evaluator, pipeline_config, gold_url: str) -> Html:
    golds_by_batches = retrieve_all_gold_batches(gold_url)
    guids_by_batches = do_filename_magic(golds_by_batches)
    preds_by_batches = [run_pipeline(pipeline_config, guids) for guids in guids_by_batches]
    return evaluator(golds_by_batches, preds_by_batches)
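parse_html_and_find_href_file_objs and download above are placeholders; one possible realization, assuming the gold batch directory is listed through the GitHub contents API (this is not necessarily what the linked 🦮 PR actually does):

# one possible realization of the placeholder helpers above, assuming the gold
# batch directory is listed via the GitHub contents API (an assumption, not
# necessarily what the linked 🦮 PR does)
import json
import pathlib
import urllib.request

def list_gold_file_urls(api_dir_url: str) -> list:
    # api_dir_url would look like
    # https://api.github.com/repos/clamsproject/aapb-annotations/contents/<path-to-gold-batch>
    with urllib.request.urlopen(api_dir_url) as resp:
        entries = json.load(resp)
    return [entry["download_url"] for entry in entries if entry["type"] == "file"]

def download_gold_batch(api_dir_url: str, outdir: str = "gold_tmp") -> pathlib.Path:
    local_gold_tmpdir = pathlib.Path(outdir)
    local_gold_tmpdir.mkdir(exist_ok=True)
    for file_url in list_gold_file_urls(api_dir_url):
        urllib.request.urlretrieve(file_url, local_gold_tmpdir / file_url.rsplit("/", 1)[-1])
    return local_gold_tmpdir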

In this case, I don't see much value in adding metadata or a readme or anything like that under a gold directory. In fact, adding those would only increase the complexity of the 🦮, as it then needs to know which files to avoid downloading. So I'm all in for the simplest possible structure under a gold batch directory, to ensure a straightforward, decoupled implementation and reliable operation of the 🦮.


Next, regarding sources.txt, I believe that batches should be placed under their own hive, mainly so that we can re-use the same batches over and over for different annotation tasks (as we have already been doing with the NH20 set, https://github.com/clamsproject/wgbh-collaboration/issues/21). (To that end, I'm putting it on the record that yesterday we talked about marking hierarchical information between batches based on subset-superset relations.)


All that being said, my proposal is something like this:

$ tree .
.
├── batches
│   ├── batchX.txt
│   ├── batchY.txt
│   ├── newshour20.txt
│   └── ...  # more_batches
├── projectA-themeX-themeY
│   ├── 230510-batchX
│   │   ├── annotations
│   │   ├── metadata.yml
│   │   └── readme.md 
│   ├── 230530-batchY
│   │   ├── annotations
│   │   ├── metadata.yml
│   │   └── readme.md
│   ├── golds-themeX
│   │   ├── batchX
│   │   │   ├── cpb-aacip-xxx-a.ann
│   │   │   ├── cpb-aacip-xxx-b.ann
│   │   │   └── cpb-aacip-xxx-c.ann
│   │   └── batchY
│   │       ├── cpb-aacip-yyy-a.ann
│   │       ├── cpb-aacip-yyy-b.ann
│   │       ├── cpb-aacip-yyy-c.ann
│   │       └── cpb-aacip-yyy-d.ann
│   ├── golds-themeY
│   │   ├── batchX
│   │   │   ├── cpb-aacip-xxx-a.json
│   │   │   ├── cpb-aacip-xxx-b.json
│   │   │   └── cpb-aacip-xxx-c.json
│   │   └── batchY
│   │       ├── cpb-aacip-yyy-a.json
│   │       ├── cpb-aacip-yyy-b.json
│   │       ├── cpb-aacip-yyy-c.json
│   │       └── cpb-aacip-yyy-d.json
│   ├── guidelines.md  # version controlled by git
│   ├── metadata.yml
│   ├── process.py  # ... and more python modules if needed
│   └── readme.md
... and more projects

$ cat batches/batchX.txt
cpb-aacip-xxx-a
cpb-aacip-xxx-b
cpb-aacip-xxx-c

$ cat batches/newshour20.txt
cpb-aacip-507-gx44q7rk1p
cpb-aacip-507-9882j68s35
cpb-aacip-507-2804x55077
cpb-aacip-507-7659c6sk7z
cpb-aacip-507-zw18k75z4h
cpb-aacip-507-r785h7cp0z
cpb-aacip-507-m61bk17f5g
cpb-aacip-507-bc3st7ff6r
cpb-aacip-507-zk55d8pd1h
cpb-aacip-507-vm42r3pt6h
cpb-aacip-507-k649p2wz7p
cpb-aacip-507-pc2t43js98
cpb-aacip-507-1v5bc3tf81
cpb-aacip-507-n29p26qt59
cpb-aacip-507-4746q1t25k
cpb-aacip-507-cf9j38m509
cpb-aacip-525-3b5w66b279
cpb-aacip-525-9g5gb1zh9b
cpb-aacip-525-bg2h70914g
cpb-aacip-525-028pc2v94s
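With this layout, resolving a batch name to its GUIDs is trivial; a minimal sketch, assuming one GUID per line as in the listings above:

# sketch: resolve a batch name to its list of GUIDs under the proposed layout
from pathlib import Path
from typing import List

def read_batch(batch_name: str, repo_root: str = ".") -> List[str]:
    batch_file = Path(repo_root) / "batches" / f"{batch_name}.txt"
    return [line.strip() for line in batch_file.read_text().splitlines() if line.strip()]
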
marcverhagen commented 1 year ago

A few things written down, partially after a short in-person discussion this afternoon.

batches directory

Yes, I am more than okay with that, but I forgot to mention it in my previous comment. We can refer to a batch in batches either by including its name in the name of the batch under the project (as in the new proposal), or in the metadata of that batch, or both. Either way, we do not need sources.txt anymore.

Perhaps some subdivision inside batches would be nice, using the program name or some other organizing principle. I think in the long run we might be looking at hundreds or thousands of batches, and I don't like directories with too many files. Of course, we can solve that problem when we run into it rather than now, or consider it not to be a problem.

substructure in golds directory

Most likely not needed, and having it does give us the extra task of naming those directories. But it is possible that golds in a project may be of different kinds (.ann versus .mmif versus something else; at least, we do not disallow that), so within a batch we could have multiple files for each source.

By the way, I don't quite see how that structure is such a problem for the evaluator. I would hardly call one layer of subdirectories very complex, and what needs to be done is simply a matter of finding the files with the specified GUIDs in the golds directory, which is no more than a few lines of code.
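For example, those few lines could look roughly like this (a sketch, assuming at most one level of substructure and GUID-based file names):

# sketch: collect gold files for the given GUIDs, tolerating one level of
# substructure inside the golds directory (assumes GUID-based file names)
from pathlib import Path
from typing import Dict, List

def find_gold_files(golds_dir: str, guids: List[str]) -> Dict[str, List[Path]]:
    golds = Path(golds_dir)
    return {guid: sorted(golds.glob(f"**/{guid}*")) for guid in guids}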

One question on the code example

def run_eval(evaluator, pipeline_config, guids: List[str], golds: Path) -> Html:
    preds = run_pipeline(pipeline_config, guids)
    return evaluator(golds, preds)

Why is the guids argument there? If we have golds then we can automatically trace that to all we need.

> I don't see much value in adding metadata or a readme or anything like that under a gold directory.

Agreed, and I don't think anyone was suggesting that if the golds can be traced back to the batches, which they can in the last two proposals.

golds inside of batches

I see the point, mostly from a flexibility point of view. I now think golds should not be inside of batches, nor should batches be inside of golds, which is in line with the directory tree in the previous comment.

golds-themeX and golds-themeY

I think this is dangerous. In a previous comment the point was made that the themes are there for the joy of people creating project names, allowing you to stash information in the name, but that in no way should there be a registry of themes, nor should code ever be required to use the themes to find stuff. Similarly, the themes should not be used to dictate directory structure deeper down. It also assumes that the theme names somehow map to annotation categories, or to some other concept that is useful for grouping gold annotations, which we cannot assume because the themes are a free-for-all.

Instead of

.
└── golds-themeX
    ├── batchX
    │   ├── cpb-aacip-xxx-a.ann
    │   ├── cpb-aacip-xxx-b.ann
    │   └── cpb-aacip-xxx-c.ann
    └── batchY
        ├── cpb-aacip-yyy-a.ann
        ├── cpb-aacip-yyy-b.ann
        ├── cpb-aacip-yyy-c.ann
        └── cpb-aacip-yyy-d.ann

we have

.
├── golds-batchX
│   ├── cpb-aacip-xxx-a.ann
│   ├── cpb-aacip-xxx-b.ann
│   └── cpb-aacip-xxx-c.ann
└── golds-batchY
    ├── cpb-aacip-yyy-a.ann
    ├── cpb-aacip-yyy-b.ann
    ├── cpb-aacip-yyy-c.ann
    └── cpb-aacip-yyy-d.ann

But this does not allow us to have a file named cpb-aacip-xxx-a.json under two different themes, so some extra structure is needed for that.

retrieving golds 🦮

Maybe we should come up with a bunch of icons like this.

a final worry

We need to think about the process of how we select gold data given a pipeline we want to evaluate. The structure of this repo should support that.

keighrim commented 1 year ago

I wasn't implying that we maintain a systematic registry of "themes" in the above. When I was writing the run_eval pseudo-code, I was thinking of, for example:

projectX-thmX-thmY/annotations/batchX/two-different-annotations-done-in-single-pass-and-saved-in-single-file.json
projectX-thmX-thmY/annotations/batchY/two-different-annotations-done-in-single-pass-and-saved-in-single-file.json
projectX-thmX-thmY/process.py

Given these files (after an "upload" from an annotator), the next thing the annotation manager does will be

$ python projectX-thmX-thmY/process.py

And it can generate (as proposed above)

projectX-thmX-thmY/golds-thmX/batchX/cpb-xxxxx-*.ann
projectX-thmX-thmY/golds-thmX/batchY/cpb-xxxxx-*.ann
projectX-thmX-thmY/golds-thmY/batchX/cpb-xxxxx-*.json
projectX-thmX-thmY/golds-thmY/batchY/cpb-xxxxx-*.json

or

projectX-thmX-thmY/golds-just-a-name/batch{X,Y}/cpb-xxxxx-*.ann
projectX-thmX-thmY/golds-just-another-name/batch{X,Y}/cpb-xxxxx-*.json

or even simpler

projectX-thmX-thmY/golds-1/batch{X,Y}/cpb-xxxxx-*.ann
projectX-thmX-thmY/golds-2/batch{X,Y}/cpb-xxxxx-*.json

the -1, -2, -thmX, -thmY, -just-a-name, or -just-another-name suffixes in the golds-* directories don't mean much by themselves. What explicitly "explains" the content of those directories should be the README in the project root.

In fact, it can be a completely different structure, as long as we have all batches directly under one directory that can be passed to the run_eval invoker (as one single string that the gold(en) retriever can enjoy).

projectX-thmX-thmY/golds/thmX/batch{X,Y}/cpb-xxxxx-*.ann
projectX-thmX-thmY/golds/thmY/batch{X,Y}/cpb-xxxxx-*.json
# then a gold_url value will be `projectX-thmX-thmY/golds/thmX`

# or 

projectX/X/golds-batch{X,Y}/cpb-xxxxx-*.ann
projectX/Y/golds-batch{X,Y}/cpb-xxxxx-*.json
# then a gold_url value will be `projectX/X`

# but this won't work
projectX/golds-X-batchX/cpb-xxxxx-*.ann
projectX/golds-X-batchY/cpb-xxxxx-*.ann
projectX/golds-Y-batchX/cpb-xxxxx-*.json
projectX/golds-Y-batchY/cpb-xxxxx-*.json
# gold_url can't be `projectX` (it includes two different data types) nor `projectX/golds-X-batchX` (it misses other batches in the same theme)

> We need to think about the process of how we select gold data given a pipeline we want to evaluate. The structure of this repo should support that.

Agree. Right now, I imagine app developers browsing the annotation repository - by carefully reading all README files, or by programmatically grepping for some keywords - to find a relevant dataset that they can use for their app. Maybe annotation project managers can give some hints by adding some useful keywords to the project name. Or, if we publish the README files via Jekyll (which I believe we will via #17), we can add a search box over statically indexed README files (that's one of the beauties of static blog engines).

Once they find a dataset (suppose projectX-thmX-thmY/golds-just-another-name is a nice fit), they invoke an evaluation pipeline (by pressing an "invoker" button we haven't designed yet) with the proper arguments

run_eval(evaluator=eval_name, 
         pipeline_config=conf_name, 
         gold_url="https://aapb-as-dataset.clams.ai/projectX-thmX-thmY/golds-just-another-name")
# Note that evaluators are not a part of this repository, but should come from the evaluation repo
# Also note that pipeline_config is not a part of this repository, but should inherently come from the app itself, and MUST include the app
# Namely, dependency-wise, evaluators should be completely decoupled from the apps and should be able to generalize to evaluate different apps/workflows that output the same types of annotation.

(I think the "button" to be a github actions workflow)


> def run_eval(evaluator, pipeline_config, guids: List[str], golds: Path) -> Html:
>
> Why is the guids argument there? If we have golds then we can automatically trace that to all we need.

Yeah, that pseudo-code is from before the golden retriever was introduced.

Now, going back to the example with the :guide_dog: concept, run_eval needs to call :guide_dog: to obtain the gold files and the GUIDs for system predictions. And to make the life of :guide_dog: easier, I want the structure under projectX-thmX-thmY/golds-just-another-name to be as simple as possible.


> We need to think about the process of how we select gold data given a pipeline we want to evaluate. The structure of this repo should support that.

And by using GUIDs as file names, :guide_dog: is now free from the responsibility of obtaining the GUIDs (which are bound under batch names); its only responsibility is to download the files using a single string argument, manually "selected" by the app developer.

marcverhagen commented 1 year ago

A lot rides on the process.py script, I think. And if it needs to create substructure within golds, then I am fine with that.

What I find dangerous is using theme names to determine the directory structure, so I very much prefer the 2nd and 3rd options underneath "And it can generate (as proposed above)". For each project, the developer determines how to distribute annotations under the golds section, as long as it follows a few simple rules on directory structure.

From last week's discussion I think we agree that a fully automatic mapping from pipeline to projects and golds is not realistic, but once the evaluator (the person) has determined what projects are relevant, the evaluator (the program) should have an easy time finding the relevant gold data.