clamsproject / aapb-annotations

Repository to store manual annotation dataset developed for CLAMS-AAPB collaboration

Add data + documentation + process.py for new RFB seqtag annotation #82

Closed (keighrim closed 3 weeks ago)

keighrim commented 1 month ago

Because

The new RFB model development uses a different annotation method, LLM-based sequence tagging with human correction, to create new training data. The new annotation should therefore be documented and published in this repo.

Note that the old RFB annotation (#65) contains fully manual annotations with OCR error correction. It was developed for evaluation and will be kept for that purpose.

Done when

Additional context

No response

wricketts commented 1 month ago

I have a block on this.

In short, the rules in the aapb-annotations repo require a specific format for gold files before being fed to a model, and we weren't following this format in the process of training our model.

Our annotation effort resulted in a raw CSV containing one row per OCR result (meaning multiple rows per guid). Our processing code essentially turns each annotation into a sequence of strings and a sequence of tags for modeling. We then shuffled the rows, partitioned them into train/val/test splits, and converted each partition into a JSONL format that does not include the guids. These JSONL files were fed to the model.
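For readers who want a concrete picture, the pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`guid`, `text`, `tags`), the whitespace tokenization, the 80/10/10 ratios, and the seed are all assumptions made for the example.

```python
import json
import random


def to_sequences(rows):
    """Turn each annotation row into a token sequence and a tag sequence.

    The "guid", "text", and "tags" keys are hypothetical column names.
    The guid is deliberately dropped, as in the process described above.
    """
    records = []
    for row in rows:
        tokens = row["text"].split()
        tags = row["tags"].split()
        if len(tokens) != len(tags):
            raise ValueError(f"token/tag length mismatch in row for {row['guid']}")
        records.append({"tokens": tokens, "tags": tags})
    return records


def split_records(records, seed=42, train=0.8, val=0.1):
    """Shuffle rows and split them randomly, ignoring which guid they came from."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def to_jsonl(records):
    """Serialize records as JSONL text, one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)
```

Because the split is made per row, rows from the same guid can end up in different partitions, which is exactly what a one-file-per-guid layout cannot express.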

The aapb-annotations repo wants the golds directory to have one CSV per guid. While it is possible to write a new process.py that does this, the output format would no longer replicate our process (i.e. separating data points randomly as opposed to by their guid). It also wouldn't be in a format ready for model ingestion, as the README suggests. If one wanted to retrain the model as we did, they would need to "undo" the gold formatting by putting all the data back into one file, shuffling it, and making data splits, which I think would defeat the purpose of having the gold format.
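To make the mismatch concrete, here is a rough sketch of the round trip this would imply. It is illustrative only: the `guid` key and both function names are hypothetical, and the rows are kept in memory rather than written to per-guid CSV files.

```python
from collections import defaultdict


def group_by_guid(rows):
    """Arrange rows the way a one-file-per-guid gold layout would."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["guid"]].append(row)
    return dict(grouped)


def pool_for_training(per_guid):
    """Undo the per-guid layout by pooling every row back into one list.

    The pooled list would still need to be shuffled and split again
    before it matches the data the model was actually trained on.
    """
    return [row for guid in sorted(per_guid) for row in per_guid[guid]]
```

Anyone retraining from such golds would have to run the second step and then redo the random shuffle and split, so the gold files themselves would not be model-ready.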

@keighrim @haydenmccormick Do you have any suggestions? Also I'm happy to meet to talk about this if that's easier.

keighrim commented 1 month ago

Yeah, I understand the annotation in this project wasn't done on a per-media-file basis. I think we can omit the "gold" portion for now, but leave a short note in the documentation on why the gold is missing (or impossible) in this context.