ta4tsering opened 1 year ago
@eroux regarding the order, @ngawangtrinley is suggesting to collect from all the works in BDRC
@ta4tsering we need to filter only Tibetan works. @eroux is it possible to get that from the TTL?
I think a better system would be to have a good balance between:
perhaps starting with 500 of each. If/when that finishes, we can start thinking about the next steps. That's the kind of proposal I was expecting...
@eroux Can I ask why we need Burmese?
because we want to do the exact same thing (layout detection, OCR, OCR cleanup, etc.) for all languages. It's a comparatively minor cost with a lot of potential benefits
Can you suggest a way to classify the works into the above-mentioned types of prints or manuscripts? Can we get that from the TTL file of the work? For example, for modern prints I can use bdo:printMethod bdr:PrintMethod_Modern from the TTL.
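For illustration only, a filter along these lines could read the print method straight from a work's TTL with rdflib; the namespace URIs, the bdr:PrintMethod_Manuscript value and the file name here are my assumptions, not something decided in this thread:

```python
# Rough sketch, not the agreed pipeline: read bdo:printMethod from a TTL file.
from rdflib import Graph, Namespace

BDO = Namespace("http://purl.bdrc.io/ontology/core/")   # assumed bdo: prefix
BDR = Namespace("http://purl.bdrc.io/resource/")        # assumed bdr: prefix

def print_methods(ttl_path):
    """Return the set of bdo:printMethod values found in the file."""
    g = Graph()
    g.parse(ttl_path, format="turtle")
    return set(g.objects(predicate=BDO.printMethod))

methods = print_methods("W12345.ttl")                    # hypothetical file name
if BDR.PrintMethod_Modern in methods:
    print("modern print")
elif BDR.PrintMethod_Manuscript in methods:              # assumed manuscript value
    print("manuscript")
```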
basically you have to understand that the end goal is not annotations for annotations' sake. The goal is to have a dataset that we can use to train a model that will do some layout detection. Producing the best dataset for such a model should be the number 1 requirement. Now, there are several ways of creating such a dataset, and I feel that right now we haven't even touched the question of what we wanted. I think it would be reasonable to have in fact 3 datasets:
I can work on something like that, it's not particularly straightforward but it can be ready in a few days
constituting this dataset is part of the work, and people in different domains should be consulted:
the work consists in:
For layout detection I guess we can do everything regardless of the language. For OCR we will have to limit ourselves to Tibetan since that's what the grant is for.
sure, OCR for non-Tibetan script is out of the scope
(BTW, the estimates for the rest of the work are totally off I think; it could take just 2h total, the juicy part is the dataset)
@eroux this RFC is not finalized! It is in the process of "request for comments" and we are consulting you and other specialists before finalizing the work plan and starting to code. Please don't hesitate to give feedback and opinion and we'll integrate it in our plan.
as requested, here are a few collections that I think could be good:
that's all I can think of right now, but don't hesitate to add to it!
@eroux thank you for the list
@eroux is it a good idea to zip the images, save them in a GitHub directory, and share the zip file link with the annotator?
this will be too big for GitHub, I think OpenPecha needs a way to provide files online. S3 is a good solution for that: you can upload a zip to S3 and send a URL with a token that's valid 1 week or so. This works well at scale
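A minimal sketch of that approach with boto3, assuming a hypothetical bucket and key; S3 presigned URLs can be valid for at most 7 days with SigV4, which matches the one-week window mentioned above:

```python
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "openpecha-annotation",              # placeholder bucket
            "Key": "collections/collection1_thumbs.zip"},  # placeholder key
    ExpiresIn=7 * 24 * 3600,                                # one week, the maximum allowed
)
print(url)  # share this link with the annotator
```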
ok
Also, unrelated to the thumbnails issue, we need to determine the URL scheme of the various instances of Prodigy. Currently the instance for cropping is on https://prodigy.bdrc.io, but where would we put the instance for layout? Perhaps a scheme like prodigy.bdrc.io/{recipe_name}/ (so https://prodigy.bdrc.io/bdrc_crop/ and https://prodigy.bdrc.io/layout_analysis/) would be good?
@eroux I agree with the naming. @ta4tsering what do you think?
yeah, the naming sounds good. I have just changed the port in the nginx configuration file layout_analysis.conf and the Prodigy configuration file layout_analysis.json to port 8090, restarted the server to test, and it works fine. Now I will look into the URL scheme.
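For reference, a hypothetical layout_analysis.conf along these lines would implement that scheme; the bdrc_crop port and the TLS details are placeholders, and whether Prodigy's web app behaves well under a sub-path would still need testing:

```nginx
server {
    server_name prodigy.bdrc.io;

    # layout annotation instance, listening on port 8090 as mentioned above
    location /layout_analysis/ {
        proxy_pass http://127.0.0.1:8090/;
        proxy_set_header Host $host;
    }

    # cropping instance (port is a placeholder)
    location /bdrc_crop/ {
        proxy_pass http://127.0.0.1:8080/;
        proxy_set_header Host $host;
    }
}
```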
Have you decided upon a layout scheme for annotating images, i.e. which elements you will annotate, etc.? Moreover, can Prodigy handle multi-class annotation/tagging, so that you can also add a script tag to an annotated line?
@eric86y see https://github.com/OpenPecha/prodigy-tools/issues/10 for the current schema. @kaldan007 perhaps this could be integrated in the RFC?
No lines?
@eric86y are we supposed to include lines? We were thinking of them as separate. Can they be together? If yes, we will update the UI accordingly
I think it depends on how the annotation pipeline is organized; if you split this then it is ok. But for doing OCR that is based on lines, you'll need robust line detection as well. This can theoretically be done by another team. I was just considering tagging the script type in the process, to train a script classifier down the road.
I think the plan is to have different steps:
- one pass to detect pages only
- one pass to detect layout features only
- one pass to detect lines of the main text area
this would be done in 3 different prodigy instances
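Concretely, the three passes could map to three image.manual sessions roughly like this; the dataset names, source directories and label sets are placeholders, not the schema from issue #10:

```bash
prodigy image.manual pages_pass  ./thumbnails --loader images --label PAGE
prodigy image.manual layout_pass ./pages      --loader images --label TEXT_AREA,MARGIN,ILLUSTRATION
prodigy image.manual lines_pass  ./text_areas --loader images --label LINE
```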
yes we are planning to go this way
@kaldan007 just as a general remark: the RFC is 40% filler (everything in italic), 60% information, perhaps we could either transform the filler into information or remove it? it will make this RFC thing much more appealing (it currently feels like a normal github issue + some unnecessary copy/paste of a template to make it look professional)
My main point though is the following: please create the zip files for the collection on an AWS EC2 instance in the us-east-1 zone (like the prodigy server) so that we can minimize the transfer cost. Thanks!
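As a sketch of what that could look like on the EC2 box (the bucket names, the work ID and the thumbnail size are placeholders):

```bash
# run on the us-east-1 instance so the images never leave the region
aws s3 cp s3://bdrc-image-bucket/Works/W12345/ ./W12345/ --recursive
mogrify -resize 1024x ./W12345/*.jpg      # ImageMagick, downscale in place
zip -r W12345_thumbnails.zip ./W12345/
aws s3 cp W12345_thumbnails.zip s3://openpecha-annotation/collections/
```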
@eroux I one hundred percent agree with you. @ngawangtrinley we definitely need to simplify a bit. I am getting complaints from our team also.
Regarding the zip image collection, @ngawangtrinley says he has images of those collections on a hard drive. So rather than downloading, we thought to make the thumbnails from his system. Each collection will have one repo with an empty folder named unique_images
where annotators are supposed to copy-paste the unique images. The thumbnail zip file will be added to the release of that repo, and the downloadable link will be added to the README of that repo with instructions to select images.
well, what we need is an automated pipeline that we can trigger. Relying on NT maybe having some images on his hard drive may work for the first 10 collections, but it doesn't look like this can be used in an automatic workflow... that's just my 2c
Shall we do the mentioned procedure for the first 10 collections? While our annotators are occupied, we can assign the fully automated workflow to a developer. What do you think? @eroux
sure!
Housekeeping
[RFC0020]: Layout annotating tool
Named Concepts
- prodigy: annotation tool
- layout analysis model: a model which detects the different layout components in an image
- OCR: Optical Character Recognition
Summary
Making an instance for annotating the layout of BDRC images using prodi.gy
Reference-Level Explanation
In order to get diverse images for annotating:
We should launch a new instance for layout analysis, which means:
- Annotator will annotate the images
- ML Engineer will train one model per collection and one model with all data combined
- ML Engineer sends images for reviewing --> streamed to Prodigy (see the recipe sketch below)
- Re-train, test, etc. until the model performs very well
- Move on to the next collections
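As a very rough sketch of the review step (streaming images back into Prodigy), a custom recipe could look something like this; the recipe name, source directory and labels are made up for illustration and are not the planned implementation:

```python
import prodigy
from prodigy.components.loaders import Images

@prodigy.recipe("layout-review")                  # hypothetical recipe name
def layout_review(dataset, source):
    return {
        "dataset": dataset,                       # where reviewed annotations are saved
        "stream": Images(source),                 # yields {"image": ...} tasks from a folder
        "view_id": "image_manual",
        "config": {"labels": ["TEXT_AREA", "MARGIN", "LINE"]},  # placeholder labels
    }
```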
Our annotating UI will be like this:
Alternatives
Rationale
Drawbacks
Useful References
Unresolved Questions
Parts of the System Affected
Future possibilities
Infrastructure
Testing
Documentation
Version History
v.0.2
Recordings
Meeting minutes (@eric86y, @eroux, @ngawangtrinley, @ta4tsering, @kaldan007)
Work Phases
We should launch a new instance for layout analysis, which means:
Non-Coding
Implementation