ta4tsering opened 1 year ago
@eroux regarding the order, @ngawangtrinley is suggesting to collect from all the works in BDRC
@ta4tsering we need to filter only Tibetan works. @eroux is it possible to get that from the TTL?
I think a better system would be to have a good balance between:
perhaps starting with 500 of each. If/when that finishes, we can start thinking about the next steps. That's the kind of proposal I was expecting...
@eroux Can I ask why we need Burmese?
because we want to do the exact same thing (layout detection, OCR, OCR cleanup, etc.) for all languages. It's a comparatively minor cost with a lot of potential benefits
Can you suggest a way to classify the works into the above-mentioned types of prints or manuscripts? Can we get that from the TTL file of the work? For example, for modern prints I can use bdo:printMethod bdr:PrintMethod_Modern from the TTL.
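For illustration only, a filter along these lines could read the print method straight from a work's TTL with rdflib; the namespace URIs, the bdr:PrintMethod_Manuscript value and the file name here are my assumptions, not something decided in this thread:

```python
# Rough sketch, not the agreed pipeline: read bdo:printMethod from a TTL file.
from rdflib import Graph, Namespace

BDO = Namespace("http://purl.bdrc.io/ontology/core/")   # assumed bdo: prefix
BDR = Namespace("http://purl.bdrc.io/resource/")        # assumed bdr: prefix

def print_methods(ttl_path):
    """Return the set of bdo:printMethod values found in the file."""
    g = Graph()
    g.parse(ttl_path, format="turtle")
    return set(g.objects(predicate=BDO.printMethod))

methods = print_methods("W12345.ttl")                    # hypothetical file name
if BDR.PrintMethod_Modern in methods:
    print("modern print")
elif BDR.PrintMethod_Manuscript in methods:              # assumed manuscript value
    print("manuscript")
```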
basically you have to understand that the end goal is not annotations for annotations' sake. The goal is to have a dataset that we can use to train a model that will do some layout detection. Producing the best dataset for such a model should be the number 1 requirement. Now, there are several ways of creating such a dataset, and I feel that right now we haven't even touched the question of what we wanted. I think it would be reasonable to have in fact 3 datasets:
I can work on something like that, it's not particularly straightforward but it can be ready in a few days
constituting this dataset is part of the work, and people in different domains should be consulted:
the work consists in:
For layout detection I guess we can do everything regardless of the language. For OCR we will have to limit ourselves to Tibetan since that's what the grant is for.
sure, OCR for non-Tibetan script is out of the scope
(BTW, the estimates for the rest of the work are totally off I think; it could take just 2h total, the juicy part is the dataset)
@eroux this RFC is not finalized! It is in the process of "request for comments" and we are consulting you and other specialists before finalizing the work plan and starting to code. Please don't hesitate to give feedback and opinion and we'll integrate it in our plan.
as requested, here are a few collections that I think could be good:
that's all I can think of right now, but don't hesitate to add to it!
@eroux thank you for the list
@eroux is it a good idea to zip the images, save them in a GitHub directory, and share the zip file link with the annotator?
this will be too big for GitHub, I think OpenPecha needs a way to provide files online. S3 is a good solution for that: you can upload a zip to S3 and send a URL with a token that's valid 1 week or so. This works well at scale
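A minimal sketch of that approach with boto3, assuming a hypothetical bucket and key; S3 presigned URLs can be valid for at most 7 days with SigV4, which matches the one-week window mentioned above:

```python
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "openpecha-annotation",              # placeholder bucket
            "Key": "collections/collection1_thumbs.zip"},  # placeholder key
    ExpiresIn=7 * 24 * 3600,                                # one week, the maximum allowed
)
print(url)  # share this link with the annotator
```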
ok
Also, unrelated to the thumbnails issue, we need to determine the URL scheme of the various instances of Prodigy. Currently the instance for cropping is on https://prodigy.bdrc.io, but where would we put the instance for layout? Perhaps a scheme like prodigy.bdrc.io/{recipe_name}/ (so https://prodigy.bdrc.io/bdrc_crop/ and https://prodigy.bdrc.io/layout_analysis/) would be good?
@eroux I agree with the naming. @ta4tsering what do you think?
yeah, the naming sounds good. I have just changed the port in the nginx configuration file layout_analysis.conf and the Prodigy configuration file layout_analysis.json to port 8090, restarted the server to test, and it works fine. Now I will look into the URL scheme.
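For reference, a hypothetical layout_analysis.conf along these lines would implement that scheme; the bdrc_crop port and the TLS details are placeholders, and whether Prodigy's web app behaves well under a sub-path would still need testing:

```nginx
server {
    server_name prodigy.bdrc.io;

    # layout annotation instance, listening on port 8090 as mentioned above
    location /layout_analysis/ {
        proxy_pass http://127.0.0.1:8090/;
        proxy_set_header Host $host;
    }

    # cropping instance (port is a placeholder)
    location /bdrc_crop/ {
        proxy_pass http://127.0.0.1:8080/;
        proxy_set_header Host $host;
    }
}
```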
Have you decided upon a layout scheme for annotating images, i.e. which elements you will annotate, etc.? Moreover, can Prodigy handle multi-class annotation/tagging, so that you can also add a script tag to an annotated line?
@eric86y see https://github.com/OpenPecha/prodigy-tools/issues/10 for the current schema. @kaldan007 perhaps this could be integrated in the RFC?
No lines?
@eric86y are we supposed to include lines? We were thinking of them as separate. Can they be together? If yes, we will update the UI accordingly
I think it depends on how the annotation pipeline is organized; if you split this then it is ok. But for doing OCR that is based on lines, you'll need robust line detection as well. This can theoretically be done by another team. I was just considering tagging the script type in the process, to train a script classifier down the road.
I think the plan is to have different steps:
- one pass to detect pages only
- one pass to detect layout features only
- one pass to detect lines of the main text area
this would be done in 3 different prodigy instances
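Concretely, the three passes could map to three image.manual sessions roughly like this; the dataset names, source directories and label sets are placeholders, not the schema from issue #10:

```bash
prodigy image.manual pages_pass  ./thumbnails --loader images --label PAGE
prodigy image.manual layout_pass ./pages      --loader images --label TEXT_AREA,MARGIN,ILLUSTRATION
prodigy image.manual lines_pass  ./text_areas --loader images --label LINE
```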
yes we are planning to go this way
@kaldan007 just as a general remark: the RFC is 40% filler (everything in italic), 60% information, perhaps we could either transform the filler into information or remove it? it will make this RFC thing much more appealing (it currently feels like a normal github issue + some unnecessary copy/paste of a template to make it look professional)
My main point though is the following: please create the zip files for the collection on an AWS EC2 instance in the us-east-1 zone (like the prodigy server) so that we can minimize the transfer cost. Thanks!
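As a sketch of what that could look like on the EC2 box (the bucket names, the work ID and the thumbnail size are placeholders):

```bash
# run on the us-east-1 instance so the images never leave the region
aws s3 cp s3://bdrc-image-bucket/Works/W12345/ ./W12345/ --recursive
mogrify -resize 1024x ./W12345/*.jpg      # ImageMagick, downscale in place
zip -r W12345_thumbnails.zip ./W12345/
aws s3 cp W12345_thumbnails.zip s3://openpecha-annotation/collections/
```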
@eroux I one hundred percent agree with you. @ngawangtrinley we definitely need to simplify a bit. I am getting complaints from our team also.
Regarding the zip image collection, @ngawangtrinley says he has images of those collections on a hard drive. So rather than downloading, we thought to make the thumbnails from his system. Each collection will have one repo with an empty folder named unique_images
where annotators are supposed to copy-paste the unique images. The thumbnail zip file will be added to the release of that repo, and the downloadable link will be added to the README of that repo with instructions to select images.
well, what we need is an automated pipeline that we can trigger. Relying on NT maybe having some images on his hard drive may work for the first 10 collections, but it doesn't look like this can be used in an automatic workflow... that's just my 2c
Shall we do the mentioned procedure for the first 10 collections? While our annotators are occupied, we can assign the fully automated workflow to a developer. What do you think? @eroux
sure!
Housekeeping
[RFC0020]: Layout annotating tool
Named Concepts
- prodigy: annotation tool
- layout analysis model: a model which detects the different layout components in an image
- OCR: Optical Character Recognition
Summary
Making an instance for annotating the layout of BDRC images using prodi.gy
Reference-Level Explanation
In order to get diverse images for annotating:
We should launch a new instance for layout analysis, which means:
- Annotator will annotate the images
- ML Engineer will train one model per collection and one model with all data combined
- ML Engineer sends images for reviewing --> streamed to Prodigy (see the recipe sketch below)
- Re-train, test, etc. until the model performs very well
- Move on to the next collections
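As a very rough sketch of the review step (streaming images back into Prodigy), a custom recipe could look something like this; the recipe name, source directory and labels are made up for illustration and are not the planned implementation:

```python
import prodigy
from prodigy.components.loaders import Images

@prodigy.recipe("layout-review")                  # hypothetical recipe name
def layout_review(dataset, source):
    return {
        "dataset": dataset,                       # where reviewed annotations are saved
        "stream": Images(source),                 # yields {"image": ...} tasks from a folder
        "view_id": "image_manual",
        "config": {"labels": ["TEXT_AREA", "MARGIN", "LINE"]},  # placeholder labels
    }
```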
Our annotating UI will be like this:
Alternatives
Rationale
Drawbacks
Useful References
Unresolved Questions
Parts of the System Affected
Future possibilities
Infrastructure
Testing
Documentation
Version History
v.0.2
Recordings
Meeting minutes (@eric86y, @eroux, @ngawangtrinley, @ta4tsering, @kaldan007)
Work Phases
We should launch a new instance for layout analysis, which means:
Non-Coding
Implementation