Suggestions for annotating a new corpus

AngledLuffa commented 1 year ago

I am wondering, what advice is there for starting a new corpus? Is there a guide for doing so?

There is a team in Pakistan at Isra University who would like to see more Sindhi NLP tools in Stanza, and one of the ways we could make that happen is by annotating more raw data. (Currently there is not very much Sindhi in UD.) We're able to find an annotation company with some amount of linguistics knowledge in Sindhi, and of course there are people at Isra who would put together a schema, review annotations, and possibly do some annotation as well.

There's already some tokenized data, so that should be taken care of. I believe the next step would be to label it with POS and dependencies.

Would it make sense to:

come up with an initial schema for dependencies, possibly with some sentences analyzed
pass this guide to the annotators with a portion of the data
see what comes back, correct errors when possible
use this to produce silver dependencies which the annotation team can correct, hopefully making the task easier
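
One way to judge whether the silver step is saving the annotators work is to compare the silver parse with the corrected version and count how many heads and labels had to change. The following is a minimal sketch in Python, assuming both versions are CoNLL-U with identical tokenization. The function names and the toy sentence are illustrative only; the forms `w1`..`w4` are placeholders, not Sindhi data.

```python
def read_conllu(text):
    """Parse CoNLL-U text into sentences of (id, form, upos, head, deprel) tuples.
    Multiword-token ranges (1-2) and empty nodes (1.1) are skipped."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue
        current.append((int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(silver, corrected):
    """Return (UAS, LAS) of the silver parse measured against the corrected parse."""
    total = uas = las = 0
    for s_sent, c_sent in zip(silver, corrected):
        if [t[1] for t in s_sent] != [t[1] for t in c_sent]:
            raise ValueError("tokenization mismatch between silver and corrected data")
        for s, c in zip(s_sent, c_sent):
            total += 1
            if s[3] == c[3]:          # same head
                uas += 1
                if s[4] == c[4]:      # same head and same relation label
                    las += 1
    return uas / total, las / total


def make_conllu(rows):
    """Build a one-sentence CoNLL-U string from (form, upos, head, deprel) rows."""
    lines = []
    for i, (form, upos, head, deprel) in enumerate(rows, start=1):
        lines.append("\t".join([str(i), form, "_", upos, "_", "_", str(head), deprel, "_", "_"]))
    return "\n".join(lines) + "\n\n"


# Toy example: the annotator changed one label (w2) and one head (w4).
silver_text = make_conllu([("w1", "NOUN", 3, "nsubj"), ("w2", "NOUN", 3, "obj"),
                           ("w3", "VERB", 0, "root"), ("w4", "PUNCT", 2, "punct")])
corrected_text = make_conllu([("w1", "NOUN", 3, "nsubj"), ("w2", "NOUN", 3, "obl"),
                              ("w3", "VERB", 0, "root"), ("w4", "PUNCT", 3, "punct")])

uas, las = attachment_scores(read_conllu(silver_text), read_conllu(corrected_text))
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

If these scores rise from one annotation batch to the next, the silver parses are probably getting closer to what the annotators want; if they stay low, correcting silver trees may not be much faster than annotating from scratch.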

Are there better approaches for getting annotators who may not be very familiar with dependencies to label things? For example, I could also imagine breaking sentences into phrases and then trying to describe the relations between phrases as an easier approach for getting high quality annotations. That almost sounds like constituencies, for that matter, so perhaps it would be easier to build a constituency dataset and convert that to dependencies in some way.

@muteeurahman

dan-zeman commented 1 year ago

There is already a Sindhi dataset in the UD GitHub by @mazharaliabro. It has never been released, primarily because it does not have dependencies. But it is 675 sentences / 6863 tokens with UPOS tags and some features. I suppose someone could use it to train a tagger and apply it to the new data. It should be checked whether the tokenization is compatible.
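
A rough first check of tokenization compatibility could compare how the two datasets treat punctuation, for example how often a token mixes punctuation with other characters. This is only a heuristic sketch in Python: the function names are made up for illustration, the toy sentences use Latin placeholders rather than Sindhi text, and a real comparison would need to look at clitics and suffixes as well.

```python
import unicodedata


def is_punct(ch):
    """True if the character is in a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def mixed_punct_rate(tokens):
    """Fraction of tokens containing both punctuation and non-punctuation characters."""
    if not tokens:
        return 0.0
    mixed = sum(
        1 for t in tokens
        if any(is_punct(c) for c in t) and not all(is_punct(c) for c in t)
    )
    return mixed / len(tokens)


def corpus_stats(sentences):
    """Summarize a corpus given as a list of sentences, each a list of token strings."""
    tokens = [tok for sent in sentences for tok in sent]
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "mixed_punct_rate": mixed_punct_rate(tokens),
    }


# Toy data: the tagged set splits off final punctuation, the new set does not.
tagged = [["a", "b", "."], ["c", "d", "?"]]
new = [["a", "b."], ["c", "d?"]]

print(corpus_stats(tagged))
print(corpus_stats(new))
```

A large gap between the two rates would suggest that the datasets follow different conventions and that a tagger trained on one may not transfer cleanly to the other.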

Regarding dependencies, I imagine that a parser based on XLM-RoBERTa (it seems to contain Sindhi) and a mixture of existing UD treebanks (in the spirit of UDify) could produce something that the annotators could use.

With inexperienced annotators it may be even more advisable to implement a language-specific validator that will check patterns that the universal validator cannot check.
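
As a sketch of what such a language-specific validator might look like, the Python below runs a list of rule functions over one sentence and collects the violations. Everything here is an assumption for illustration: the `Token` class, the rule list, and in particular the single rule shown (that a case-marking ADP should follow its head), which would have to be confirmed or replaced by the people writing the Sindhi guidelines.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int
    form: str
    upos: str
    head: int
    deprel: str


def rule_case_follows_head(tokens):
    """HYPOTHETICAL rule: flag a case-marking ADP that precedes its head.
    Whether this pattern is really an error is for the guideline authors to decide."""
    return [
        (t.id, "case-marking ADP precedes its head")
        for t in tokens
        if t.upos == "ADP" and t.deprel == "case" and 0 < t.id < t.head
    ]


# Rules agreed on by the annotation team would be registered here.
LANGUAGE_RULES = [rule_case_follows_head]


def validate(tokens):
    """Run every language-specific rule and return (token id, message) pairs."""
    errors = []
    for rule in LANGUAGE_RULES:
        errors.extend(rule(tokens))
    return errors


# Placeholder forms, not Sindhi data.
ok_sentence = [
    Token(1, "w1", "NOUN", 3, "obl"),
    Token(2, "w2", "ADP", 1, "case"),
    Token(3, "w3", "VERB", 0, "root"),
]
flagged_sentence = [
    Token(1, "w1", "ADP", 2, "case"),
    Token(2, "w2", "NOUN", 3, "obl"),
    Token(3, "w3", "VERB", 0, "root"),
]

print(validate(ok_sentence))
print(validate(flagged_sentence))
```

Keeping each rule as a small function makes it easy to add or drop checks as the guidelines evolve, and the messages can be shown to annotators directly.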

dan-zeman commented 1 year ago

starting a new corpus? Is there a guide for doing so?

Yes, there is this. But every language is special and there are huge differences in what resources already exist and can be potentially used.

meesumalam commented 1 year ago

@AngledLuffa I am working on UD for the Saraiki language, which is closely related to Sindhi.

I am a PhD student in computational linguistics at Indiana University, and would be happy to share my thoughts on this project. Thanks.

AngledLuffa commented 1 year ago

@dan-zeman Thank you for the link and the suggested starting point. I would worry about how much Sindhi data is really in XLM - looking over other multilingual transformers which include Sindhi, they generally have very little raw text. The idea of knowledge transfer from an existing language is an interesting one.

We had noticed the unfinished Sindhi dataset. I'm not sure what the current expectation is in terms of how finished we think the upos tagging & featurization is. Depending on how much we want to use it, there may already be enough to start a tagger. Not having dependencies will be a bit of a limitation at first, I would expect.

@meesumalam Thank you for the suggestion. Would it make sense to connect you directly with @muteeurahman? I am curious what you've found in terms of raw text for annotating or building language models, especially if you've come across such data in Sindhi. There is a limited amount of data in Common Crawl or Wikipedia for Sindhi, and I would expect even less for Saraiki (I don't see it listed in the OSCAR version of CC, for example).

meesumalam commented 1 year ago

Right, Saraiki doesn't have much data compared to Sindhi.

You can reach me at meealam@iu.edu for further discussion on the topic.

Thanks

muteeurahman commented 1 year ago

@meesumalam Thanks for your suggestions. Since Saraiki is closely related to Sindhi, we are likely to run into similar issues, such as crossing (nonprojective) dependencies and feature complexities due to pronominal suffixes. Once we reach those problems, we can have some interesting discussions. Dr. Tafseer told me about someone working on Saraiki dependencies; most probably he was talking about you.
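
For reference, crossing dependencies can be detected mechanically once trees exist. The Python below is a minimal sketch, assuming a sentence is given as a list of head indices (1-based, with 0 for the root); the function name and the example trees are illustrative, not taken from any Sindhi or Saraiki data.

```python
def crossing_arcs(heads):
    """Return pairs of arcs that cross each other.

    heads[i] is the head of token i+1 (0 marks the root). Each arc is stored as
    (left position, right position). The artificial arc from 0 to the root token
    is ignored in this sketch.
    """
    arcs = [
        (min(h, d), max(h, d))
        for d, h in enumerate(heads, start=1)
        if h != 0
    ]
    crossings = []
    for i, (a, b) in enumerate(arcs):
        for c, d in arcs[i + 1:]:
            # Two arcs cross when exactly one endpoint of one lies strictly inside the other.
            if a < c < b < d or c < a < d < b:
                crossings.append(((a, b), (c, d)))
    return crossings


projective = [2, 0, 2]        # tokens 1 and 3 attach to token 2
nonprojective = [3, 4, 0, 3]  # arc 1-3 crosses arc 2-4

print(crossing_arcs(projective))
print(crossing_arcs(nonprojective))
```

Running a check like this over annotated batches would show how common nonprojective structures actually are, which may help decide how much space the guidelines should give them.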

meesumalam commented 1 year ago

Yes, I had a conversation with Dr Tafseer and told him about my UD Saraiki work.

I think it would be great if we could have a meeting via Zoom to discuss and decide on the complex structures of Sindhi.

Thanks, Meesum

UniversalDependencies / docs

Suggestions for annotating a new corpus #993