acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Add language field for papers? #794

Open nschneid opened 4 years ago

nschneid commented 4 years ago

Though the papers in the Anthology are predominantly in English, there are a number of papers in languages such as French, Chinese, and Nordic languages. Sometimes the title is suffixed with the name of the language in brackets, but not always.

Wouldn't it be good to have a metadata field specifying the language? Citation managers such as Zotero already support a language field. Because the range of languages that the papers are written in is limited, heuristic language ID should be easy.
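To make the "heuristic language ID" idea concrete, here is a minimal sketch that guesses a language from script and function words. The word lists, the function name `guess_language`, and the choice of three-letter codes are all assumptions for illustration; nothing here is taken from the Anthology code base.

```python
import re

# Hypothetical, hand-picked function words; a real system would use a trained
# model or much larger lists.
STOPWORDS = {
    "eng": {"the", "of", "and", "for", "in", "with", "to", "a", "on"},
    "fra": {"le", "la", "les", "des", "de", "et", "pour", "du", "une", "un"},
    "deu": {"der", "die", "das", "und", "für", "von", "mit", "ein", "eine"},
    "por": {"o", "os", "as", "do", "da", "em", "para", "uma", "um", "e"},
}

CJK = re.compile(r"[\u4e00-\u9fff]")


def guess_language(text, default="eng"):
    """Return a three-letter code guessed from script and function words."""
    # Any Han character is treated as Chinese (an assumption that ignores
    # Japanese text, which also uses Han characters).
    if CJK.search(text):
        return "zho"
    tokens = re.findall(r"\w+", text.lower())
    # Score each language by how many tokens are in its function-word list.
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default when no function word was seen at all.
    return best if scores[best] > 0 else default
```

Titles are short, so a heuristic like this would probably work better on abstracts or full text, and its guesses would still need spot checks.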

knmnyn commented 4 years ago

Hi @nschneid glad to see this dialog and push! Steven Bird, the creator of the Anthology, had always wanted per-publication metadata information on two aspects of language:

Marco Lui and Tim Baldwin have the langid.py library that they had written up in ACL 2012 that would be good as starting point. What do you think?

It was suggested at that time that we do this using ISO 639 language codes. We introduced the square bracket notation along with some of the ROCLING proceedings around 2008, but this was meant to be a stopgap solution and is a bit unsatisfying.

nschneid commented 4 years ago

Thanks for the context!

  • which languages were being targeted in the paper (e.g., which language was the corpus/corpora in?)

I agree that this would be great, but it would be challenging, especially for papers that use large multilingual corpora and do not list the languages. It seems to me like a research project to develop this capability (if there has been such research I haven't seen it). It could involve looking at citations to corpus resources, for example. Whereas identifying the language the paper is written in should be fairly trivial.

nschneid commented 4 years ago

This paper looks at language mentions in the Anthology: https://arxiv.org/abs/2004.09095

mbollmann commented 4 years ago

That would be a cool feature! For papers that already have that information in the title (example), we could simply move this into metadata fields; i.e. we could also move the English translation of the title into a new metadata field.

mjpost commented 4 years ago

I'd love to have this. It would fit well with metadata of other types, such as best paper awards (#240).

The real question is getting the work done, but it would be prudent in the meantime to get the technical piece in place. At the very least, people could then commit manual annotations as they come across them.

Should we use separate yaml files (e.g., data/yaml/tags/language.yaml), or annotate the XML? My inclination is the XML. We could define separate label tags (e.g., <language>) or a more general <label type=language> tag.

mbollmann commented 4 years ago

Should we use separate yaml files (e.g., data/yaml/tags/language.yaml), or annotate the XML? My inclination is the XML. We could define separate label tags (e.g., <language>) or a more general <label type=language> tag.

I think it's a good rule of thumb to keep everything that clearly attaches to a specific volume or paper in the XML. And +1 for <language> etc., because we'll maintain a clearly defined set of metadata fields, i.e. don't need to account for arbitrary label "types".

akoehn commented 4 years ago

Yes, either make it a tag (<language>deu</language>) or an attribute of the paper: <paper language="deu" ...>

As nobody specified this, I hope we will use ISO 639-3.

mjpost commented 4 years ago

There is an xsd:language constraint, but it allows any ISO 639 variant. But we could adopt the three-letter codes by convention.
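As a rough illustration of why a convention is needed on top of `xsd:language`, the sketch below checks a code against the lexical pattern that XML Schema Part 2 gives for that type and against a stricter three-letter rule. The stricter rule and the name `check_code` are assumptions; the thread had not settled on a convention at this point.

```python
import re

# Lexical pattern of xsd:language as given in XML Schema Part 2.
XSD_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")

# Hypothetical project convention: exactly three lowercase letters.
THREE_LETTER = re.compile(r"[a-z]{3}")


def check_code(code):
    """Return (passes xsd:language, passes the three-letter convention)."""
    return (XSD_LANGUAGE.fullmatch(code) is not None,
            THREE_LETTER.fullmatch(code) is not None)
```

Under these assumptions, `check_code("de-DE")` passes the schema type but fails the convention. Neither check confirms that a string is a registered ISO 639 code; that would need a lookup table.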

knmnyn commented 4 years ago

Hi all:

I think that is a good idea. We need a proper tag that can be repeated, since some papers have sections in different languages (e.g., code-switched examples). We should also have proper guidelines on what needs to be tagged. E.g., are we optimizing for recall or for precision in retrieval?

Cheers,

Min
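One way to picture multiple tagging is a repeated `<language>` element, read back as a list. This is only a sketch: the sample record, its layout, and the helper name `paper_languages` are hypothetical and not the Anthology schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the element layout is an assumption for illustration.
SAMPLE = """<paper id="1">
  <title>An example title</title>
  <language>eng</language>
  <language>zho</language>
</paper>"""


def paper_languages(paper):
    """Collect the text of every <language> child, in document order."""
    return [el.text.strip() for el in paper.findall("language") if el.text]
```

Here `paper_languages(ET.fromstring(SAMPLE))` collects both codes. An attribute-based encoding would instead need some list syntax inside a single attribute value.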


mjpost commented 4 years ago

This is now available with LREC 2020. Please start adding language tags (using the ISO 639-2 language code) as you see fit!

davidweichiang commented 1 year ago

In #2243 @mjpost writes:

Here are all the language codes we've used.

de-DE deu eng fra no-NO pt-BR zho

It seems we have a mix of BCP-47 and ISO 639-2/3, but all of our ISO codes are identical between the two standards.

Do we really need country codes? And should we standardize on 2 vs 3 letters?
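If the answer were to drop country codes and standardize on three letters, the cleanup could look roughly like the sketch below. The mapping table and the function name `normalize` are hypothetical and cover only the codes listed above. Dropping the region also loses distinctions such as Brazilian Portuguese, so this shows one option, not a recommendation.

```python
# Hypothetical mapping from two-letter primary subtags to three-letter codes;
# only the codes reported in this thread are covered.
TWO_TO_THREE = {"de": "deu", "no": "nor", "pt": "por", "en": "eng",
                "fr": "fra", "zh": "zho"}


def normalize(code):
    """Drop any region subtag and return a three-letter code if known."""
    primary = code.split("-")[0].lower()
    if len(primary) == 3:
        return primary
    # Unknown two-letter codes are returned unchanged for manual review.
    return TWO_TO_THREE.get(primary, code)
```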

nschneid commented 1 year ago

UPDATE: Increased estimates with case-insensitive search

Distribution of the <language> tags in current data:

$ egrep -h '<language' *.xml | sort | uniq -c | sort -rn
   1740       <language>fra</language>
   1406       <language>eng</language>
    307       <language>zho</language>
      6       <language>deu</language>
      2       <language>pt-BR</language>
      2       <language>no-NO</language>
      1       <language>de-DE</language>

In some of the older venues, there is a convention of marking non-English titles with "[In LANGUAGE]":

$ egrep -i -h -o '\[in <fixed-case>.*\]' *.xml | sort | uniq -c | sort -rn
    455 [in <fixed-case>F</fixed-case>rench]
    408 [In <fixed-case>C</fixed-case>hinese]
     40 [in <fixed-case>P</fixed-case>ortuguese]
     34 [In <fixed-case>S</fixed-case>wedish]
     33 [In <fixed-case>N</fixed-case>orwegian]
     32 [In <fixed-case>D</fixed-case>anish]
     22 [In <fixed-case>P</fixed-case>ortuguese]
      1 [in <fixed-case>C</fixed-case>hinese]
      1 [In <fixed-case>G</fixed-case>erman]
      1 [In <fixed-case>E</fixed-case>nglish]
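The listing above splits "[in French]" and "[In French]" into separate rows because `uniq` compares the matched text exactly. A minimal Python sketch of the same count that merges the two spellings might look like this; the function name `count_markers` is an assumption, and the pattern is narrower than the shell one because it expects a single `<fixed-case>` letter followed by the rest of the language name.

```python
import re
from collections import Counter

# Matches markers such as "[In <fixed-case>F</fixed-case>rench]".
MARKER = re.compile(r"\[in <fixed-case>(\w)</fixed-case>(\w+)\]", re.IGNORECASE)


def count_markers(xml_text):
    """Count title language markers, ignoring the case of "In"."""
    # Rejoin the fixed-case initial with the rest of the language name.
    return Counter(first + rest for first, rest in MARKER.findall(xml_text))
```

Calling it on the concatenated contents of the XML files should give one row per language rather than one per spelling.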

The files matched above are:

F{12-14}.xml
O{00-18,88,90-99}.xml
W{11-14,17,77,79,81,83,85,87,91,93,99}.xml

Of these two sets, W91.xml appears to be the only overlap.

$ egrep '<language' W91.xml | sort | uniq -c | sort -rn
      2       <language>no-NO</language>
      1       <language>de-DE</language>
$ egrep -i -h -o '\[In <fixed-case>.*\]' W91.xml | sort | uniq -c | sort -rn
      2 [In <fixed-case>N</fixed-case>orwegian]
      1 [In <fixed-case>S</fixed-case>wedish]
      1 [In <fixed-case>G</fixed-case>erman]
      1 [In <fixed-case>D</fixed-case>anish]

Thus, combining the two heuristics and removing the double-counted cases from W91, we get the following estimates:

Language     Est. Count
French       2195
Chinese       716
Portuguese     68
Swedish        34
Norwegian      33
Danish         32
German          7

We do not list English because we assume that an overwhelming majority of the remaining items are in English. (There are 85324 <title> instances—these apply to papers but not volumes, which have <booktitle>.)
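The combination step can be sketched as plain arithmetic over the two listings. The counts are copied from the output above; the grouping of codes under language names and the reading of the W91 overlap (two Norwegian papers and one German paper counted twice) are assumptions. With the counts exactly as listed, Portuguese sums to 64 (2 + 40 + 22), not the 68 shown in the table, so one of the two figures may be stale.

```python
from collections import Counter

# Counts copied from the listings in this comment, keyed by language name.
tag_counts = Counter({"French": 1740, "Chinese": 307, "German": 6 + 1,
                      "Portuguese": 2, "Norwegian": 2})
marker_counts = Counter({"French": 455, "Chinese": 408 + 1,
                         "Portuguese": 40 + 22, "Swedish": 34,
                         "Norwegian": 33, "Danish": 32, "German": 1})
# W91.xml entries assumed to carry both a <language> tag and a title marker.
double_counted = Counter({"Norwegian": 2, "German": 1})

# Add the two heuristics, then remove the papers counted by both.
estimates = tag_counts + marker_counts - double_counted
```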

nschneid commented 1 year ago

Should we remove the "[in French]" designations from the title when adding the language code?
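If the designations were removed, the edit could be scripted along the lines of this sketch. It assumes the marker sits in the `<title>`, that `<language>` is a direct child of `<paper>`, and that a small name-to-code table is acceptable; the table and the function name `move_marker_to_tag` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

MARKER = re.compile(r"\s*\[in <fixed-case>(\w)</fixed-case>(\w+)\]",
                    re.IGNORECASE)
# Hypothetical name-to-code table covering the markers seen in this thread.
CODES = {"french": "fra", "chinese": "zho", "portuguese": "por",
         "swedish": "swe", "norwegian": "nor", "danish": "dan",
         "german": "deu", "english": "eng"}


def move_marker_to_tag(paper):
    """Strip a title marker and record it as a <language> child of paper."""
    title = paper.find("title")
    if title is None:
        return False
    tail = title.tail
    title.tail = None  # keep trailing whitespace out of the serialized title
    raw = ET.tostring(title, encoding="unicode")
    match = MARKER.search(raw)
    name = (match.group(1) + match.group(2)).lower() if match else None
    code = CODES.get(name)
    if code is None:
        title.tail = tail  # nothing recognized: leave the paper untouched
        return False
    # Re-parse the title without the marker and swap it in at the same position.
    new_title = ET.fromstring(MARKER.sub("", raw, count=1))
    new_title.tail = tail
    index = list(paper).index(title)
    paper.remove(title)
    paper.insert(index, new_title)
    # Do not add a second tag if the paper is already annotated.
    if paper.find("language") is None:
        ET.SubElement(paper, "language").text = code
    return True
```

Working on the serialized title keeps any other inline markup intact. Unrecognized language names are left alone so they can be reviewed by hand.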