RichardLitt / low-resource-languages

Resources for conservation, development, and documentation of low resource (human) languages.
Creative Commons Attribution Share Alike 4.0 International

Endangered Languages organization #36

Closed RichardLitt closed 9 years ago

RichardLitt commented 9 years ago

I've been thinking of opening up an organization for open source software to reflect this list.

Instead of only forking projects, it would provide a place for people to upload projects that they don't want to upload to their personal accounts (for instance, in case they don't have one) and for mirroring SourceForge and other non-GitHub hosted code. We could also mirror SourceForge and other code ourselves, if the license on the product allows. It would be something like OpenSourceFieldLinguistics.

What do you think? Does that sound like a good plan?

Possible drawbacks:

RichardLitt commented 9 years ago

Justification

Currently, this repository is a list of links. This has some major drawbacks. First, link rot means it requires a good deal of maintenance. Second, a lack of volunteers means I have to maintain PRs myself. Third, I can delete it, and we'd end up with stale repositories without up-to-date lists if forkers kept their version, unless they updated theirs frequently. I don't like this.

An organization would stop link rot by allowing us to copy in the resources themselves. It would risk stale code, since the copies need maintenance to fetch upstream versions, but, to be honest, this is still better than just a list, because at least the resource isn't gone: we have a copy of it. It would also allow people to actively search the code on-site, and allow us to maintain a copy of resources that might otherwise go down if an institutional website or funding runs out. Finally, I wouldn't be the only maintainer, as I could more easily add people to help fork stuff.

Lingomat commented 9 years ago

Following Lauren's Tweet... this is an interesting issue that I've been thinking about lately. Hugh Paterson made an astute observation, I think:

“Many of these [linguistic] software development teams do not take the approach that potential software users coming to their website want to be oriented to how these software solutions work together to solve specific problems in the language documentation problem space.”

So without wanting to pour cold water on anything, I wonder about the extent of demand for a repository of somewhat orphaned software? Unless, of course, whoever produced the tools or is using the tools (and ideally a combination of both) is able to frame an entry for a tool in these terms.

I'll go a bit further. I think a shorter curated list of software with adequate context explaining how one might use the software is more useful than a long list of software without context. I suppose you could draw a parallel to a language archive with/without metadata :)

RichardLitt commented 9 years ago

You're, of course, spot on, and I've thought of these things before as well. To the first point, I'm not a software development team, and I'm not building tools - the documentation of the software isn't a task I'm taking on. All I'm trying to do is point users towards it. I think that orphaned software is better than no software, though, so I would like to save any software that is available for a specific language or for a specific task, if possible. I would point to the original makers, if they exist, in the repository Readme.

A shorter, curated list isn't the point of this repository, necessarily. What I was looking for, here, was a way to know what is actually out there. Before I started this list, I had no idea that there was paradoxically so much and so little actually available - by creating the list, I know more now than I did before.

The ultimate goal is not the list. It is useful for one of the aims - educating people about what resources are in fact out there, and providing an easily searchable reference to them. The ultimate goal is a resource - a book, website, or something similar - that can be used to take a language from nothing to something, and for broader awareness of OS resources and for better allocation (aka, more extensible) of funds towards language communities. The next step is to split the list up into categories, as I've included as an issue here.

For now, though, some resources might be lost themselves. So, I'm thinking of building an organization. I think that is a separate issue from the points you raise, which I hope I've answered. Make sense?

Lingomat commented 9 years ago

I'm still a little unclear. The ultimate goal seems reasonable but it might help to clarify the audience?

Providing a list (of links to resources) serves the purpose of aiding the discovery of software tools in this space. It's not clear to me what is gained by hosting software, other than creating a degree of confusion as far as versions and potentially obfuscating what is current and what is abandonware. I'd argue link rot is generally a side effect and not the main problem itself.

Is orphaned software really better than no software? Some sorts of tools are more resilient, but there have of course been many arguments made about the dangers of obsolete software. See Bird & Simons 2003 for the language space in particular.

HughP commented 9 years ago

@RichardLitt I'm not sure that your list is really going to take off. People need more than lists - they need relevance. If you are proposing something like homebrew for linguistics software or PRAAT scripts or ELAN scripts, that might actually take off, because you would be creating an engagement point and a marketplace. The ability to add new projects to a repo is something that LinguistList's software list never had. For some, the http://lingtransoft.info/ website serves to pull diverse technology information together. So, I am not sure why you wouldn't work with Doug Higby (site moderator) to be able to quickly add content to the site (perhaps even via gists or GitHub lists). In general, I think the linguistics community needs more than a list; it needs a scientific script and data repo, where if I want to replicate a published phonetics experiment, I could go grab the same scripts used in that experiment and apply them to my own data.

@Lingomat All things considered, it is kinda funny to see my blog post cited... it's great to see that people actually read that stuff :-)

RichardLitt commented 9 years ago

@Lingomat I've been thinking about how best to define my audience, too. I started this because there wasn't a list of open source resources available anywhere. I hadn't thought too far beyond that. I think of it as more of an index than an intelligent list, and I think because of that defining the audience for this list (as opposed to my wider aims) may not be too necessary.

Hosting code serves the purpose of saving code that may otherwise be destroyed if the original resource is taken offline, and serves the purpose of having that code available on GitHub - a non-institutional repository that generally works well for the open source community.

I don't think loss of knowledge is ever useful. I don't see any arguments about obsolete software in Bird and Simons 2003 - I do see arguments against obsolete proprietary formats, but that doesn't mean we should throw out the baby with the software. Some of the original algorithms, contexts, example data, etc., may be useful to someone in the future. We can't know a priori.

@HughP I don't think you're understanding the purpose of the list, or possibly I wasn't clear enough that the organization I'm proposing doesn't necessarily come from the same place as this list. This list is an index of available resources. It is not a meter of relevance, not homebrew, not GitHub. I don't have any valuable metrics in place to know what taking off would even mean, and I don't suspect I will at any point in the future. The original point of this repository was to provide a list of open source technologies to do with language, so that I could know what is out there and highlight for other people that open source is important. If one person has learned something, then it is a success. Perhaps I'm confused as to what you mean by 'take off'.

Something like homebrew for linguistics software, or Praat scripts, or ELAN scripts would be really useful. And it's something I've been debating making for years. The question I am trying to pose in this issue is: would it be useful to mirror code that may be in danger of dying in a GitHub organization? Would it be useful to put that code on GitHub?

Some organizations, such as PLoS, are working right now on better scientific code repositories. I recently attended the Workshop for Sustainable Software in Science (details), where a lot of us talked about just this. I'm not sure trying to fix that problem is within the scope of my abilities at the moment. However, forking and mirroring repositories is.

What do you guys think?

RichardLitt commented 9 years ago

Oh, and thanks so much for commenting here, both of you. :) I really appreciate this discussion!

rtxanson commented 9 years ago

If anyone would like to contribute to a homebrew for language technology stuff, feel free to join in here: https://github.com/rtxanson/language-technology. Take up discussion in issues there though.

cesine commented 9 years ago

I'm in the field right now, so I was kind of in the wrong time zone to chime in on this thread, but basically I think we should all meet in person!

My 2-cents: I'm worried that chatting via text might result in a debate when we all kind of have similar interests, but dissimilar background knowledge and experiences.

Suggestion: Let's get modern! We should all meet virtually in an On Air Google Hangout, it looks like around the time that everyone was commenting is a good time! We can meet-minds better by simply introducing ourselves in a real time round table :) Anyone else who wants to come (I tagged some folks) can join us if they are available in that time slot, or watch the broadcast later when they have time.

Why:

@RichardLitt if you like this suggestion, and you know how to make a google on air hangout, can you post the link to one here and schedule it for when you think is best? or i can probably do it for us if you haven't used it before...

HughP commented 9 years ago

I use hangouts all the time. Just let me know when and where. BTW: I'll be in Hawai'i in a few weeks, if anyone else is going. Presentation on Keyboards and Archiving Lexicons.

cesine commented 9 years ago

@HughP awesome! i read some of your blog today, very cool. i think @jrwdunham will be there too, you guys should go to lunch together and talk about UX! joel is working on the Cleaning, Organizing and Uniting Linguistic Databases digging into data project.

http://icldc4.weebly.com/uploads/2/4/9/6/24963413/icldc4_papers_12_11_14.pdf

GARDEN LANAI ROOM
(3.2.4) LingSync: web-based software for language documentation
Joel Dunham • jrwdunham@gmail.com
The University of British Columbia
Jessica Coon • jessica.coon@mcgill.ca
McGill University
Alan Bale • alancbale@gmail.com
Concordia University
We present LingSync, a suite of open source web-based applications that facilitate collaborative
linguistic fieldwork and language documentation. The features of LingSync were designed by
theoretical linguists and people involved in language revitalization. The result is an exciting tool
that contributes to language documentation, revitalization, and linguistic analysis.
Topic area: Data management
(3.3.6) Assessing the difficulty of the text input task for minority languages
Hugh Paterson • hugh@thejourneyler.org
University of North Dakota
Jon Wilkes • jon.jwilkes@gmail.com
Independent Researcher
How do people in your language type or text? Are the difficulties due to the orthography or are
they due to the text input method? We propose and discuss a framework for analyzing the text
input experience of minority languages.
Topic area: Orthography design

HughP commented 9 years ago

I also have a poster session. @jrwdunham can you PM me... My "dance card" is filling up... :-)

Lingomat commented 9 years ago

Hello again, I'll also be in Hawai'i.

(2.7.2) Towards Language Documentation 2.0: Imagining a crowdsourcing revolution
Mat Bettinson • mat@plothatching.com
University of Melbourne
Crowdsourcing offers the potential to scale documentary activity beyond the confines of ‘expert’
linguistic resources. We argue that Web 2.0-like evolution in language documentation is
necessary and even inevitable. This has deep ramifications for the design of tools and methods
and forces us to re-evaluate a number of key assumptions.
Topic area: Enriching theory, practice, and application
HIBISCUS BALLROOM 2

Really looking forward to Hawai'i. Hangout sounds awesome, I'd love to know more about what you're doing. Lingsync has created a bit of buzz around here. I'm fascinated by OpenSourceFieldLinguistics.org in general. Particularly looping in the different array of stakeholders.

RichardLitt commented 9 years ago

I think having a virtual meeting is a good idea. I'm OK doing a video broadcast. Here's the link: https://plus.google.com/events/c6mrehiut1vt6lbng91esqochq8

Regarding a good date, does 2pm PST on Saturday work for everyone? If not, does anyone want to volunteer to make a doodle poll about it?

HughP commented 9 years ago

I am in for 2PM PST. BTW I am PST in Eugene, OR.

RichardLitt commented 9 years ago

Topics:

Please suggest more if you have some. Would be good to have an attack plan.

HughP commented 9 years ago

cesine commented 9 years ago

sounds good :)

jrwdunham commented 9 years ago

@HughP if you still have some free time in Manoa, let me know. It would be great to chat. My schedule is still pretty open. Let me know what works for you.

HughP commented 9 years ago

@jrwdunham Let's totally do it. Dinner one night might be an option too. But I need a way to contact you. I have my phone, but if you don't have a USA phone #, then that might not be an option... can we talk on the google call?

RichardLitt commented 9 years ago

Had a good chat with @cesine, and decided that we're going to go ahead and make an organization for mirroring dead code. If anyone would like to comment, I'll leave this issue open for a few days. Our discussion can be watched here: https://www.youtube.com/watch?v=WMToYUX_bX0&feature=youtu.be

HughP commented 9 years ago

SOO sorry guys (@cesine and @RichardLitt). Thanks for recording it. I ran 10 miles this morning and just kinda zonked out and really forgot... I was sitting here at my computer... just forgot till 2 hours past. Yes, I did watch the video. I have seen a lot of people refer to these languages without technical resources as Under Resourced Languages, so I would encourage a move that way and away from endangered languages. I am cool with the meta-repo and a new organization. The bot is also a great idea. Though I am not sure how mirroring works on GitHub, it sounds better than forking. So you both are aware, my background is different from both of yours. While I have done a little network admin and a bit of my own sysadmin, these are not my strong suit. I do user experience design and interaction design. My linguistics background is varied, with field linguistics and language documentation in Mexico and Nigeria.

RichardLitt commented 9 years ago

Happens, don't worry about it. Well done running, that's awesome.

Under resourced is good, but I think low resource is better. That's what I've heard used the most. I used endangered languages at first because it is catchy.

Mirroring is essentially the same as forking; you can only fork repositories that are already on GitHub, whereas mirroring is copying code from off-site. We'd be doing both. A bot would help keep code on GitHub from going stale pretty easily, and we could also do one to test off-site.
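To make the mirroring idea concrete, here is a minimal sketch of what such a bot might run, assuming the off-site project is already a Git repository (SourceForge projects kept in SVN or CVS would need converting first). The function names, URLs, and directory layout are hypothetical; only the `git clone --mirror`, `git remote update --prune`, and `git push --mirror` commands are standard Git.

```python
import subprocess
from pathlib import Path


def mirror_commands(upstream_url, target_url, workdir, already_cloned=False):
    """Build the git commands that copy all branches and tags from upstream to target.

    Assumption: target_url points at an empty repository that the bot is
    allowed to push to (for example, one created in the new organization).
    """
    local = str(Path(workdir) / "mirror.git")
    if already_cloned:
        # Refresh an existing mirror and drop refs that were deleted upstream.
        fetch = ["git", "-C", local, "remote", "update", "--prune"]
    else:
        # First run: a bare clone that tracks every upstream ref.
        fetch = ["git", "clone", "--mirror", upstream_url, local]
    push = ["git", "-C", local, "push", "--mirror", target_url]
    return [fetch, push]


def run_mirror(upstream_url, target_url, workdir, already_cloned=False):
    """Run the mirror commands in order, stopping at the first failure."""
    for command in mirror_commands(upstream_url, target_url, workdir, already_cloned):
        subprocess.run(command, check=True)
```

A scheduled job could call `run_mirror` with `already_cloned=True` to keep a copy fresh. Note that `git push --mirror` overwrites the target to match upstream, so it only suits repositories where nobody commits directly to the mirror.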

Exact name suggestions? I think github.com/LowResourceLanguages should work, but I think it's a bit long.

HughP commented 9 years ago

Well @RichardLitt, google each term first, because I get a lot of hits on 'Under resourced languages', including the following:

But I see your point at 475,000 versus 344,000 hits for each term. From my Google hits it seems that CS and governments (DARPA) use the term "low", whereas in language documentation we use the term "under". The question, I guess, comes down to where we draw the line between "low" vs. "high" and "under" vs. "sufficient", and then whether all "low" languages are also "under". And which term communicates to the broader audience? I mean, if we choose one term and we want to market towards CS people, then the CS term makes a lot of sense. However, if we want CS people to meet LangDoc people in the LangDoc circle, then it might not make sense... It really goes back to goals.

RichardLitt commented 9 years ago

I didn't say it was more common, just that I'm more familiar with it. They're basically synonymous. (wow, I totally should have been at that LREC workshop). You'll notice the description for this repository: "A list of resources for conservation, development, and documentation of endangered, minority, and low or under-resourced human languages." That's how I dealt with it, before. I come from a CS and not a LangDoc background, although I think they're not as far apart as people think they are and I've done work in both camps. On another note, Under Resourced Languages has an unfortunate acronym.

So...verdict?

HughP commented 9 years ago

URL very funny...

If we are both on, want a quick chat?

cesine commented 9 years ago

My vote is for low resourced languages, with the tagline which Richard has: "endangered, minority, and low or under-resourced human languages".

i've heard people call them LRLs, so I tried googling that, and found yet another option for you to consider :(

https://www.google.com.vn/search?espv=2&q=LRL+language&oq=LRL+language

you'll have to fish to find out who uses less vs low

These are my hits with a wee-bit of context (keeping in mind my search results are in a country where some internet censoring goes on, eg Google+ and Facebook are blocked, not sure about other things, and i often visit ASR docs)

https://www.google.com/search?espv=2&q=less+resourced+languages

About 472,000 results (0.28 seconds) 
Scholarly articles for less resourced languages
… of RBMT systems for less-resourced languages - ‎de ILARRAZA - Cited by 14
… speech recognition for under-resourced languages: … - ‎Le - Cited by 43
… of Semantic Relations for Less-Resourced Languages - ‎Nikulásdóttir - Cited by 7

https://www.google.com/search?espv=2&q=low+resourced+language

About 420,000 results (0.29 seconds) 
Scholarly articles for low resourced languages
… and restructuring for low-resourced languages - ‎Cui - Cited by 15
… and restructuring for low-resourced languages - ‎Cui - Cited by 10
… adaptation methods for low-resourced languages: … - ‎Cucu - Cited by 7

https://www.google.com/search?espv=2&q=under+resourced+languages

About 430,000 results (0.28 seconds) 
Scholarly articles for under resourced languages
… : Corpus building for under-resourced languages - ‎Scannell - Cited by 63
… speech recognition for under-resourced languages: … - ‎Le - Cited by 43
… of an ASR System for Under-Resourced Languages … - ‎Vu - Cited by 23

renaming repos in GitHub now will create redirects, so you can start with LowResourcedLanguages and switch later as the world and literature evolve... i'm excited for you to create the org so i can help with the bots... and submit repos that i know of for Inuktitut and Ojibwe which need GitHub visibility

HughP commented 9 years ago

I'm down with a clear tagline.

cesine commented 9 years ago

I sent you both a hangout chat to your Google+ accounts to see if we want to say hi now :)

rtxanson commented 9 years ago

I'll try to have a listen to the chat a bit later today. One thing that comes to mind given the name issue, is we could still make an attempt at being googleable under other terms as well, no matter what the name ends up being. I'd also be all for setting up some sort of Github-pages repo so that we can perhaps capture a few more people searching for this kind of stuff with a simple front page, explanation of the rationale.

RichardLitt commented 9 years ago

Yeah, a GitHub pages front makes sense. We can do that.

RichardLitt commented 9 years ago

I've created a repository called LowResourceLanguages. As such, this issue can be closed.

https://github.com/LowResourceLanguages/about

The first issue on that repository involves filling out the README with a description of the organization aims. Those can largely be copied from this issue.

cesine commented 9 years ago

I also added the tagline to the organization.

When i was searching for packages in #63 i couldn't find anything for LRL or low resource language or any combination of key words I could think of, except on GitHub. Part of this is that package manager search engines are usually not very good, but part of it is also that "language" is so ambiguous when working with code repositories that we have little to no chance of finding a human language project. I think we could recommend that other devs who care about low resource languages use the tag "LRL" on their projects, and maybe this would help make their projects more discoverable through traditional package managers. (On a parallel with NLP, which was very easy to find in all package managers and almost always meant Natural Language Processing.)

I couldn't fit LRL in the description, so I put it in the name, which might be confusing.

Now the project page looks like this if you're not signed into GitHub:

[Screenshot, 2015-08-02, 10:45 am]

RichardLitt commented 9 years ago

Thanks @cesine. I would just remove 'human' and add in 'lrl' at the end.

In the future, it might be better not to add comments to closed issues or PRs, as the discussion on closed PRs or issues is assumed to be over. Opening a new issue isn't a problem. :)

RichardLitt commented 9 years ago

Renamed repo. Closing this now.