RichardLitt / low-resource-languages

Resources for conservation, development, and documentation of low resource (human) languages.
Creative Commons Attribution Share Alike 4.0 International

Publication #63

Closed RichardLitt closed 8 years ago

RichardLitt commented 9 years ago

I think there should be a white paper for this repository, and I think it should be sent to LREC. It should summarize the topics we discussed in #36, mainly.

Contributors could be counted as co-authors.

Thoughts?

HughP commented 9 years ago

I like.... I'll contribute when I get back to the mainland March 7th.

RichardLitt commented 9 years ago

Sweet. It may actually be easiest to collaborate if I simply start a draft and we can go from there.

eddersko commented 9 years ago

I read through #36, and watched the YouTube video so I think I have a general sense of the topics. My experience comes from within the domain of language revitalization - that is, creating language learning tools for community members, so my input will be focused around that. I’ll try and add things that I think haven’t been mentioned yet.

The points brought up were some of the reasons we decided to make our tools open-source. There are a lot of related projects, but there seems to be a tendency to reinvent the wheel. For example, you’ll find all sorts of talking dictionaries and language apps – many may have similar goals in mind, but they may use different approaches to achieve them. Not only do these approaches rarely become authoritative, but they may be incompatible with other projects simply because they were human or computer language specific, or accommodated different needs of the end users. This, therefore, hinders collaboration.

There needs to be a community that is willing to discuss these issues, and promote open-source software. Admittedly optimistic, I think this organization could also be one way in which we can foster such a community. It could encourage resource-design principles and best practices, which are currently lacking (in the creation of language revitalization tools for the community), via discussions on GitHub. Note that I’m not referring to the storage of language data, but rather the manipulation of the data for viewing in an accessible way. Technology is constantly changing and sooner or later, these resources will become outdated. The new tools however invite new methods, and our practices should reflect that. Again, this is optimistic, but it doesn’t hurt to mention it.

In regards to the audience, there are the developers who can turn to this list in order to obtain all the information they need for suitable implementation. For example, they may ask: are there relevant pieces of code I could reuse? How did other projects solve issues that I’m currently facing (e.g. diacritics)? Then, there are the documentarians who may be interested in which tools serve the purpose of their projects since they are the ones who understand the needs of the language community. In fact, during the ICLDC, someone working on an Austronesian language asked if I knew of any other open-source resources besides the SIL tools; naturally, I brought them to this list. Lastly, the community members may themselves be energized to create their own resources, guided by the Breath of Life workshops, archived linguistic materials, or through other means. The list provides a way to display what other people have done in their revitalization projects.

Nothing I’ve mentioned is new – some of the points were mentioned in (Good 2011), (Bird & Simmons 2003) and http://www.acsu.buffalo.edu/~jcgood/WG1-Tools.pdf, but this list is what others in a position similar to mine, in my opinion, would find especially useful.

There are other tools like spell-checkers and machine translation I didn’t mention that would also promote revitalization of a language, simply because I don’t yet have experience with them. However, I’m sure computational linguists working with low-resource languages would find this list useful one way or another.

RichardLitt commented 9 years ago

Thanks for the great perspective and write up. I would like to work more with people coming from your perspective - community builders and linguists working at ground zero, with speakers. I strongly agree with your thoughts about the wheel being recreated, and I would love to help foster discussion about the proper way to go about things. One of the issues we would have to overcome for that to be possible is the incredible amount of fragmentation in the language revitalization community - something which naturally occurs as each language is different and each community is different. I am not sure how to approach bridging that - to a certain extent, lists like ILAT do this, although they tend to focus on anything remotely technical being used with languages, and also tend to be more of a news resource (which means lower involvement with stakeholders) than a discussion list. If we could create a resource for developers, that would be great. One of the ways I was hoping to bring up possibly actionable items was to promote this list as a paper - I think a lot of the thoughts you gave above would be great as a section there.

I'm incredibly busy this week, but I'll try to have a draft started soon.

RichardLitt commented 8 years ago

@eddersko @HughP @cesine @lingomat I've let this sleep for a while while I focused on other things. However, on July 31st the abstract submissions for the LSA meeting are due. Are any of you LSA members who could submit or would anyone be interested in working on an abstract with me for this?

cesine commented 8 years ago

Hi @RichardLitt, I'd love to, but time is getting short. Do you want to open a shared Google doc so we can work in real time?


Here is my request to https://education.github.com/contact

Subject: Anonymous percentage or count of *.edu domain names among users?

How can we help? Hi GitHubers,

I'm helping @RichardLitt prepare an abstract about his GitHub org for the LSA 2016. RichardLitt/endangered-languages#63

One of the claims we should justify is why we chose GitHub.

We have been collecting blog entries and press releases (see details below) but we are hoping GitHub might have more up-to-date "publicable" data which we can cite since we have the impression that education.github.com has been growing. We were wondering if you could publish a blog post about the growth of academics on GitHub. One metric we thought which hopefully would not violate privacy policies and might help us:

Using .edu. email domains is most likely a conservative estimate:

So we figure this would result in a citable conservative measure of number of labs/universities using GitHub, or number of users who have an academic background, which speaks to our target audience being already on GitHub, or soon-to-be on GitHub if we are :)

Cesine

More details: We have found a few articles which we can cite to back up our claim, but we would like to go to the source. One point of contention from one reviewer for my ACL 2014 paper was roughly that "open development" wasn't new. In a similar vein, @HughP wants us to address "What is new about collaborative as opposed to sourceforge which has been open for years?"

Instead of listing features, we think the numbers will probably be able to speak for themselves, especially if we can find trends in user adoption or in the types of users who use these features. We will probably try to use the API to see what we can see in there, but a blog entry from GitHub would be authoritative and might be able to discuss anonymous data which we can't analyze through the public API.

One metric we used for academic adoption in our app was university email domains; we found 49 universities had registered for our app, which was impressive. Would it be possible for GitHub Education to publish a blog post or other press release with a metric such as this without violating privacy policies?
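For illustration, the university-domain metric described above could be computed along these lines. This is a minimal sketch, not the code used for the app: the function name `edu_institutions`, the sample addresses, and the choice to collapse subdomains to the last two labels are all assumptions. Counting only `.edu` addresses misses universities outside that domain, so the result stays a conservative estimate.

```python
from collections import Counter


def edu_institutions(emails):
    """Count email addresses per .edu institution (hypothetical helper).

    Subdomains are collapsed to the last two labels, so
    "cs.mit.edu" and "mit.edu" are treated as one institution.
    """
    institutions = Counter()
    for email in emails:
        # Keep only the part after the last "@", normalised to lowercase.
        _, sep, domain = email.strip().lower().rpartition("@")
        labels = domain.split(".")
        # Skip malformed addresses and anything outside the .edu domain.
        if not sep or len(labels) < 2 or labels[-1] != "edu":
            continue
        institutions[".".join(labels[-2:])] += 1
    return institutions


# Made-up sample addresses, for illustration only.
sample = [
    "a@mit.edu",
    "b@cs.mit.edu",
    "c@buffalo.edu",
    "d@gmail.com",
    "not-an-email",
    "E@Buffalo.EDU ",
]

counts = edu_institutions(sample)
print(len(counts), "institutions:", dict(counts))
```

Only the number of distinct institutions (`len(counts)`) would need to be reported, which keeps individual addresses out of any published figure.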

Debating feature sets is a hard game to win. My gut says that GitHub is simply leagues ahead of SourceForge in terms of collaborative features, and in its software engineering approaches for meeting the user stories of open source teams. I've used both: 12 years on SourceForge with 0 contributions to other projects, and 5 years on GitHub with contributions to ~50 other projects. I know part of this is my own schedule, but I know part of it is that I was able to grow as a contributor because of GitHub's collaborative features when I switched to GitHub. I have noticed other anecdotal adoption among my academic friends, but no matter how many GitHub features we list, those who have never used those GitHub features can't understand why those specific features are game changing.

Our current citable content:

As we work on this, we will post our other references on our GitHub for you and/or others to find: https://github.com/RichardLitt/endangered-languages/issues/63


Citable References:

The Emergence of GitHub as a Collaborative Platform for Education. Alexey Zagalsky, Joseph Feliciano, Margaret-Anne Storey, Yiyun Zhao, Weiliang Wang. 18th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW), 2015, ACM. http://alexeyza.com/publications/

The (R)Evolution of Social Media in Software Engineering. Margaret-Anne Storey, Leif Singer, Brendan Cleary, Fernando Figueira Filho, Alexey Zagalsky. In Future of Software Engineering (FOSE) track of ICSE, 2014, ACM. http://alexeyza.com/pdf/fose14.pdf

Other:

http://google-opensource.blogspot.ca/2015/03/farewell-to-google-code.html
https://researchremix.wordpress.com
http://programmers.stackexchange.com/questions/164618/why-are-many-programmers-moving-their-code-to-github
http://www.inferencelab.com/free-github-private-repos-for-academics/
http://blogs.lse.ac.uk/impactofsocialsciences/2013/06/04/github-for-academics/
http://www.javaworld.com/article/2157321/open-source-tools/github-rolls-out-the-red-carpet-for-scientists.html
http://marciovm.com/i-want-a-github-of-science/
http://acl2014.org/acl2014/W14-22/W14-22-2014.pdf
http://ben.balter.com/2014/10/08/open-source-licensing-for-government-attorneys/

https://twitter.com/GithubEducation
https://github.com/blog/1992-eight-lessons-learned-hacking-on-github-pages-for-six-months 600,000 GitHub projects use a GitHub page for hosting (good argument that hosting the org and having its results on its GitHub page is a good idea)
https://github.com/blog/1950-github-town-hall-open-source-and-academia “how institutions must adapt to support software tools and their producers, and what universities can learn from the open source community”
https://github.com/blog/1840-improving-github-for-science “collaborators from academia, business, and various research labs”
https://github.com/blog/253-using-git-for-research
https://github.com/blog/1864-third-annual-github-data-challenge

http://readwrite.com/2014/09/08/learn-to-code-digital-literacy GitHub reaches out to promote “coding as the new literacy” hosted last Thursday in San Francisco by social coding site GitHub, panelists discussed the importance of teaching students how to work with technology, and the important difference between learning to code and learning about code.

HughP commented 8 years ago

@RichardLitt @cesine I am an LSA Member. Where is the link?

RichardLitt commented 8 years ago

Cool. I will seed a shared Google doc later today.

Here is the email I got:

I am writing with a reminder that abstracts for posters and 20-minute papers for the LSA's 2016 Annual Meeting must be submitted no later than July 31, 2015. Technical support will not be available after 5:00 PM US Eastern Time on July 31. If you plan to submit an abstract, please renew your LSA membership and make sure that you have all the necessary information at hand. To renew your LSA membership, log in to the LSA website and go to http://www.linguisticsociety.org/user/renew.

To submit an abstract for the 2016 Annual Meeting, log in to the LSA website, then go to http://www.linguisticsociety.org/abstract/add or click "Submit Abstract" on your member profile page. Nonmembers of the LSA can join here; discounts are available for students and underemployed members. Full details on the abstract submission process and guidelines are available here.

The 2016 Annual Meeting will be held from January 7-10 at the Marriott Marquis in Washington, DC. See the Annual Meeting page for more information about plenary speakers, hotel reservations, and meeting registration rates.

We look forward to receiving your abstracts. Please contact us at lsa@lsadc.org or by phone at 202-835-1714 if you need assistance logging in to the LSA website, renewing your membership, or submitting your abstract.

HughP commented 8 years ago

@RichardLitt @cesine I put some paragraphs and a comment in the doc...

cesine commented 8 years ago

@ymanyakina do you have some time to help Richard with this, given your role with the Language Conservancy (and previous experience as an open source intern who spent considerable time hunting for resources)?

Here are my todos which I'm working on before I can help with some text in the abstract

RichardLitt commented 8 years ago

@cesine Thank you! This is all incredibly helpful.

I'm traveling today from SFO to Boston, and I haven't had much time at all over the past couple of days as I have been at a conference. I'm boarding a plane now, but I'll have a layover at 6:30 for an hour, so I'll work on the flight and then update the google drive when I land.

HughP commented 8 years ago

@cesine Thank you for that data. It is awesome... Would it be useful to move that into a Google spreadsheet or CSV? For the list of developers, I think we could query the iTunes store and find all the different developers for language-based apps.

I think I am going to need to submit this abstract before @RichardLitt lands. @cesine do you want to give it a look over before I submit it?

RichardLitt commented 8 years ago

Well, that is submitted! HughP, do you have the final version somewhere? Perhaps we should add it here so that we can begin filling out a larger paper?

HughP commented 8 years ago

@RichardLitt the final version is in the Google doc. You'll see the versions -- I updated version 3 to be the submitted version. But perhaps we should move it from a GoogleDoc to a git repo? this way we can track changes? I mean Google does also track changes... On second thought maybe keeping it as a Gdoc is a good idea. I had to say that this material was not previously presented or published... whatever that means these days.

RichardLitt commented 8 years ago

I don't think having it in a git repo counts as published. Let's keep it where it is until we hear back, and then work on it some more, perhaps?

cesine commented 8 years ago

Hi guys, I stumbled on some cool research involving the use of GitHub for collaboration (I got an invite to respond to a survey about Android stack traces because I reported some stack traces on GitHub; also a cool project (paper) (data) using GitHub for open research)

http://www.gousios.gr/blog/How-do-project-owners-use-pull-requests-on-Github

Classic forms of code contributions to collaborative projects include change sets sent to development mailing lists or issue tracking systems and direct access to the version control system. More recently however, a big portion of open source development happens on GitHub. One of the main reasons for this is the fact that contributing to a GitHub project is a relatively pain-free experience.

What motivates contributors? The main motivation for contributing to a project is its usage. This usage can be a dependency from another project the contributor is developing or fixing an end user bug. Altruistic motives (still) play a role: 33% of the respondents mentioned that they want to devote their time to a good cause. Developers also contribute for natural interest and personal development reasons, for example to sharpen their programming skills or for the intellectually stimulating fun of it. Finally, approximately 35% of the respondents related contributions to career development, ranging from enriching their profile to attracting new customers.

I think Richard is already using some of Gousios' tips and tricks from his answer to "many colleagues asked me how I managed to get so many responses." but outside repetition = bonus http://www.gousios.gr/blog/Scaling-qualitative-research/

RichardLitt commented 8 years ago

That's an interesting blog post. Sounds about right to me, based on my time on GitHub. I'm not sure what answer you mean, but yeah, cool. :+1: Do you think we should put this somewhere so that people who are not familiar with GitHub models of PRs could have some reading material on it?

RichardLitt commented 8 years ago

PLoS might also be a good resource. Cf. this database from Simon Greenhill.

RichardLitt commented 8 years ago

Currently at LREC presenting the paper!

RichardLitt commented 8 years ago

So, we published. I linked. I talked. I think that this issue can be closed.

It would be worthwhile to reopen, resubmit, and publish elsewhere, if the list changes significantly. Not sure it will at the moment.