ainsleyyoshizumi / CSC475Dialects

CSC475 Seminar Spring Project

Hello, my name is Ainsley Yoshizumi. I am a senior at Furman University working on a program that uses natural language processing to identify different Spanish dialects.

Background

After Mandarin, Spanish is one of the most widely spoken languages in the world. The spread of the Spanish language began with the expansion of imperial Spain. Today, Spanish is spoken in the Americas, Europe, Africa, and Asia. Over time, Spanish has diverged into many different dialects, indigenous and non-indigenous. In Spain, the best-known variety is Castilian, while Catalan, which is often grouped with it, is generally classified as a separate Romance language. In Latin America, there is Mexican Spanish, Argentinian Spanish, Chilean Spanish, and many more. A common misconception is that all Spanish speakers share a single, uniform language or dialect. In reality, Spanish is incredibly diverse, and its many variations show how much remains to be researched and understood about that diversity. Studying Spanish and its morphology can offer new insights and perspectives on the culture and sounds of the language. [1] [2] [3]

Goal

To explore the many different sounds of Spanish and its dialects, I want to create a speech recognition program that can distinguish between a small number of Spanish dialects. In my Spanish linguistics class, we have focused primarily on the differences in sounds between Spanish dialects. The program I am aiming to create could improve understanding across dialects by highlighting subtle differences in pronunciation, syntax, and vocabulary.

Data

I will build my language processor on data from a language-resource repository on GitHub. The repository contains zipped audio files of speech in different languages. The files I am interested in cover Argentinian, Catalonian, Chilean, Colombian, Peruvian, Puerto Rican, and Venezuelan speakers. They are large zip files, published three years ago, that contain short clips of speakers saying various short phrases about the weather. The repository is free and publicly available; its contributing authors created these datasets for research and published them online for anyone to use. [4]
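
As a rough illustration of a first step with this data, the sketch below indexes the audio clips inside each downloaded archive by dialect. It assumes one zip file per dialect and that the clips are stored as `.wav` files; the function name `index_clips`, the dialect labels, and the file paths are hypothetical placeholders, not the repository's actual names.

```python
import zipfile
from collections import defaultdict


def index_clips(archives):
    """Map each dialect label to the WAV clip names found in its zip archive.

    `archives` is a dict of {dialect_label: path_to_zip}. Both the labels and
    the paths are hypothetical placeholders chosen by the caller.
    """
    clips = defaultdict(list)
    for dialect, zip_path in archives.items():
        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():
                # Keep only audio clips; skip transcripts, licenses, and so on.
                if name.lower().endswith(".wav"):
                    clips[dialect].append(name)
        # Sorting makes the index reproducible across runs.
        clips[dialect].sort()
    return dict(clips)
```

Calling `index_clips({"chilean": "chilean.zip"})` with a real path would return a dictionary from the label to a sorted list of clip names, which can then be used to split the clips into training and test sets.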

Methods

I will need to create a program that takes audio input and classifies it as a dialect. I will need to parse the audio and apply semantic analysis, as this will be a key part of the program's ability to differentiate between dialects. I would like my application to be a variant of the Google speech application available online. That application prompts the user to speak and lets them practice pronouncing different words, with Google checking whether each word was pronounced correctly. Its style is very simple and makes for a convenient user experience. I would like to create something visually similar to this feature.
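
To make the "audio in, dialect out" idea concrete, here is a minimal sketch of one possible pipeline. It is a deliberately simple stand-in, not the planned implementation: it assumes each clip has already been decoded into a list of mono samples, summarizes the clip with two basic features (average energy and zero-crossing rate), and assigns the dialect whose average feature vector is closest. A real system would likely need richer acoustic features or a trained model. All function names here are hypothetical.

```python
import math


def frame_features(samples, frame_size=400):
    """Summarize a mono waveform as [mean RMS energy, mean zero-crossing rate]."""
    energies, crossings = [], []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_size))
        # Count sign changes between neighboring samples in the frame.
        flips = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        crossings.append(flips / (frame_size - 1))
    if not energies:
        raise ValueError("clip is shorter than one frame")
    return [sum(energies) / len(energies), sum(crossings) / len(crossings)]


def train_centroids(labeled_features):
    """Average the feature vectors of each dialect into one centroid."""
    sums, counts = {}, {}
    for label, vec in labeled_features:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(vec, centroids):
    """Return the dialect whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))
```

The nearest-centroid rule is used here only because it is easy to read and has no dependencies; the same three-step shape (extract features, fit on labeled clips, predict a label) carries over to stronger classifiers.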

Evaluation

I will measure the success of my program through a mixture of qualitative and quantitative approaches. Quantitatively, I will record the accuracy of my program as a percentage to assess whether a machine can effectively classify different Spanish dialects. Qualitatively, I will reflect on my overall observations and experience of combining technology and linguistics. By the end of the course, my project will ideally contribute to the work of linguists researching the many Spanish dialects and what makes each one different morphologically, phonetically, or grammatically.
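
The quantitative measure could be computed along the following lines. This is a sketch under the assumption that each test clip has one true dialect label and one predicted label; the function names are hypothetical. Alongside the accuracy percentage, it tallies which dialects get confused with which, since that breakdown may be more informative for linguistic purposes than a single number.

```python
from collections import Counter


def accuracy_percent(true_labels, predicted_labels):
    """Share of clips whose predicted dialect matches the true one, in percent."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one labeled clip")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count (true dialect, predicted dialect) pairs to show which dialects get mixed up."""
    return Counter(zip(true_labels, predicted_labels))
```

For example, with true labels `["chilean", "chilean", "peruvian", "peruvian"]` and predictions `["chilean", "peruvian", "peruvian", "peruvian"]`, `accuracy_percent` returns `75.0`, and `confusion_counts` records one Chilean clip misclassified as Peruvian.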

Dissemination

My project deliverables will be the software application and the code written to create it. The software and final code will also be publicly available online. I am very open to the idea of publishing my project on a pre-print server, as one main goal of my project is to contribute to ongoing research in Spanish linguistics. I would like to present my project with a demo on my computer at Furman Engaged, in addition to the course presentation and poster presentation.

References

[1] José Ignacio Hualde. Introducción a la lingüística hispánica. Cambridge Univ. Press, 2011.

[2] Joseph A Wieczorek. Spanish dialects and the foreign language textbook: A sound perspective. Hispania, 74(1):175–181, 1991.

[3] Silvia Martinez. Dialectal variations in Spanish phonology: a literature review. Echo, 6(2):1–8, 2011.

[4] Multiple Contributors. Language-Resource Repository. https://github.com/google/language-resources?tab=readme-ov-file. [Online; accessed 24-January-2023].