Project Update #2

brucknerp commented 6 years ago

We finally had our first meeting on Friday afternoon, and went over a lot of the logistics of the project.

My (Paige's) professor suggested several lyric database sites to use, and this site looks good, because it has a handy "time machine" function that lists the top 20 songs. It also lists the singer and songwriter, so if there is a difference in gender there, we can account for that.

First, we're going to pick the top 10 songs (5 male, 5 female, hopefully singer-songwriters) for each sampled year across several decades, up to 2017 (since it's still a bit early for 2018). We would sample every ten years, so if we start at 1980, we would go to 1990, 2000, and 2010, and then 2017 for a very recent look. This gives us around 50 songs, which we could expand if we have extra time.

Linguistically, we will be looking mostly at sentence-final particles, pronoun usage, and perhaps formality levels and Chinese character compound usage.

As for issues, our biggest roadblock right now is copyright law, since these lyrics are by no means ancient works in the public domain. However, looking at the general regulations, it seems that use for educational purposes is fine. Also, the website linked above lets people take the lyrics and post them to their own blogs, so it seems safe. Still, we will keep looking into this.

Next comes marking up these songs, but we are still working on the best method of doing this--any suggestions? Songs are pretty regular in form.

For this week, our goal is to finalize a list of songs, start compiling a list of features to analyze, and start tagging some songs!

JosephDRogers23 commented 6 years ago

Our project also had a lot of issues with copyright, and it turned out that we weren't able to resolve them in the end and had to change our research question. I sincerely hope that you guys can find a way around it! Although it's not a Pitt resource, we found this site to be helpful in decoding copyright law for academic use: http://sites.umuc.edu/library/libhow/copyright.cfm .

Idi0teque commented 6 years ago

For marking up songs, you could go line by line, separating them into choruses, verses, bridges, etc. Otherwise, you could focus specifically on gendered words within the songs, marking those up as you go; as far as Regex is concerned, it's probably easier to just mark them up manually (unless they're like the 15-minute, 130-line song I marked up earlier in the semester). Hope this isn't too useless! Also, I think TEI has specific guidelines for song markup if you're using it.
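A minimal sketch of the line-by-line structural pass, assuming the lyrics are plain text with one blank line between stanzas. The `<lg>` and `<l>` element names follow TEI's line-group and line elements; the `<song>` wrapper and the function name are hypothetical. Deciding which group is a verse, chorus, or bridge would still need a human pass.

```python
import re
from xml.sax.saxutils import escape


def stanzas_to_xml(lyrics):
    """Wrap each stanza in <lg> and each line in <l>.

    Assumes stanzas are separated by blank lines. The <song> root
    element is a placeholder, not a TEI element.
    """
    out = ["<song>"]
    for block in re.split(r"\n\s*\n", lyrics.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        out.append("  <lg>")
        for ln in lines:
            # escape() protects &, < and > so the result stays well-formed
            out.append(f"    <l>{escape(ln)}</l>")
        out.append("  </lg>")
    out.append("</song>")
    return "\n".join(out)


print(stanzas_to_xml("line one\nline two\n\nline three"))
```

The output can then be opened in an XML editor and the line groups labeled by hand.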

enb34 commented 6 years ago

As far as marking up the songs, I would probably do something similar to what @Idi0teque suggested and mark up the structure with Regex, and target the gendered words or particles separately. You may be able to find most of the sentence-ending particles using Regex, though, since the verb endings are fairly easy to identify; however, they can vary, so it may be easier to do this manually. I believe it may have been the Russian Fairytales project that went through a few works manually and was able to mark up most of the rest of the works automatically, so that may be an option for your group as well.
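As a rough sketch of the Regex idea, the pattern below tags a particle only when it sits at the very end of a lyric line, optionally followed by punctuation. The particle list, the `<particle>` element name, and the sample lines are all assumptions for illustration; a real list would come from the project's own feature inventory, and every hit would still need a manual check, since an ordinary word can end in the same kana.

```python
import re

# Hypothetical starter list of sentence-final particles; longer forms
# come first so that かしら is tried before any single-kana particle.
PARTICLES = ["かしら", "わ", "ぜ", "ぞ", "よ", "ね", "な", "さ", "の"]

FINAL_PARTICLE = re.compile(
    "(" + "|".join(re.escape(p) for p in PARTICLES) + ")"
    r"(?=[。！？!?…\s]*$)"  # only punctuation or whitespace may follow
)


def tag_final_particle(line):
    """Wrap a line-final particle candidate in <particle> tags."""
    return FINAL_PARTICLE.sub(r"<particle>\1</particle>", line, count=1)


for sample in ["きれいだわ", "行くぜ！", "愛してる"]:
    print(tag_final_particle(sample))
```

Lines that do not end in a listed particle pass through unchanged, which makes it easy to review only the tagged candidates.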

I am curious about your choice to mark up the Chinese character compounds. Perhaps I'm missing your intent, but it would seem to me that which character compounds are used would be more in the hands of the transcriber than the singer. If your project is focusing on the transcriptions of these popular songs, will you be able to tell for certain if the songs themselves were transcribed by the artist/songwriter, or if they were transcribed by someone else (perhaps an editor of the database listening to a recording of the song)? Even if it's the latter case, it may still be interesting to see if the gender of the singer impacts the way the song was transcribed, but that may be too divergent from your original research question.

mtm80 commented 6 years ago

Using a different alphabet is one thing, but I have to wonder if XML is compatible with Chinese characters, as they are not an alphabet. Will it even be possible to use XML to search through these characters, or will you have to use texts that are transcribed into pinyin?

djbpitt commented 6 years ago

Chinese texts are well represented in XML technologies. See, for example, Marcus Bingenheimer’s Buddhist projects at http://mbingenheimer.net. Closer to home, Angela’s project partner when she was a student in the course worked with Chinese materials in Chinese (Angela was responsible for the Japanese side of the project, which she implemented in English translation). You can visit their site at http://shi-waka.obdurodon.org/.
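To make this concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` parser. XML is defined over Unicode, so kanji can appear directly in element content and be searched without romanization. The `<song>` and `<l>` element names and the two sample lines are invented for illustration, and the character range used covers only the main CJK Unified Ideographs block.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-line document; element names are placeholders.
doc = "<song><l>君の名前を呼ぶ</l><l>あなたに会いたい</l></song>"
root = ET.fromstring(doc)

# Runs of characters in the main CJK Unified Ideographs block
# (U+4E00 to U+9FFF); rarer extension blocks are not covered.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")

runs_per_line = [KANJI_RUN.findall(line.text or "") for line in root.iter("l")]
print(runs_per_line)
```

Each inner list holds the kanji runs found in one `<l>` element, which is a starting point for counting character compounds per line.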

djbpitt commented 6 years ago

With respect to @enb34's mention of the Russian Fairy Tales project, Gabi Kirilloff tagged the verbs of speaking in a few tales manually, and then started building a list of verbs of speaking from those particular verbs (in all of their inflected forms, which she generated). She then used that list to autotag the next few tales (we'll show you how to do that; it's done programmatically, not manually in the find-and-replace dialog), after which she read through them, manually tagged the new verbs, added them to the list, and moved on to process the next few tales. At the beginning she was adding a lot of new verbs on each pass, but that quickly leveled off. It was never fully automated (that is, she always read and edited the output of the autotagging), but it was nonetheless faster than tagging everything by hand, and smarter than trying to imagine all possible verbs of speaking at the beginning.
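The loop described above might be sketched as follows. This is not the code used in the Russian Fairy Tales project; the function name, the `<speech_verb>` element, and the English sample forms are hypothetical, and the sketch works on plain text rather than on existing XML.

```python
import re


def autotag(text, known_forms, tag="speech_verb"):
    """Wrap every whole-word occurrence of a known form in <tag> elements."""
    if not known_forms:
        return text
    # Longest forms first, so a short form never pre-empts a longer one.
    ordered = sorted(known_forms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)


# Pass 1: the list holds only the forms found in the hand-tagged texts.
known = {"said", "says", "asked"}
first_pass = autotag("She said hello and he replied.", known)

# A human reads the output, notices the untagged "replied", and adds it.
known.add("replied")

# Pass 2: the next text is tagged with the larger list.
second_pass = autotag("He replied and then she asked again.", known)
print(first_pass)
print(second_pass)
```

The manual review step between passes is what keeps the list growing; the code only saves the retyping of tags for forms already seen.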

brucknerp / jpn_gen_pop
