Ideas for collaboration and long-term archiving

smatsushima1 commented 1 year ago

Hello, I would love to help contribute to this and transcribe some talks, but I have some ideas on how to manage each transcription. I'd prefer to discuss this verbally since it may be easier to explain the rationale, so let me know if you are interested in talking over messenger. Here is a short list of items:

save all transcriptions as .txt files first to allow for portability, to be later saved as pdfs, htmls, or whatever; further, textual analysis can be performed on these to identify keywords more succinctly
rename files to "YYYYMMDD_XXXXX_title" so for example: "20180619_00010_no_one_wants_sensuality.txt" - this will allow it to be stored chronologically while maintaining the talk numbers which I believe you stated were inconsistent
keep time stamps and speaker names in each transcription; these can always be removed later but serve as important benchmarks in case the folks at HH want to read more into them and edit appropriately
separate folders for completed and working transcriptions, so that others who may want to join us can easily pick up something from the working folder to work on. I'm in the process of identifying everything that has been transcribed, so I will have a more complete index of all the talks that still require work
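
The naming scheme proposed in the list above ("YYYYMMDD_XXXXX_title") could be generated along these lines. This is only a minimal sketch: the function name `transcript_filename` and the rule for turning a title into a slug (lowercase, non-alphanumeric runs replaced by a single underscore) are assumptions for illustration, not something already agreed on in this thread.

```python
import re
from datetime import date


def transcript_filename(talk_date: date, talk_number: int, title: str, ext: str = "txt") -> str:
    """Build a 'YYYYMMDD_XXXXX_title.ext' filename (hypothetical helper)."""
    # Lowercase the title and collapse every run of characters that are not
    # a-z or 0-9 into a single underscore, trimming underscores at the ends.
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    # The date is written as YYYYMMDD and the talk number is zero-padded to
    # five digits, so a plain alphabetical sort is also chronological.
    return f"{talk_date:%Y%m%d}_{talk_number:05d}_{slug}.{ext}"


if __name__ == "__main__":
    print(transcript_filename(date(2018, 6, 19), 10, "No one wants sensuality"))
    # prints 20180619_00010_no_one_wants_sensuality.txt
```

Putting the date first keeps the files in chronological order even where the talk numbers are inconsistent.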

I've created my own repo and can do a pull request to merge all txt files I've modified if you are interested. Let me know what you think.

BBBalls commented 1 year ago

Hello,

I am happy you are interested in collaborating with transcribing Hillside Hermitage talks.

Unfortunately, my internet access is not conducive to communicating verbally. We will have to coordinate via text exchange.

I am not overly concerned with what extension is appended to the end of the filename, as long as the file is plaintext and uses a plaintext markup language, preferably a common flavor of Markdown. The .md extension just indicates that the plaintext file uses the Markdown markup language. The rationale for this is that a wide variety of text formats can be generated from Markdown using Pandoc. The more I think about this, the more I prefer Markdown; it is just more robust and versatile than plaintext with no formatting information. I am open to entertaining an argument for the advantages of straight plaintext though.

It is trivial to convert a Markdown document into a PDF, html, .docx, or .odt. The YAML header in the Markdown document is the metadata for populating the template fields utilized to generate the various other document formats.

For a demonstration, 20210216 - 197hh - Can a layman be an arahant.md will be used.

Running

pandoc -f markdown -t docx -o '20210216 - 197hh - Can a layman be an arahant.docx' '20210216 - 197hh - Can a layman be an arahant.md'

outputs a fully formatted .docx document: 20210216 - 197hh - Can a layman be an arahant.docx

The above just uses the Pandoc defaults, but Pandoc allows for a great deal of customization.

I don't have a PDF engine installed at the moment, so I cannot provide an example of a PDF. And GitHub doesn't allow uploading an HTML file here, so I cannot provide an example of that either. The basic commands, for example, are

pandoc -f markdown -t pdf -o '20210216 - 197hh - Can a layman be an arahant.pdf' '20210216 - 197hh - Can a layman be an arahant.md'

and

pandoc -f markdown -t html -s -o '20210216 - 197hh - Can a layman be an arahant.html' '20210216 - 197hh - Can a layman be an arahant.md'

Regarding file names: using underscores in place of spaces is a better convention for file naming, and I think it is good to adopt it. The naming scheme of 'yyyymmdd ### title' is already in place.

Regarding timestamps and speaker names: I am not opposed to this idea in principle, and it is relatively easy to do with a tool like Parlatype. This is an idea that might be appropriate to ask Bhante Thaniyo Thero about. My vision of creating transcripts is to help Hillside Hermitage generate summarized transcriptions and the books of talk collections, not to create a public-facing transcription archive. The additional work of adding and then removing timestamps is only interesting to me if it is helpful to Hillside Hermitage.

Regarding separate folders: this may work as a way to reduce potential redundancies of effort.
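
The Pandoc conversions described earlier in this comment could be scripted roughly as below. This is a sketch under stated assumptions: the helper names `pandoc_command` and `convert` are made up for illustration, and instead of passing `-t` explicitly the sketch lets Pandoc infer the output format from the output file's extension. Running `convert` requires Pandoc to be installed, plus a PDF engine for PDF output.

```python
import subprocess
from pathlib import Path


def pandoc_command(source: Path, target_ext: str) -> list[str]:
    """Return the pandoc argument list converting a Markdown file to target_ext."""
    # Same file name, different extension; pandoc picks the output format
    # from this extension because no -t option is given.
    output = source.with_suffix(f".{target_ext}")
    command = ["pandoc", "-f", "markdown", "-o", str(output), str(source)]
    if target_ext == "html":
        # -s (standalone) produces a complete HTML page instead of a fragment.
        command.insert(3, "-s")
    return command


def convert(source: Path, target_ext: str) -> None:
    """Run pandoc for one file; raises CalledProcessError if pandoc fails."""
    subprocess.run(pandoc_command(source, target_ext), check=True)


if __name__ == "__main__":
    talk = Path("20210216 - 197hh - Can a layman be an arahant.md")
    for ext in ("docx", "html", "pdf"):
        # Print the commands rather than running them, so this works without pandoc.
        print(pandoc_command(talk, ext))
```

Passing the arguments as a list avoids any shell quoting problems with the spaces in the current file names.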

I have been toying with the idea of using a Google Sheet as a mechanism to volunteer to transcribe a talk, and to provide links to the audio and the raw generated transcript. It would allow for easily identifying what is already being worked on, and minimize the need to manually move files around.

smatsushima1 commented 1 year ago

So I just spoke with Bhante Thaniyo, and he would prefer to have transcribed talks either sent to him as .docx files in Telegram or placed in a Google folder shared with him. I think what you made here is amazing, but as far as sending him the transcripts, I may end up just creating a Google folder. I'll include a Word doc with a table, or a spreadsheet, that lists all the talks that have already been transcribed. This can have separate columns to denote who is currently working on what, in the event we get more folks on board with this.

Also, I asked about the timestamps because I use Otter to do most of the heavy lifting. I found the timestamps to be the most efficient way to transcribe in general, since I can easily go back to an area that requires more attention and transcribe away. Otter covers around 80% of the material, so it is a time saver. The timestamps can also help the HH folks in determining where something is said.

Thanks for the collaboration with this.

BBBalls commented 1 year ago

Yeah, Bhante Thaniyo's preference for .docx is partly why I am partial to Markdown. Working with plaintext for producing transcripts and collaborating on GitHub is much easier, and using Markdown allows for easy conversion to .docx without needing to do additional editing. I have shared several transcripts with Bhante Thaniyo over Telegram in the .docx format. Given the preference for .docx, Google Docs / Google Drive may be the more appropriate collaboration tool.

I definitely think it will be good to have some sort of mechanism to prevent redundant efforts, regardless of our respective workflows. Here is a draft Google Sheet that I made for the purpose of helping keep track of who is doing what. I populated it with the table you generated a couple of days ago. Additional fields that I think might be nice to add are a link to a machine-generated transcript and a link to an audio/video source.

What was Bhante Thaniyo's response to the inclusion of time stamps?

smatsushima1 commented 1 year ago

I like the spreadsheet; I think that can work out great. Bhante was okay with timestamps being in there. Also, he gave me a few priorities for talks in my previous chat with him. I can update the priority column to reflect these. Where did you want to save the transcriptions? When I edit these in Otter, I can export them as .docx, so I'm not sure anymore if I want to worry about converting the files if they are to be saved as .docx. Worst case scenario, I save a transcript as .txt, copy and paste it into a Word document, then just give this to Bhante. Did you want to save all our transcriptions on your GitHub page?

BBBalls commented 1 year ago

I reconciled your records of completed talks with my own on the Google Sheet. There were seven talks that I accounted for that are not on your list. They are indicated by "diff" in the transcriber column. I also added the talks transcribed at https://dhamma.stream/transcripts/, which are denoted by "dhamma stream" in the transcriber column.
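
A reconciliation like the one described above amounts to a set difference between the two records. The sketch below is illustrative only: the function name `mark_missing` and the use of integer talk IDs are assumptions, and the actual reconciliation was done in the Google Sheet itself.

```python
def mark_missing(my_completed: set[int], your_completed: set[int]) -> dict[int, str]:
    """Map each talk ID found only in my_completed to the marker 'diff'."""
    # Set difference: IDs that I recorded as completed but that are absent
    # from the other list. Sorting keeps the result in talk-number order.
    only_mine = sorted(my_completed - your_completed)
    return {talk_id: "diff" for talk_id in only_mine}
```

The returned mapping corresponds to what would be written into the transcriber column for those talks.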

Right now, the permissions on the Google Sheet only allow others to edit the transcriber and date columns. We should communicate through a different channel to establish permissions.

I think the best thing to do for submitting new transcripts is to make a Google Drive folder, because it is something Bhante Thaniyo has stated he is comfortable using, and it means I won't have to transfer files around. Sharing folders and setting permissions in Google Drive is easy, at least with a small group of contributors.

Regarding the Google Sheet, it would be nice to have a column for published dates, so that talks that are not episodes (e.g. the Nanavira series) can be accounted for in a manner that can be sorted. You seem to have some skill in programming; would it be a difficult task to generate a list of published dates in the spreadsheet you created?

smatsushima1 commented 1 year ago

I thought about including the published date as well as a YouTube link for each talk, but upon looking at how to do that, I put it aside to be tackled later. I can give it a shot. I'll have to go into the YouTube API and search for the HH channel. Sounds like fun - I can work on it now. If you want, we can communicate on Facebook Messenger or Telegram, since Bhante uses Telegram anyway.

smatsushima1 commented 1 year ago

I just scraped the publication date and YouTube link from the YouTube API for all of HH's videos and saved an updated spreadsheet in my repo. See attached. Did you want me to set up the shared Google Drive folder, or possibly add it to Bhante's recordings folder? I guess it can go either way - I'd like to make his life a bit easier, but I'm not sure if piggy-backing on his folder would be too burdensome. Let me know what you think.

results.xlsx
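
For anyone wanting to reproduce this kind of listing, the step of turning API results into spreadsheet rows might look roughly like the sketch below. It is not the script used to produce the attached file. It assumes the items come from the YouTube Data API search endpoint, where each video result carries `id.videoId`, `snippet.title`, and `snippet.publishedAt`; the function name `rows_from_search_items` is made up, and fetching the items (API key, paging) is left out.

```python
def rows_from_search_items(items: list[dict]) -> list[tuple[str, str, str]]:
    """Turn search result items into sorted (published date, title, link) rows."""
    rows = []
    for item in items:
        video_id = item.get("id", {}).get("videoId")
        if video_id is None:
            # Search results can also be channels or playlists; skip those.
            continue
        snippet = item["snippet"]
        # publishedAt is an ISO 8601 timestamp; keep only the YYYY-MM-DD part.
        published = snippet["publishedAt"][:10]
        rows.append((published, snippet["title"], f"https://youtu.be/{video_id}"))
    # ISO dates sort chronologically as plain strings.
    return sorted(rows)
```

The rows can then be written out with any spreadsheet or CSV library.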

Also, Bhante stated that these are the priority talks to transcribe:

Guided contemplation on Anger: https://youtu.be/Bw7DVHH_kNY ID: 80

The right starvation: https://youtu.be/d8p-a1X4ISs ID: 145

Difference between ariya and putthujana: https://youtu.be/gkwYOroXqK0 ID: 158

I'd like to transcribe the latest talk about how to enter solitude first. Afterwards, I'll get started on these once we can get the permissions settled.

BBBalls commented 1 year ago

I apologize. I misread your comment, bungling 'later' as 'latter', so I thought you were going to work on scraping links. I will read more carefully in the future.

Thanks for scraping the pub dates. I remember finding mistakes in my records, so what you gathered is a better data set.

I added the priority tag to the indicated talks. I have "Difference between ariya and putthujana" as ID: 161 though.

"Guided contemplation on Anger" ID: 80 has been transcribed already; it is the chapter "Contemplation of Anger" in Dhamma Within Reach. It might be good to get clarification on this.

00315_when_to_go_into_seclusion is now reserved under your GitHub handle. I also recorded that you completed 00207_the_danger_contemplation.

I have set up a Google Drive folder that can be the collaboration folder. I was thinking of sharing the folder with Bhante Thaniyo, so he can place it wherever works for him inside his Google Drive; he might not want the transcripts available to everyone before he has a chance to edit them.

smatsushima1 commented 1 year ago

I noticed the same thing about the "Guided contemplation on Anger"; my notes have it as recorded in the book as well. I surmised that it wasn't 100% transcribed, but I'll have to verify it. I've sent you an email now.

BBBalls / hillside_hermitage_archive

Ideas for collaboration and long-term archiving #6