codeforsanjose / city-agenda-scraper


Spanish / Vietnamese translation for city documents #26

krammy19 closed this issue 3 years ago

krammy19 commented 3 years ago

Code for San Jose is applying for grant funding to expand translation services for city meetings. As part of this initiative, they want to see if we can include Spanish and Vietnamese translation in the agenda scraper project.

For this service, we would perform translation on the following documents that will be scraped regularly:

In this initial phase, we need estimates of the cost and time required to add a translation script to our project. Because this grant will be used exclusively for San Jose, the translation step needs to be optional so that the code can still be run for other cities without it.
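One way the "optional" requirement could look is sketched below. This is only an illustration, not the project's actual interface: the `--translate` flag, `build_parser`, `translate_documents`, and the `translate_fn` callback are all hypothetical names, and the real translation backend is left as a pluggable function.

```python
import argparse


def build_parser():
    """Build a CLI parser with an opt-in translation flag (hypothetical)."""
    parser = argparse.ArgumentParser(description="City agenda scraper (illustrative)")
    # Translation is off unless language codes are passed, e.g. --translate es vi
    parser.add_argument(
        "--translate",
        nargs="*",
        default=[],
        metavar="LANG",
        help="language codes to translate scraped documents into",
    )
    return parser


def translate_documents(documents, languages, translate_fn):
    """Return {language: [translated documents]}.

    translate_fn(text, language) is an assumed callback wrapping whatever
    translation service is chosen. With no languages, nothing is translated.
    """
    return {
        lang: [translate_fn(doc, lang) for doc in documents]
        for lang in languages
    }
```

Run for San Jose with `--translate es vi`; other cities can omit the flag and skip translation entirely.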

mkumar10 commented 3 years ago

Translation_Cost_Analysis.pdf After running the analysis for just the City Agendas and Staff Reports, translation comes out free for both languages: the first 500k characters are free, and each subsequent million characters costs $20. I will convert the meeting minutes doc from PDF to text later tonight and add it to the analysis.
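The pricing quoted above can be turned into a rough estimator. This is a sketch only: the 500k free characters and the $20 per million rate come from the comment, while the assumptions that the free allowance is shared across both languages and that each target language is billed on the full source character count are mine. `estimate_cost` is a hypothetical helper, not part of the project.

```python
def estimate_cost(source_chars, num_languages=2,
                  free_chars=500_000, rate_per_million=20.0):
    """Estimate translation cost in dollars.

    Assumes (not confirmed in the thread) that every target language is
    billed on the full source character count and that the free allowance
    is shared across all languages.
    """
    total_chars = source_chars * num_languages
    billable = max(0, total_chars - free_chars)
    return billable / 1_000_000 * rate_per_million


# 200k source characters into two languages stays inside the free tier.
print(estimate_cost(200_000))
# 750k source characters into two languages leaves 1M billable characters.
print(estimate_cost(750_000))
```

The character counts in the two calls are made-up inputs for illustration; the real totals would come from the extracted text files.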

mkumar10 commented 3 years ago

Note that this analysis assumed all text files were extracted from the PDFs properly, or close to perfectly, but that does not seem to be the case. The numbers are therefore not accurate with respect to the PDFs themselves, only with respect to the text files in the sub-folder.