akrinaldi / LatinX-metadata

Keyword list expansion #4

Open akrinaldi opened 2 years ago

akrinaldi commented 2 years ago

Adding individual country names increased the number of results from 302 to 1422. It appears our original keyword list was too narrow to pick up the data noted in the ~400 original articles pulled by the team.

Additionally, some of the original articles do not match any keyword at all.
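
One way to see which articles fall through is to list the titles that match no keyword. This is a minimal sketch, assuming matching is a case-insensitive substring test on the article title. The function name `unmatched_titles` and the sample titles are hypothetical and are not taken from the team's CSV.

```python
def unmatched_titles(titles, keywords):
    """Return the titles that contain none of the keywords (case-insensitive substring test)."""
    lowered = [kw.lower() for kw in keywords]
    return [t for t in titles if not any(kw in t.lower() for kw in lowered)]


# Hypothetical sample data, for illustration only.
sample_titles = [
    "Library services for Mexican American students",
    "Outreach to Spanish-speaking patrons",
]
sample_keywords = ["Mexican American", "Latinx", "Mexico"]

print(unmatched_titles(sample_titles, sample_keywords))
# ['Outreach to Spanish-speaking patrons']
```

Titles returned by a check like this are candidates for new keywords.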

akrinaldi commented 2 years ago

Keyword list (with country names added)

updated_kw_list = ['Argentinian American', 'Belizean American', 'Chicano American', 'Latino American', 'Latine', 'Bolivian American', 'Boricuas', 'Brazilian American', 'Chilean Americans', 'Colombian American', 'Costa Rican American', 'Costarisences', 'Cuban American', 'Dominican American', 'Ecuadorian American', 'Afro-Hispanic', 'Afro-Latino', 'Guatemalan American', 'Hispanic American', 'Hispanos', 'Honduran American', 'Mejicano', 'Mexican American', 'Nicaraguan American', 'Panamanean American', 'Paraguayan American', 'Peruvian American', 'Puerto Rican American', 'Salvadoran American', 'Tejano', 'Uruguayan American', 'Venezuelan American', 'Argentinian', 'Belizean', 'Chicanos', 'Latin American', 'Chicanas', 'Bolivians', 'Chicanx', 'Brazilians', 'Chileans', 'Colombian', 'Costa Rican', 'Latino', 'Cuban', 'Dominican', 'Ecuadorian', 'Latina', 'Afro-Latina', 'Guatemalan', 'Hispanic', 'Latinx', 'Honduran', 'Mexicano', 'Mexicans', 'Nicaraguan', 'Panamanean', 'Paraguayan', 'Peruvian', 'Puerto Rican', 'Salvadoran', 'Texano', 'Uruguayan', 'Venezuelan', 'Argentinos', 'Belizeanos', 'Bolivianos', 'Brasileños', 'Chilenos', 'Colombianos', 'Costarricences', 'Cubanos', 'Dominicanos', 'Ecuatorianos', 'Guatemaltecos', 'Mexican Americans', 'Hondureños', 'Nicaraguenses', 'Panameños', 'Paraguayos', 'Peruanos', 'Puertorriqueños', 'Salvadoreños', 'Uruguayos', 'Venezolanos', 'latinx', 'latina', 'latino', 'latine', 'hispanic', 'hispanos', 'Argentinian', 'Belizean', 'Chicano', 'Latino', 'Mexican', 'Nicaraguan', 'Panamanean', 'Paraguayan', 'Peruvian', 'Puerto Rican', 'Salvadoran', 'Uruguayan', 'Venezuelan', 'Honduran', 'Belize', 'Costa Rica', 'El Salvador', 'Guatemala', 'Honduras', 'Mexico', 'Nicaragua', 'Panama', 'Argentina', 'Bolivia', 'Brazil', 'Chile', 'Columbia', 'Ecuador', 'French Guiana', 'Guyana', 'Paraguay', 'Peru', 'Suriname', 'Uruguay', 'Venezuela', 'Cuba', 'Dominican Republic', 'Haiti', 'Guadeloupe', 'Martinique', 'Puerto Rico', 'Saint-Barthélemy', 'Saint-Martin', 'Saint Barthélemy', 'Saint Martin', 'Saint-Barthelemy', 'Saint Barthelemy', 'Brasilenos', 'Hondurenos', 'Panamenos', 'Puertorriquenos', 'Salvadorenos', 'Dominica', 'Latin', 'latin']

akrinaldi commented 2 years ago

All-uppercase and all-lowercase variants of each keyword were added. Spanish words related to libraries were also added, as they appeared in many titles from the 400 articles in the original CSV.

new_words = ['South America','libro','información','bibliotecari','biblioteca','en linea','diario','periódico','publicación','obra de consulta','investigación','computerizdo','revista','diccionário','diccionario','circulación','circulacion','bibliográfia','bibliografia','biografía','biografia','catálogo','catalogo','enciclopedía','enciclopedia','computora','novela','impresión','impresion','catálogo','catalogo','latin-x','Latin-x','Latin-X']

# Add the library-related Spanish terms to the keyword list.
updated_kw_list.extend(new_words)

# Build all-uppercase and all-lowercase variants of every keyword.
all_caps = [kw.upper() for kw in updated_kw_list]
all_low = [kw.lower() for kw in updated_kw_list]

# Append each variant that is not already in the list.
for variant in all_caps + all_low:
    if variant not in updated_kw_list:
        updated_kw_list.append(variant)
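
The upper- and lower-case variants of a keyword do not cover mixed-case spellings: 'Latinx' yields 'LATINX' and 'latinx', but not 'LatinX'. If the search step allows it, folding case on both sides avoids enumerating variants. This is a minimal sketch. `matches_keyword` is a hypothetical helper, and how the actual search compares strings is not shown in this thread.

```python
def matches_keyword(text, keywords):
    """Return True if any keyword occurs in text, ignoring case."""
    folded_text = text.casefold()
    return any(kw.casefold() in folded_text for kw in keywords)


# Case-insensitive comparison finds the mixed-case spelling...
print(matches_keyword("Serving the LatinX community", ["Latinx"]))  # True
# ...while a plain case-sensitive substring test does not.
print("Latinx" in "Serving the LatinX community")                   # False
```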

akrinaldi commented 2 years ago

Other variations were added with accented characters and ñ replaced by their unaccented equivalents, to account for possible cataloguing errors.
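
The unaccented variants could also be generated rather than typed by hand. This is a minimal sketch using the standard-library `unicodedata` module. `strip_accents` is a hypothetical helper name, and the thread does not say how the variants in the list were actually produced.

```python
import unicodedata


def strip_accents(word):
    """Return word with combining diacritics removed (e.g. ñ -> n, é -> e)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


accented = ["Brasileños", "Saint-Barthélemy", "catálogo"]
print([strip_accents(w) for w in accented])
# ['Brasilenos', 'Saint-Barthelemy', 'catalogo']
```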