dhsvendsen / practical_data_science


Ex. 4.2.1: Where did the last 50 rows go #2

Closed nostango closed 3 years ago

nostango commented 3 years ago

Issue summary

Some rows are missing. I have placed all the relevant code below.

Code and output (if applicable)

import os
import re

import numpy as np
import pandas as pd

def get_alliances(char, faction=None):
    """Return list of alliances for Marvel character.

    Input
    -----
        char : str
            A valid character name of any faction
        faction : str
            Either 'heroes', 'villains', 'ambiguous' or None. If None, the function
            looks through the respective faction folders to figure out which faction
            `char` belongs to. The function is therefore FASTER if `faction` is provided

    Output
    ------
        out : list of strings
            List of alliance names
    """

    # If faction is not provided, figure out which faction it is by looping through
    # folders of character names
    if faction is None:
        for faction in ["heroes", "villains", "ambiguous"]:
            if char + ".txt" in os.listdir("data/%s" % faction):
                break

    # Load character markup
    with open("data/%s/%s.txt" % (faction, char)) as fp:
        markup = fp.read()

    # Get alliance field
    alliances_field = re.findall(r"\| *alliances[\w\W]+?(?=\|.+=|\}\})", markup)
    if alliances_field == []:
        return []

    # Extract teams from alliance field
    return [
        t[2:-1]
        for t in re.findall(r"\[\[.+?[\]\|\#]", alliances_field[0][10:])
        if 'List of' not in t
    ]

# Lists that contain each character, and the list that will store all the teams in the Marvel Universe
heroes = [name for name in os.listdir('data/heroes')]
villains = [name for name in os.listdir('data/villains')]
ambiguous = [name for name in os.listdir('data/ambiguous')]
all_teamsa = []
all_teams = []

# Removes the '.txt' extension from every file name in the list (in place)
def remove_txt(fil):
    for i in range(len(fil)):
        word = fil[0]
        newhero = word.rsplit(".", 1)[0]
        fil.remove(fil[0])
        fil.append(newhero)
    return fil

remove_txt(heroes)
remove_txt(villains)
remove_txt(ambiguous)

# Takes the teams of each character and puts them in the all_teams list
def get_all_teams(charac):
    for l in range(len(charac)):  # loops over the files in the array of characters
        for team in get_alliances(charac[l]):  # gets each team from the get_alliances function
            all_teams.append(team)

# Getting all the teams from the three different categories
get_all_teams(heroes)
get_all_teams(villains)
get_all_teams(ambiguous)

# Takes a character and returns a vector representation of the alliances they are affiliated with
def vector_teams(charac):

    # Creating a vector of zeroes with the length of all_teams
    vector_rep = []
    for i in range(len(all_teams)):
        vector_rep.append(0)

    # Find the indices of the teams that a character belongs to
    teams = get_alliances(charac)
    indices = []
    for te in teams:
        for j in range(len(all_teams)):
            if te == all_teams[j]:
                indices.append(j)

    # Find the alliance positions in vector_rep and add 1
    for k in indices:
        vector_rep[k] += 1

    return vector_rep

# New array that only has heroes and villains
new_all_char = heroes + villains
new_all_char.sort()

for c in new_all_char:
    if len(get_alliances(c)) == 0:
        new_all_char.remove(c)

# Creating the target array
target_arr = []
for i in range(len(new_all_char)):
    target_arr.append(0)

enum = dict((c, i) for i, c in enumerate(new_all_char))

# Fill the target array depending on the character's faction:
# heroes get 1, villains and ambiguous characters stay 0
for c in new_all_char:
    for faction in ["heroes", "villains", "ambiguous"]:
        if c + ".txt" in os.listdir("data/%s" % faction):
            if faction == 'heroes':
                target_arr[enum.get(c)] += 1

d2 = {}

for c in new_all_char:
    d2[c] = vector_teams(c)

# Turn the dictionary holding the 2-D matrix into a 2-D list
dataMatrix = list([d2[i] for i in new_all_char])

X_ta = dataMatrix
y_ta = target_arr

data_teams = pd.DataFrame.from_dict(d2, orient='index', columns=all_teams)
data_teams['faction'] = target_arr
data_teams


dhsvendsen commented 3 years ago

Hi Rudy! When I run your code I end up with a dataframe of size 1013 rows × 2639 columns.

When generating all_teams you add all the teams that each character is a part of. This is correct, but it comes with duplicates, which can be removed by turning the list into a set and then back into a list (because sets have no duplicates):

    all_teams = list(set(all_teams))
    all_teams.sort()

Furthermore, you remove the ambiguous characters, which is good for the next exercise, but it is actually easier to make one big dataframe with everyone (heroes + villains + ambiguous) and then sort them out later. Then, if you remove the people that don't have any alliance:

    have_allies = data_teams.drop(columns=['faction']).sum(axis=1) > 0
    data_teams = data_teams[have_allies]

you should end up with a 957 rows × 503 columns dataframe. Let me know! :)
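The two steps above (deduplicating the team list, then dropping characters without any alliance) can be tried on a small toy example. This is only a sketch: the `alliances` and `factions` dictionaries and the character names in them are made up for illustration and stand in for the output of `get_alliances` on the real data files.

```python
import pandas as pd

# Hypothetical toy data standing in for get_alliances() results.
alliances = {
    "Hero A": ["Avengers", "X-Men"],
    "Hero B": ["Avengers"],
    "Villain C": ["Hydra"],
    "Loner D": [],
}
# Hypothetical target labels: 1 for heroes, 0 otherwise.
factions = {"Hero A": 1, "Hero B": 1, "Villain C": 0, "Loner D": 0}

# Collecting teams character by character yields duplicates ("Avengers" twice).
all_teams = [team for teams in alliances.values() for team in teams]

# Deduplicate via a set, then sort to get a stable column order.
all_teams = sorted(set(all_teams))

# One row per character, one 0/1 column per unique team.
rows = {
    name: [int(team in teams) for team in all_teams]
    for name, teams in alliances.items()
}
data_teams = pd.DataFrame.from_dict(rows, orient="index", columns=all_teams)
data_teams["faction"] = [factions[name] for name in data_teams.index]

# Keep only the characters that belong to at least one team.
have_allies = data_teams.drop(columns=["faction"]).sum(axis=1) > 0
data_teams = data_teams[have_allies]

print(data_teams.shape)
```

Without the deduplication step, each team would get one column per occurrence, which is how the column count can grow far beyond the number of distinct teams.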

nostango commented 3 years ago

Hey Daniel! Thanks for the help, I got 957 rows x 503 columns for my data!

dhsvendsen commented 3 years ago

Perfect! 👍