Best-Brain-Gang / Crowdfunding_Analysis

This project attempts to help project creators understand how to market their project on Kickstarter vs. Indiegogo. The tool aims not only to advise them on which platform to choose but also to help them market their project efficiently, with supporting data for whichever platform is the better fit.

Discussion issue which columns to remove for all csv files #8

Closed juzcho closed 3 years ago

juzcho commented 3 years ago

#6, #7, #11, #12, #14 are dependent on this one.

n8patterson commented 3 years ago

Adding links for the three datasets we are using so far:

kickstarter_large_dataset folder -> https://www.kaggle.com/kemical/kickstarter-projects

From this one we will use the more recent file (ks-projects-201801.csv).

kickstarter_small_dataset folder -> https://www.kaggle.com/socathie/kickstarter-project-statistics

For this I think we can use both the live and most_backed csv files.

indiegogo csv -> https://www.kaggle.com/quentinmcteer/indiegogo-crowdfunding-data?select=indiegogo.csv

See pull request which imports and cleans the IDs of the csv files #21
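For reference, a minimal sketch of how one of these files could be read in with pandas. The helper name `load_csv` and the header normalization (lowercasing and removing whitespace) are assumptions for illustration, not necessarily what #21 does; the in-memory sample stands in for a real file path.

```python
import io

import pandas as pd


def load_csv(source):
    """Read a CSV and normalize its column names (hypothetical helper)."""
    df = pd.read_csv(source)
    # Lowercase the headers and remove any whitespace inside them,
    # so a header containing a space becomes a single token.
    df.columns = (
        df.columns.str.strip().str.lower().str.replace(r"\s+", "", regex=True)
    )
    return df


# Tiny in-memory stand-in for a real file such as ks-projects-201801.csv.
sample = io.StringIO("ID,name,usd pledged \n1,Test project,10.0\n")
kickstarter_large_df = load_csv(sample)
print(list(kickstarter_large_df.columns))
```

With a real download, `load_csv("path/to/file.csv")` would be called with the path instead of the sample buffer.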

n8patterson commented 3 years ago

Let's focus first on joining together the kickstarter_large_dataset and the indiegogo csv then we can add the kickstarter_small_dataset if we can/need...

Columns for the kickstarter_large_dataset:

'name'
'category'
'main_category'
'currency'
'deadline'
'goal'
'launched'
'pledged'
'state'
'backers'
'country'
'usdpledged'
'usd_pledged_real'
'usd_goal_real'

Columns for the indiegogo csv:

'currency'
'category'
'year_end'
'month_end'
'day_end'
'time_end'
'amount_raised'
'funded_percent'
'in_demand'
'year_launch'
'month_launch'
'day_launch'
'time_launch'
'tagline'
'title'
'url'
'state'
'date_launch'
'date_end'
'amount_raised_usd'
'goal_usd'
'australia'
'canada'
'switzerland'
'denmark'
'western_europe'
'great_britain'
'hong_kong'
'norway'
'sweden'
'singapore'
'united_states'
'education'
'productivity'
'energy_greentech'
'wellness'
'comics'
'fashion_wearables'
'video_games'
'photography'
'tv_shows'
'dance_theater'
'phones_accessories'
'audio'
'film'
'transportation'
'art'
'environment'
'writing_publishing'
'music'
'travel_outdoors'
'health_fitness'
'tabletop_games'
'home'
'local_business'
'food_beverage'
'culture'
'human_rights'
'podcasts_vlogs'
'camera_gear'
'jan'
'feb'
'mar'
'apr'
'may'
'jun'
'jul'
'aug'
'sep'
'oct'
'nov'
'dec'
'tperiod'

n8patterson commented 3 years ago

Column mappings for the kickstarter_large_dataset to indiegogo:

'name' -> 'title'
'category'
'main_category' -> 'category'
'currency' -> 'currency'
'deadline' -> 'date_end', 'time_end'
'goal'
'launched' -> 'date_launch', 'time_launch'
'pledged'
'state' -> 'state'
'backers'
'country' -> 'australia', 'canada', 'switzerland', 'denmark', 'western_europe', 'great_britain', 'hong_kong', 'norway', 'sweden', 'singapore', 'united_states'
'usdpledged' -> 'amount_raised_usd'
'usd_pledged_real' -> 'amount_raised_usd'
'usd_goal_real' -> 'goal_usd'
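
The 'country' mapping above is one column against eleven indicator columns. One possible way to line them up is to collapse the indiegogo indicators into a single 'country' column. This is only a sketch: it assumes each indicator holds 0 or 1 and that at most one is set per row, and `add_country_column` is a hypothetical helper name.

```python
import pandas as pd

# Indicator columns listed for the indiegogo csv.
COUNTRY_COLS = [
    "australia", "canada", "switzerland", "denmark", "western_europe",
    "great_britain", "hong_kong", "norway", "sweden", "singapore",
    "united_states",
]


def add_country_column(df):
    """Collapse the 0/1 country indicators into one 'country' column."""
    out = df.copy()
    flags = out[COUNTRY_COLS]
    # idxmax gives the label of the first column holding the row maximum;
    # rows where no indicator is set are masked to NaN instead.
    out["country"] = flags.idxmax(axis=1).where(flags.sum(axis=1) > 0)
    return out


# Small made-up frame: row 0 is Canada, row 1 is the US, row 2 has no flag.
rows = pd.DataFrame(0, index=range(3), columns=COUNTRY_COLS)
rows.loc[0, "canada"] = 1
rows.loc[1, "united_states"] = 1
with_country = add_country_column(rows)
print(with_country["country"].tolist())
```

The resulting labels would still need to be matched to whatever values the kickstarter 'country' column uses before the two can be compared.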

We can use this info to decide which columns we want to keep from each dataset and then after that clean and rename the columns for each dataset to create similar data frames.
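
A rough sketch of the keep-and-rename step for the kickstarter side, using the mapping above. The dictionary name, the helper name, and the choice of 'usd_pledged_real' over 'usdpledged' for 'amount_raised_usd' are assumptions; 'country' and the time columns are left out here because they need more than a rename.

```python
import pandas as pd

# Kickstarter column -> shared (indiegogo-style) column name.
KICKSTARTER_TO_COMMON = {
    "name": "title",
    "main_category": "category",
    "currency": "currency",
    "deadline": "date_end",
    "launched": "date_launch",
    "state": "state",
    "usd_pledged_real": "amount_raised_usd",
    "usd_goal_real": "goal_usd",
}


def to_common_columns(df, mapping):
    """Keep only the mapped columns, then rename them to the shared names."""
    # Selecting before renaming drops the unmapped sub-level 'category',
    # so renaming 'main_category' cannot create a duplicate column.
    return df[list(mapping)].rename(columns=mapping)


# One made-up row with a few of the kickstarter columns plus extras.
kickstarter_df = pd.DataFrame({
    "name": ["Test project"], "category": ["Poetry"],
    "main_category": ["Publishing"], "currency": ["USD"],
    "deadline": ["2018-02-01"], "goal": [1000.0],
    "launched": ["2018-01-01 12:00:00"], "pledged": [250.0],
    "state": ["failed"], "backers": [5],
    "usd_pledged_real": [250.0], "usd_goal_real": [1000.0],
})
common_df = to_common_columns(kickstarter_df, KICKSTARTER_TO_COMMON)
print(list(common_df.columns))
```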

n8patterson commented 3 years ago

Columns we should keep from the indiegogo csv that we need to calculate or add to the kickstarter_large_dataset:

'funded_percent'
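
A possible way to calculate it for the kickstarter data is amount raised divided by goal. This sketch assumes 'funded_percent' is stored as a ratio (1.0 means fully funded); if the indiegogo column turns out to be on a 0 to 100 scale, the result would need multiplying by 100. `add_funded_percent` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def add_funded_percent(df, raised_col="usd_pledged_real", goal_col="usd_goal_real"):
    """Add 'funded_percent' as raised / goal (assumed to be a ratio)."""
    out = df.copy()
    # Treat a goal of 0 as missing so the division yields NaN, not infinity.
    goal = out[goal_col].replace(0, np.nan)
    out["funded_percent"] = out[raised_col] / goal
    return out


# Made-up rows: half funded, double funded, and a zero goal.
sample_df = pd.DataFrame({
    "usd_pledged_real": [50.0, 200.0, 10.0],
    "usd_goal_real": [100.0, 100.0, 0.0],
})
funded_df = add_funded_percent(sample_df)
print(funded_df["funded_percent"].tolist())
```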

n8patterson commented 3 years ago

See #23, which creates two smaller data frames based on the matching columns above. Next steps:

Review that this code is ok
We need to figure out if we want to merge the kickstarter_small_df into the kickstarter_large_df or keep it separate. My vote is to keep it separate.
Although #23 creates smaller dfs based on matching columns, we still need to clean both data frames. For example, fix and merge all the date columns in the indiegogo df to match the date cols in the kickstarter df.
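
As an illustration of the date cleanup, a sketch that merges a date column and a time column into one datetime column. The string formats in the sample are assumptions (the real indiegogo values may look different), and `combine_date_time` and the output name 'launched' are hypothetical.

```python
import pandas as pd


def combine_date_time(df, date_col, time_col, out_col):
    """Merge separate date and time columns into one datetime column."""
    out = df.copy()
    combined = out[date_col].astype(str) + " " + out[time_col].astype(str)
    # Values that cannot be parsed become NaT instead of raising an error.
    out[out_col] = pd.to_datetime(combined, errors="coerce")
    return out.drop(columns=[date_col, time_col])


# Made-up row using assumed formats for the indiegogo columns.
indiegogo_df = pd.DataFrame({
    "title": ["Test campaign"],
    "date_launch": ["2018-01-05"],
    "time_launch": ["14:30:00"],
})
cleaned_df = combine_date_time(indiegogo_df, "date_launch", "time_launch", "launched")
print(cleaned_df.dtypes)
```

The same helper could be applied to 'date_end' and 'time_end' to produce a column comparable to the kickstarter 'deadline'.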

juzcho commented 3 years ago

Do we just create a whole new csv file with our clean version?

Narwhilian commented 3 years ago

We could but I think it would be easier to just use them as dataframes in the program once we have them read in. We could also create a SQL database with the cleaned csv data as each table, that way if we add more data sources we could run them through the cleaner and add them to the DB and access the data using the same SQL queries. (I might be making it more complicated than it needs to be with the SQL though)
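
A minimal sketch of the SQL idea, assuming SQLite through Python's built-in sqlite3 module and pandas. The table names, the `save_table` helper, and the sample rows are made up for illustration; an in-memory database is used so nothing is written to disk.

```python
import sqlite3

import pandas as pd


def save_table(df, conn, table_name):
    """Write a cleaned data frame to the database as its own table."""
    df.to_sql(table_name, conn, if_exists="replace", index=False)


conn = sqlite3.connect(":memory:")

# Made-up cleaned frames that already share the same column names.
kickstarter_clean = pd.DataFrame({"title": ["KS project"], "goal_usd": [1000.0]})
indiegogo_clean = pd.DataFrame({"title": ["IGG campaign"], "goal_usd": [500.0]})
save_table(kickstarter_clean, conn, "kickstarter")
save_table(indiegogo_clean, conn, "indiegogo")

# The same query shape works for every source once the columns match.
query = """
    SELECT 'kickstarter' AS platform, title, goal_usd FROM kickstarter
    UNION ALL
    SELECT 'indiegogo' AS platform, title, goal_usd FROM indiegogo
    ORDER BY goal_usd
"""
result = pd.read_sql_query(query, conn)
print(result)
```

A new data source would then only need to go through the cleaner and one more `save_table` call.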

juzcho commented 3 years ago

Okay, if that is the case there won't be a need to put them together as one new csv. At most we just specify in the code which columns to remove and then make a new DataFrame.

n8patterson commented 3 years ago

@Narwhilian Exactly, we could make a db and some functions to clean the data that goes in and out of our db. This is the better way to do it, but I think we should do that later as an extra nice-to-have feature. For now I am just reading and cleaning the CSV files like we did in our assignments.

@juzcho Correct, for now we should do as you said. I think we should not put them together as one csv, but maybe inside a db later as @Narwhilian suggested.

n8patterson commented 3 years ago

See #25 where I fixed the date cols in the indiegogo dataset and reordered the columns for both df so they match to make reading quicker.

n8patterson commented 3 years ago

Closed as we are using large_kickstarter and indiegogo datasets as above