Open SeverinJB opened 5 years ago
RealPython - CSV Files
The linked article provides an introduction to handling CSV files in Python. The how-to includes a general explanation of the structure of CSV files. It also describes how to open a CSV and how to store the data in a dictionary. Hopefully, it is helpful.
Thanks, it was helpful! However, it still doesn't work in my code when I try to do these operations inside a function. I'm missing something, maybe because I'm not familiar with classes yet. Doing some research on the internet, I read that it would be better to read the .csv file with a class rather than inside a function, because reading it inside a function is much less efficient.
Another doubt I have is about self. I found a website explaining it (https://coderanch.com/t/592240/languages/data-class-inheritance-Python), but I'm still not sure its examples apply to our case, so I'm not sure how to use it. I was thinking of "mimicking" the initial code of the ScholarlySearchEngine (e.g. self.data = [dict(x) for x in reader])...
Also, I don't understand the path thing: I tried the actual path and it didn't work, then I tried file_name.csv and it still didn't work. Maybe I'm just focusing too much on the class!
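Here is a minimal sketch of how the self.data idea could look, in case it helps. The class name CitationData, the method row_count and the two-row sample file are made up for illustration; only the self.data = [dict(x) for x in reader] pattern comes from the ScholarlySearchEngine code mentioned above. The file is read once in __init__ and stored on the instance, so every method can reach it through self.

```python
import csv
import tempfile
from pathlib import Path


class CitationData:
    """Hypothetical class: loads a CSV once and keeps the rows on the instance."""

    def __init__(self, file_path):
        # newline="" is the setting the csv module documentation recommends
        with open(file_path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # self.data belongs to this instance, so other methods can use it
            self.data = [dict(row) for row in reader]

    def row_count(self):
        # Methods receive the instance as self and read its attributes
        return len(self.data)


# Demo with a throwaway file (made-up rows, not the real citations_sample.csv)
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / "citations_sample.csv"
csv_path.write_text("doi,cited by\n10.1000/a,5\n10.1000/b,2\n", encoding="utf-8")

citations = CitationData(csv_path)
print(citations.row_count())
print(citations.data[0]["doi"])
```

About the path: a relative name such as citations_sample.csv is resolved against the current working directory, not against the folder of the script, so one possible cause of the error is running the script from a different folder. Passing a full path, or building one with pathlib as in the demo, avoids that.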
Maybe we should employ Pandas! For our case, Pandas' DataFrames should be useful and easy to handle. The DFs are basically tables which might be more familiar to us.
Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. The DataFrame is one of these structures.
Posting what I got so far here so it's easier when we Skype:
# OUTPUT: pandas dataframe with renamed columns, DOI as index, sorted by citation number in a descending manner
# transform the csv into a pandas data frame, change the column names, specify the index column
import pandas
df = pandas.read_csv('citations_sample.csv',
                     index_col='DOI',
                     header=0,
                     names=['DOI', 'Citation Number', 'Known References'])
# sort the dataframe by citation number (the middle column) in a descending manner
sorted_df = df.sort_values("Citation Number", ascending=False)
print(sorted_df)
# EXTRA: gives us info about our dataframe(dimension, length, column info)
print(sorted_df.shape)
print(len(sorted_df.index))
colinfo = list(sorted_df.columns.values)
print(colinfo)
Output dataframe looks like this:
Nice! I'll also write down an 'extra' I found:
reader = pandas.read_csv('citations_sample.csv', index_col='doi', header=0, names=['doi', 'citation Number', 'known refs'], na_values='NaN')
reader.dropna(how = 'any', subset=['known refs'], inplace = True)
This removes every row that has a NaN value in the 'known refs' column. The subset argument lists the columns where dropna looks for NaN values. Notice I specified na_values in the reader!
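A small self-contained sketch of the same dropna step, so it can be tried without the real file. The three rows below are made up, and the column names 'doi', 'cited by' and 'known refs' are assumptions based on the names used in this thread. An empty field is read as NaN by read_csv by default, so the second row is the one that gets dropped.

```python
import io

import pandas as pd

# Made-up sample data standing in for citations_sample.csv
sample = io.StringIO(
    "doi,cited by,known refs\n"
    "10.1000/a,5,3\n"
    "10.1000/b,2,\n"
    "10.1000/c,9,1\n"
)

df = pd.read_csv(sample, index_col="doi")

# Keep only the rows that have a value in 'known refs';
# without inplace=True, dropna returns a new dataframe and leaves df untouched
cleaned = df.dropna(subset=["known refs"])

print(cleaned)
```

Returning a new dataframe instead of using inplace=True keeps the original data around, which might be handy if other functions still need the full table.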
import pandas as pd
def process_citation_data(file_path):
    # read the csv into a pandas data frame, using the 'doi' column as index
    df = pd.read_csv(file_path, index_col='doi')
    return df

# test with our data
print(process_citation_data('citations_sample.csv'))
Classes explained on Python.org
Creating a new class creates a new type of object, allowing new instances of that type to be made. Each class instance can have attributes attached to it for maintaining its state. Class instances can also have methods (defined by its class) for modifying its state.
@SeverinJB thank you for the classes link!
Here is the extra code for process_citation_data in case in the future we need to sort or get info on the dataframe:
# EXTRA: gives us info about our dataframe(dimension, length, column info)
# sorted_df = df.sort_values('cited by', ascending=False)
# print(sorted_df)
# print(sorted_df.shape)
# print(len(sorted_df.index))
# colinfo = list(sorted_df.columns.values)
# print(colinfo)
Functionality:
Note: Data is input for other functions.