SeverinJB / the_lads

Project "Scholarly Network Engine" - Examination for "Computational Thinking and Programming" - Second-cycle degree "Digital Humanities and Digital Knowledge" at the University of Bologna

process_citation_data(file_path) #1

Open SeverinJB opened 5 years ago

SeverinJB commented 5 years ago

Functionality:

Note: Data is input for other functions.

SeverinJB commented 5 years ago

RealPython - CSV Files: The linked article introduces CSV handling in Python. It explains the general structure of CSV files, how to open one, and how to store its data in a dictionary. Hopefully it is helpful.
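A minimal sketch of the dictionary approach, using only the standard library. The sample rows and the read_rows helper are made up for illustration; the column names are assumed from the code later in this thread, and the real citations_sample.csv may differ.

```python
import csv
import io

# Made-up sample standing in for citations_sample.csv (assumption).
SAMPLE_CSV = """doi,cited by,known refs
10.1000/example.1,3,10.1000/example.2
10.1000/example.2,7,
"""


def read_rows(csv_file):
    """Return every CSV row as a plain dict keyed by the header row."""
    reader = csv.DictReader(csv_file)
    return [dict(row) for row in reader]


# io.StringIO lets the sketch run without a file on disk.
rows = read_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]["doi"])
```

With a real file, the same helper can be called inside `with open(file_path, newline="", encoding="utf-8") as f:`. Note that csv.DictReader returns every value as a string, so the citation counts would still need converting with int().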

dersuchendee commented 5 years ago

Thanks, it was helpful! However, my code still doesn't work when I try to do these operations inside a function. I'm missing something, maybe because I'm not familiar with classes yet. Doing some research on the internet, I read that it would be better to read the .csv file in a class rather than inside a function, because reading it inside a function is supposedly much less efficient.

Another doubt I have is about self. I found a website explaining it (https://coderanch.com/t/592240/languages/data-class-inheritance-Python), but I'm still not sure its examples apply to our case, so I'm not sure how to use it. I was thinking of "mimicking" the initial code of the ScholarlySearchEngine (e.g. self.data = [dict(x) for x in reader])...

Also, I don't understand the path thing: I tried with the actual path and it didn't work, and I tried with file_name.csv and it still didn't work. Maybe I'm just focusing too much on the class!
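A rough sketch of how self and the file path could fit together, assuming the self.data pattern quoted above. CitationData is a hypothetical name, not the project's actual class, and the sample row is made up. One common reason a bare file name fails is that relative paths are resolved against the current working directory, which is not necessarily the folder containing the script.

```python
import csv
import tempfile
from pathlib import Path


class CitationData:
    """Hypothetical container, used only to illustrate self and paths."""

    def __init__(self, file_path):
        # self is the instance being created; attributes stored on it
        # stay available after __init__ returns.
        self.path = Path(file_path)
        with self.path.open(newline="", encoding="utf-8") as f:
            self.data = [dict(row) for row in csv.DictReader(f)]

    def row_count(self):
        # Methods reach the stored rows through self as well.
        return len(self.data)


# Write a made-up one-row file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "citations_sample.csv"
    sample.write_text(
        "doi,cited by,known refs\n10.1000/example.1,3,\n", encoding="utf-8"
    )
    citations = CitationData(sample)

print(citations.row_count())
```

Printing Path.cwd() just before opening the file shows which folder a relative name such as citations_sample.csv will be looked up in.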

SeverinJB commented 5 years ago

Maybe we should employ Pandas! For our case, Pandas' DataFrames should be useful and easy to handle. The DFs are basically tables which might be more familiar to us.

Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. The DataFrame is one of these structures.

Introduction to Pandas' DataFrames.
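To make the "DataFrames are basically tables" idea concrete, here is a small sketch. The records are made up, and the column names are assumptions taken from the code later in this thread.

```python
import pandas as pd

# Made-up records; each dict becomes one row of the table.
records = [
    {"doi": "10.1000/example.1", "cited by": 3},
    {"doi": "10.1000/example.2", "cited by": 7},
]

# Build the table and use the DOI column as row labels.
df = pd.DataFrame(records).set_index("doi")

# Selecting a column gives a Series; idxmax returns the row label
# (here a DOI) of its largest value.
most_cited = df["cited by"].idxmax()
print(df)
print(most_cited)
```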

delfimpandiani commented 5 years ago

Posting what I got so far here so it's easier when we Skype:


# OUTPUT: pandas dataframe with renamed columns, DOI as index, sorted by citation number in a descending manner

# transform the csv into a pandas data frame, change the column names, specify the index column
import pandas
df = pandas.read_csv('citations_sample.csv',
            index_col='DOI',
            header=0,
            names=['DOI', 'Citation Number', 'Known References'])

# sort the dataframe by citation number (the middle column) in a descending manner
sorted_df = df.sort_values("Citation Number", ascending=False)
print(sorted_df)

# EXTRA: gives us info about our dataframe(dimension, length, column info)
print(sorted_df.shape)
print(len(sorted_df.index))
colinfo = list(sorted_df.columns.values)
print(colinfo)

Output dataframe looks like this:

[Image: dataframe output]

dersuchendee commented 5 years ago

Nice! I'll also write down an 'extra' I found:

reader = pandas.read_csv('citations_sample.csv', index_col='doi', header=0, names=['doi', 'citation Number', 'known refs'], na_values='NaN') 
reader.dropna(how = 'any', subset=['known refs'], inplace = True)

This removes every row that has a NaN value in the 'known refs' column: subset tells dropna which columns to check for NaN. Notice I specified na_values in the reader!
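The effect can be checked on a small in-memory sample. This is only a sketch: the two rows are made up, and the column names are assumed to match the original CSV header. As far as I know, read_csv already treats empty fields and the string NaN as missing by default, so na_values is left out here.

```python
import io

import pandas as pd

# Made-up sample; the second row has no known references.
SAMPLE_CSV = """doi,cited by,known refs
10.1000/example.1,3,10.1000/example.2
10.1000/example.2,7,
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col="doi")

# Without inplace=True, dropna returns a new dataframe and leaves df as it was.
cleaned = df.dropna(subset=["known refs"])
print(len(df), len(cleaned))
```

Returning a new dataframe instead of using inplace=True keeps the unfiltered data around, which may be handy if other functions still need the rows without references.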

delfimpandiani commented 5 years ago

import pandas as pd

def process_citation_data(file_path):
    # read the csv into a dataframe, using the 'doi' column as index
    df = pd.read_csv(file_path, index_col='doi')
    return df

# test with our data
print(process_citation_data('citations_sample.csv'))

SeverinJB commented 5 years ago

Classes explained on Python.org

Creating a new class creates a new type of object, allowing new instances of that type to be made. Each class instance can have attributes attached to it for maintaining its state. Class instances can also have methods (defined by its class) for modifying its state.
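A tiny sketch of what the quoted paragraph describes. CitationCounter is a hypothetical class invented for this illustration and is not part of the project code.

```python
class CitationCounter:
    """Hypothetical class, used only to illustrate attributes and methods."""

    def __init__(self, doi):
        # Attributes attached to the instance hold its state.
        self.doi = doi
        self.cited_by = 0

    def add_citation(self):
        # A method defined by the class modifies the instance's state.
        self.cited_by += 1


# Two instances of the same class keep separate state.
first = CitationCounter("10.1000/example.1")
second = CitationCounter("10.1000/example.2")
first.add_citation()
first.add_citation()
print(first.cited_by, second.cited_by)
```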

delfimpandiani commented 5 years ago

@SeverinJB thank you for the classes link!

Here is the extra code for process_citation_data in case in the future we need to sort or get info on the dataframe:

# EXTRA: gives us info about our dataframe(dimension, length, column info)
# sorted_df = df.sort_values('cited by', ascending=False)
# print(sorted_df)
# print(sorted_df.shape)
# print(len(sorted_df.index))
# colinfo = list(sorted_df.columns.values)
# print(colinfo)
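If the extra lines are needed later, one option is to wrap them in a helper rather than keeping them commented out. The following is only a sketch: summarize_citations is a hypothetical name, the sample rows are made up, and 'cited by' is assumed to be the column name in the real file.

```python
import io

import pandas as pd


def summarize_citations(df, sort_column="cited by"):
    """Sort by citation count (descending) and collect basic info."""
    sorted_df = df.sort_values(sort_column, ascending=False)
    return {
        "sorted": sorted_df,
        "shape": sorted_df.shape,
        "rows": len(sorted_df.index),
        "columns": list(sorted_df.columns),
    }


# Made-up sample so the sketch runs without citations_sample.csv.
sample = io.StringIO(
    "doi,cited by,known refs\n10.1000/example.1,3,\n10.1000/example.2,7,\n"
)
info = summarize_citations(pd.read_csv(sample, index_col="doi"))
print(info["shape"], info["columns"])
```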