You've done a great job overall. There are many ways to accomplish the same task in pandas (and Python more generally) so it may be informative to test out some of the alternatives below. Please comment below if you have questions about these suggestions.
explore_class_data.ipynb
[21]: Using relative paths is often preferable. That way I could run this notebook even though the Pathlinker-project directory resides in a different place in my file system. os.path is a nice Python module for working with file paths across multiple operating systems.
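As a rough sketch of the relative-path idea, assuming the notebook sits one level below the project root and the data lives in a sibling data directory (the file name class_data.csv is hypothetical):

```python
import os

# Hypothetical layout (an assumption, not taken from the notebook):
#   <project>/notebooks/explore_class_data.ipynb
#   <project>/data/class_data.csv
data_dir = os.path.join("..", "data")
class_file = os.path.join(data_dir, "class_data.csv")  # hypothetical file name

# normpath tidies the separators for the current operating system
print(os.path.normpath(class_file))
```

Because the path is built relative to the notebook, it does not matter where the project directory itself is checked out.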
[21]: You can try the pandas head function to view only the first few rows, which is useful for huge tables.
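For instance, with a toy table standing in for the class data (the column names here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the class table; names and values are invented
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F", "G"],
    "Homework 1": [9, 7, 8, 10, 6, 9, 8],
})

preview = df.head()        # first 5 rows by default
first_three = df.head(3)   # or ask for a specific number of rows
print(preview)
```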
[79]: See filter for another way to do this that is even more general and uses regular expressions: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
[107]: Another way to do this is to define a function like calc_grade and then use pandas apply to apply that to each row. The function could take in an entire row, use the indices to select only the grade-related columns, and then sum them. Or you could take your existing approach but use the pandas column indexing syntax that allows you to select all columns between "Homework 1" and "Project" (inclusive).
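Here is one possible sketch of both options for computing the total grade. Only "Homework 1" and "Project" are column names taken from your notebook; the other columns, the values, and the body of calc_grade are assumptions for illustration:

```python
import pandas as pd

# Toy table; only "Homework 1" and "Project" are real column names
df = pd.DataFrame({
    "Name": ["Ana", "Ben"],
    "Major": ["Biology", "CS"],
    "Homework 1": [8, 9],
    "Homework 2": [7, 10],
    "Project": [18, 20],
})

def calc_grade(row):
    """Sum the grade-related entries of one row (label slices are inclusive)."""
    return row.loc["Homework 1":"Project"].sum()

# Option 1: apply the function to each row
df["Total (apply)"] = df.apply(calc_grade, axis=1)

# Option 2: slice the columns once and sum across each row
df["Total (slice)"] = df.loc[:, "Homework 1":"Project"].sum(axis=1)
print(df)
```

The slice version avoids calling a Python function once per row, so it will likely be faster on large tables, while the apply version is easier to extend if the grade calculation becomes more complicated.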
[108]: Try computing the mean for the Biology, CS, and Dance majors separately. Pandas groupby is one way to accomplish this.
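A minimal groupby sketch, assuming a "Major" column and a numeric column to average (both column names and all values are invented here):

```python
import pandas as pd

# Toy data; column names and values are assumptions
df = pd.DataFrame({
    "Major": ["Biology", "CS", "Dance", "Biology", "CS", "Dance"],
    "Total": [80, 90, 70, 90, 70, 80],
})

# One mean per major, indexed by the major name
mean_by_major = df.groupby("Major")["Total"].mean()
print(mean_by_major)
```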
[128]: What does this look like with fewer bins?
explore_networks.ipynb
[1]: Switch to the relative path ../data
[6-7]: Because you already have the filenames above, you shouldn't need to list them all here manually. See if you can reuse your dictionary or the network_files glob output. It may also be cleaner to have load_network only process a single graph and then iterate over calls to load_network.
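One way this could look, assuming the notebook uses NetworkX and that each file is a whitespace-delimited edge list. The temporary files, their names, and the body of load_network are stand-ins so the sketch runs on its own:

```python
import glob
import os
import tempfile

import networkx as nx  # assumption: the notebook uses NetworkX

def load_network(path):
    """Read a single edge-list file and return one graph."""
    return nx.read_edgelist(path)

# Stand-in data; file names and contents are hypothetical
data_dir = tempfile.mkdtemp()
for name, edges in {"net1": ["A B", "B C"], "net2": ["A D"]}.items():
    with open(os.path.join(data_dir, name + ".txt"), "w") as f:
        f.write("\n".join(edges) + "\n")

# Reuse the glob output instead of listing file names by hand
network_files = sorted(glob.glob(os.path.join(data_dir, "*.txt")))

# Call load_network once per file, keyed by the file name without extension
networks = {
    os.path.splitext(os.path.basename(path))[0]: load_network(path)
    for path in network_files
}
print({name: g.number_of_edges() for name, g in networks.items()})
```

Adding another network file to the directory then requires no changes to the code.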
[8]: As above, I suggest having a function that operates on one graph and then calling that function iteratively.
[11]: Is there an extra node next to D?
[12]: Regarding the graph difference, you can delete that part of the notebook. I was thinking of the symmetric difference operator, but that also requires the graphs have the same node sets. This specific step isn’t critical.
Create a new network: Try using the graph functions to create an empty graph and add nodes and edges.
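A small sketch of what that might look like, assuming NetworkX is the graph library in use (the node names are arbitrary):

```python
import networkx as nx  # assumption: the notebook uses NetworkX

G = nx.Graph()                    # start from an empty undirected graph
G.add_node("A")                   # add a single node
G.add_nodes_from(["B", "C"])      # add several nodes at once
G.add_edge("A", "B")              # add a single edge
# Endpoints that do not exist yet ("D") are added automatically
G.add_edges_from([("B", "C"), ("C", "D")])

print(sorted(G.nodes()), G.number_of_edges())
```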
[53]: See if there is a way to do this without reloading the network for each property. If you load the graphs once, you can write functions that take a graph object as input. This would be important if working with large graphs that are slow to read from disk. There is also a more direct way to get the number of nodes without converting to a path.
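A possible shape for this, again assuming NetworkX. The summarize helper and the two example graphs are hypothetical; in the notebook the dictionary would hold the graphs you already loaded:

```python
import networkx as nx  # assumption: the notebook uses NetworkX

def summarize(graph):
    """Compute several properties from a graph that is already in memory."""
    return {
        "nodes": graph.number_of_nodes(),   # direct node count
        "edges": graph.number_of_edges(),
        "density": nx.density(graph),
    }

# Stand-ins for graphs that were loaded once from disk
networks = {"path": nx.path_graph(4), "triangle": nx.complete_graph(3)}

summaries = {name: summarize(g) for name, g in networks.items()}
print(summaries)
```

Each file is read once, and every property is computed from the same graph object.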
These comments refer to the cell numbers in the following versions of the notebooks: