Closed ericmjl closed 2 years ago
Wow, that was a flurry of commits. Thanks for that! Will be going through today.
Ah point taken wrt to imports. I was hoping to keep them as optional dependencies to avoid too much dependence on external libraries at the expense of best practice.
For the nesting, I think this is also a very good point. This was mostly done to organise some similar sounding functions (e.g. feature functions that can operate on nodes or graphs). But I think clever naming conventions (& possible multipledispatch) can help to bring some clarity if we keep them in a higher-level namespace.
Actually, optional imports are not a bad idea too! I have a pattern that is used in pyjanitor: https://github.com/ericmjl/pyjanitor/blob/dev/janitor/chemistry.py. Here, the optional dependencies are kept in a try/except block, and then we raise an informative error message that tells the user how to install those dependencies.
As I go through the codebase to write tests, I'm noticing a few architectural issues to the package becoming a pydata-friendly package. I'm going to document here issues/anti-patterns that I found inside the codebase, and propose solutions.
Issue 1: Implicit dependencies
Found
propy
,biovec
,ipymol
,pytorch3d
as imports inside functions.This is a problem because when the functions get called and the dependency isn't explicitly stated as part of
environment.yml
orrequirements.txt
, then code will unexpectedly error out for a user that does not have those packages installed.One possible resolution is to add those packages as explicit dependencies. (Explicit is better than implicit, according to the Zen of Python.)
Issue 2: Highly nested codebase
To access protein amino acid featurization, one has to
import graphein.protein.features.nodes.amino_acids
. That's potentially quite a lot of boilerplate for the end-user.The easiest resolution path is to import those submodules, or even explicit functions, into a higher-level namespace. For example, we might want to follow NetworkX's and PyMC3's pattern, which imports all of the important items into the top-level namespace. Or we might want to follow a pattern where things related to protein are imported into the
protein
namespace.