RobokopU24 / ORION

Code that parses datasets from various sources and converts them to load graph databases.
MIT License

code improvement for DrugMechDB #223

Open eKathleenCarter opened 5 months ago

eKathleenCarter commented 5 months ago
[drop_duplicates](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) is more efficient for lines 197 and 220.
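As a rough illustration of the `drop_duplicates` suggestion (a minimal sketch only: the DataFrame contents and the choice of `subset` columns are assumptions, since lines 197 and 220 are not reproduced here):

```python
import pandas as pd

# Hypothetical edge table; column names mirror the ones used in the snippet in this issue.
df = pd.DataFrame({
    "source_ids": ["A", "A", "B"],
    "target_ids": ["X", "X", "Y"],
    "predicates": ["treats", "treats", "affects"],
})

# Keep the first occurrence of each (source, target, predicate) triple.
# The subset is an assumption; omit it to compare entire rows instead.
deduped = df.drop_duplicates(
    subset=["source_ids", "target_ids", "predicates"]
).reset_index(drop=True)
```

This does the deduplication in a single vectorized call instead of a hand-written loop or membership check.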

I would suggest replacing lines 198 through 216 with the following:

df.rename(columns={"dmdb_ids": "drugmechdb_path_id",
                   "qualified_predicates": QUALIFIED_PREDICATE,
                   "object_direction_qualifiers": OBJECT_DIRECTION_QUALIFIER,
                   "object_aspect_qualifiers": OBJECT_ASPECT_QUALIFIER},
          inplace=True)
df[KNOWLEDGE_LEVEL] = KNOWLEDGE_ASSERTION
df[AGENT_TYPE] = MANUAL_AGENT

df['edge_props'] = df.apply(
    lambda x: x[[QUALIFIED_PREDICATE, OBJECT_DIRECTION_QUALIFIER, OBJECT_ASPECT_QUALIFIER,
                 KNOWLEDGE_LEVEL, AGENT_TYPE]].dropna().to_dict(),
    axis=1)

for i, row in df.iterrows():
    output_edge = kgxedge(
        subject_id=row["source_ids"],
        object_id=row["target_ids"],
        predicate=row["predicates"],
        edgeprops=row['edge_props'],
        primary_knowledge_source=self.provenance_id
    )
    self.output_file_writer.write_kgx_edge(output_edge)

Because iterrows is EXTREMELY slow and inefficient
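One possible way to avoid `iterrows` entirely is to loop over plain dicts from `to_dict(orient="records")`. This is only a sketch under assumptions: the constant values below are stand-ins for the ones ORION defines, and plain tuples stand in for the `kgxedge` objects so the example runs on its own.

```python
import pandas as pd

# Stand-in values (assumptions); ORION defines its own constants with these names.
QUALIFIED_PREDICATE = "qualified_predicate"
OBJECT_DIRECTION_QUALIFIER = "object_direction_qualifier"
OBJECT_ASPECT_QUALIFIER = "object_aspect_qualifier"
PROP_COLUMNS = [QUALIFIED_PREDICATE, OBJECT_DIRECTION_QUALIFIER, OBJECT_ASPECT_QUALIFIER]

# Hypothetical data: the second row has no qualifiers at all.
df = pd.DataFrame({
    "source_ids": ["A", "B"],
    "target_ids": ["X", "Y"],
    "predicates": ["p1", "p2"],
    QUALIFIED_PREDICATE: ["causes", None],
    OBJECT_DIRECTION_QUALIFIER: ["decreased", None],
    OBJECT_ASPECT_QUALIFIER: ["activity", None],
})

edges = []
# Each record is a plain dict, so no per-row Series is constructed.
for record in df.to_dict(orient="records"):
    # Keep only the qualifier columns that actually have a value.
    edge_props = {k: record[k] for k in PROP_COLUMNS if pd.notna(record[k])}
    # A tuple stands in for the kgxedge object written by the real loader.
    edges.append((record["source_ids"], record["target_ids"],
                  record["predicates"], edge_props))
```

In the real loader, the tuple would be replaced by the `kgxedge(...)` call and the `write_kgx_edge` call shown above.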

_Originally posted by @eKathleenCarter in https://github.com/RobokopU24/ORION/pull/221#discussion_r1588280747_