RobokopU24 / ORION

Code that parses datasets from various sources and converts them to load graph databases.
MIT License
12 stars 13 forks source link

Update loadKinAce.py #220

Closed beasleyjonm closed 1 month ago

beasleyjonm commented 6 months ago

Updated qualifier predicates to properly represent phosphorylation events.

EvanDietzMorris commented 6 months ago

Looks OK to me except we want to use the constants where applicable. The KL/AT ones are currently still on another branch, I can fix this one up after the merge. "qualifier_predicate" should be "qualified_predicate" but that will be fixed by switching to the constant too.

Are the primary/secondary sources something that should be represented with the biolink model? I assume they're not just publications..

beasleyjonm commented 6 months ago

The primary and secondary sources are other datasets of detected phosphorylation events, so they are not publications.

eKathleenCarter commented 1 month ago

@EvanDietzMorris please see commit updating the KinAce parser.

EvanDietzMorris commented 1 month ago

This is moving in the right direction, I definitely think it's better to just parse the data file (streaming it) instead of writing another csv and then reading it again. But there are major issues here: 1) Is this the same data? The fields don't seem to match what JMB had.. this one does have Reference (PMID), but not the primary and secondary sources he was parsing? 2) The edge file header enum was changed, but it isn't used and doesn't seem to match the data. The idea is that we use that mapping instead of hard coding indexes like line[7] in the parser. It also seems to start with index 1 instead of 0. 3) The edge properties should be using constants for the keys if they exist ie knowledge_level, primary_knowledge_source.. most of them have constants. 4) The primary knowledge source needs to be a valid infores ID, looks like we need mappings or to pick one for KinAce. 5) We don't want to call get_latest_source_version in the constructor, it should be called by other parts of ORION when/if needed. We have parsers that violate this, but it's for weird edge cases where we need the version to determine file names or something like that. 6) Looks like this source does have PMIDs, we should probably add a publications property, split the list, and add them as PMID:123123 7) This code doesn't work.

eKathleenCarter commented 1 month ago

1) updated the enum to match new data 2) added infores mapping

  1. a) submitted PR to add infores ids for the primary knowledge sources biolink/information-resource-registry#23 3) added functions for more complicated parts of extractor 4) used csv_extractor instead of parse row 5) added PMIDs

I also added logic to the extractor to skip blank lines. Without the if statement, I had errors for index out of range.

Do we want to return null or a blank list on edges without publications, or not include that property?

EvanDietzMorris commented 1 month ago

Looks good.. I made a few changes:

After running this through the whole pipeline I see a lot of edges get merged (12,703).. this would cause properties that are single values but have varying values to be lost/overwritten in merges. I made phosphorylation_sites a list so that the values are preserved.

I removed the change to the extractor, because having a line with all empty fields is not something we should encounter often (ever, except this source apparently), and checking for individual missing values is already built in to the extractor. Instead, I moved the error catching to the KINACE_INFORES_MAPPING lookup in the parser, which was where the KeyError actually occurred.

I finished switching over to constants in parse_data.

I changed the KL_AT_assignments function so that the function and variables do not start with capital letters - functions should never start with capitals and variables should only if they are constants that don't change.