Improved deflatten function

NicoRicardi commented 8 months ago

As of now, deflatten assumed that all items are ordered according to the nesting. Different input would lead to several values being discarded. For instance [('p1.id', 'id1'), ('p1.type', 'type1'), ('p2.id', 'id2'), ('p2.type', 'type2'), ('p1.x', 'x1'), ('p2.x', 'x2')] would cause such issue. Furthermore, if a key appears both as nested and not nested (e.g. 'p1', and 'p1.x') the non-nested value would be discarded without providing any warning.

The current version ensures that:

the output of flatten is independent of the order and no value is discarded
a ValueError with an insightful message is raised if a key appears both as nested and not nested

Two tests have been added to check this. Closes #378

codecov-commenter commented 7 months ago

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (02da102) 74.55% compared to head (4205cef) 74.63%.

Files	Patch %	Lines
kgforge/core/conversions/dataframe.py	91.66%	2 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #379 +/- ## ========================================== + Coverage 74.55% 74.63% +0.07% ========================================== Files 101 101 Lines 6344 6367 +23 ========================================== + Hits 4730 4752 +22 - Misses 1614 1615 +1 ``` | [Flag](https://app.codecov.io/gh/BlueBrain/nexus-forge/pull/379/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BlueBrain) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/BlueBrain/nexus-forge/pull/379/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BlueBrain) | `74.63% <95.00%> (+0.07%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BlueBrain#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ssssarah commented 7 months ago

I'm not entirely sure why it is necessary to have nested iterations over items? If I understand correctly, you get the pre, and then when you have one pre, you iterate over everything trying to find everything that is a path starting in pre, but I would think there's a way to do it without such time complexity

ssssarah commented 7 months ago

https://stackoverflow.com/questions/54776916/inverse-of-pandas-json-normalize there could be some inspiration here?

ssssarah commented 7 months ago

I've also not looked at it in enough detail but deflatten method seems to be used with rather "controlled" input -> extracted from a dataframe. So the cases for which there are ValueError could not be given as input. It really depends if we see someone use the function as it is and providing their own items, in which case, yes, you should check possibly faulty input. But here, it looks like it's an internal usage function with controlled input where these errors wouldn't be thrown.

NicoRicardi commented 7 months ago

I've also not looked at it in enough detail but deflatten method seems to be used with rather "controlled" input -> extracted from a dataframe. So the cases for which there are ValueError could not be given as input. It really depends if we see someone use the function as it is and providing their own items, in which case, yes, you should check possibly faulty input. But here, it looks like it's an internal usage function with controlled input where these errors wouldn't be thrown.

There is no check - to the best of my understanding - that one does not have a column 'agent' and a column 'agent.something in their DataFrame'. As such the error makes sense

NicoRicardi commented 7 months ago

I'm not entirely sure why it is necessary to have nested iterations over items? If I understand correctly, you get the pre, and then when you have one pre, you iterate over everything trying to find everything that is a path starting in pre, but I would think there's a way to do it without such time complexity

At least at a first glance, the complexity is the same as the previous implementation. I feel like the problem seems easier at first, but I really do not see an easy way out of such complexity. I will look at the link you shared

ssssarah commented 7 months ago

You have both nested loops and recursivity, the complexity is higher in this implementation

ssssarah commented 7 months ago

You have both nested loops and recursivity, the complexity is higher in this implementation

I missed a bit of code, indeed that was already the case

BlueBrain / nexus-forge

Improved deflatten function #379

Codecov Report