RWTH-EBC / FiLiP

FIWARE Library for Python (FiLiP) to work with FIWARE API
BSD 3-Clause "New" or "Revised" License
22 stars 13 forks source link

Can the installation of pandas/numpy be optimized/made optional #194

Open windrad6 opened 1 year ago

windrad6 commented 1 year ago

After trying to install FiLiP on a Raspberry PI I ran into issues with pandas/numpy. I checked the code and pandas is mainly used in examples, tests and tutorials. For the actual library, it is only used in models/ngsi_v2/units.py for load_units().

Can we make it optional?

It is one of the main drivers for time consumption during setup and a source of incompatibilities on different operating systems.

djs0109 commented 1 year ago

Hi, I see the pandas and numpy are also being used in timeseries.py. Do you have an idea how to change it in a clean way? Otherwise, a local branch or a fork would be the quick and dirty solution

github-actions[bot] commented 1 year ago

Branch 194-Can-the-installation-of-pandas/numpy-be-optimized/made-optional created!

Maghnie commented 1 year ago

@windrad6 Could you give a bit more info about the Raspberry PI OS and which commands exactly are you using to install filip?

tstorek commented 1 year ago

@windrad6 The TimeSeriesApi strongly relies on pandas as it is designed to parse the data to a format that users can directly trough into theier analysis algorithms. Making pandas optional would have two consequences:

  1. TimeSeries Objects are not convertible to Pandas Dataframes directly. Furthermore, it most likely requires restructuring the code because the imports would be missing in the TimeSeries Models. An inline import in the pandas correlated functions could fix this.
  2. The Code for unit validation would become far more complicated without pandas and require a lot of refactoring to convert it to a basic dict-based code. This should convert the Units to dicts that has either the "Name" or "Code" as keys(). The csv-files in the data folder may prohibit the automatic download via "load_datapackage". Hence, again this would require an inline import in the "load_datapackage" function. I think that is already partially implemented in the get_keys() function ;)

The installation of numpy/pandas on raspberry pi is a known issue and it is a pitty that there is no real way to fix that without waiting for hours. Nevertheless, it is probably not the worst idea to reduce dependencies of the library if not required :)

Maghnie commented 1 year ago

@tstorek Would removing the dependency on pandas become less of a hassle if filip had a separate "lite" installation setup?

Most of the pandas usages are in the examples and tests, which the regular filip user doesn't need.

We would "just" have to figure out how to deal with the usages in Units.

tstorek commented 1 year ago

@Maghnie this is incorrect. Examples and tests are independent from the actual package installation. Removing pandas would partly break the timeseries models as already pointed out. Nevertheless, it is probably worth thinking of making unit validation not as strict as it is implemented right now or as an extension package. The current implementation may change with the migration to pydantic v2 (#199) anyway.

The decision on this is up to the core-developer-team. (@djs0109)

djs0109 commented 1 year ago

@Maghnie I find your idea of creating a "lite" version good. Do you already have an idea of what functionalities are not required? For example, Unit, TimeSeries API, and so on. I am not sure if a "lite" version can be made within the same repo, or should you make a fork? But as for now, I can only promise support in reviewing because our funded projects have not yet encountered such a requirement.

Maghnie commented 1 year ago

this is incorrect.

What is correct then is the practice of making controversial statements to insure a speedy reply, which is confirmed here. :)

Removing pandas would partly break the timeseries models as already pointed out.

I'm not closely familiar with the code but after skimming it, I'm not sure how removing pandas breaks the TimeSeries models, since its to_pandas method seems to be used only in tests. image

Maghnie commented 1 year ago

Do you already have an idea of what functionalities are not required? For example, Unit, TimeSeries API, and so on.

Yes, I would start there too. We could later look into other bulky packages, if needed.

I am not sure if a "lite" version can be made within the same repo, or should you make a fork?

I imagine it can be done in the same repo, but just with different installation instructions.

But as for now, I can only promise support in reviewing because our funded projects have not yet encountered such a requirement.

Sure, I understand. Filip works in I-GReta too, it just takes a lot of time (aka money) for the installation.

tstorek commented 1 year ago

@Maghnie Let me clarify. to_pandas is a functionality provided by filip to the user. This would not be possible without pandas. Hence, removing pandas would break this functionality of the model. But as pointed it out this would need some rafactoring to make it optional.

Examples only use it for demonstration reasons and Tests insure that it works as expected. Usually, when using installation via pip the user does not even regognize their existence.

The Unit model is more complex and would require a little more refactoring because it uses pandas to manage all unit-information. Like a tiny database. Furthermore, the attribute models automatically enforce unit validation and auto-completion if some special keywords are used. Especially, on edge-device level this prohibts falsy unit usage. But apparently to the mentioned use-case it comes with some drawbacks that I was not aware of when I implemented that feature.

Nevertheless, personally, I agree with you that making both functionalities optinal is probably a good idea. But it seems easier as you might expect. That is what I tried to say. Nevertheless, splitting the library into several parts will most likely favor its usabilty in different scenarios.

BTW: I suggest to have a look into optional imports and extra dependencies before starting an additonal repo:


try:
   import pandas as pd
   has_pandas = True
ecxept ImportError():
  has_pandas = False

if has_pandas:
  do something with pandas....
else:
  raise ImportError() 

Furthermore, you need to adjust the setup.py and add extras. Have a look here: https://github.com/RWTH-EBC/ebcpy/blob/master/setup.py

You can find further exmaples online.

Maghnie commented 3 weeks ago

@windrad6 As a workaround for the time-consuming setup, could you try installing Filip with uv?

uv seems to be more efficient than "normal pip" as an installer and package resolver.