JerBouma / FinanceDatabase

This is a database of 300.000+ symbols containing Equities, ETFs, Funds, Indices, Currencies, Cryptocurrencies and Money Markets.
https://www.jeroenbouma.com/projects/financedatabase
MIT License

Refactored Equities #31

Closed colin99d closed 1 year ago

colin99d commented 1 year ago

Pandas performance:

%timeit fd.select_equities()
Normal: 2.16 s ± 6.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
OOP: 111 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit fd.select_equities(country="Germany")
Normal: 983 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
OOP: 18.9 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

JerBouma commented 1 year ago

Reminder to self: SQUASH THE THING!

colin99d commented 1 year ago

I went ahead and converted all data to CSV. Here are the results:

 48K    Categories
4.6M    Cryptocurrencies
1.2M    Currencies
 19M    ETFs
687M    Equities
108M    Funds
 36M    Indices
632K    Moneymarkets
1.4M    cryptos.csv
 92K    currencies.csv
 68M    equities.csv
8.3M    etfs.csv
 41M    funds.csv
5.0M    indices.csv
120K    moneymarkets.csv

Looks like the total file size goes from 856.5 MB (the folders) to 123.9 MB (the CSV files).
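The totals can be checked by summing the listing. This is a minimal sketch: the sizes are copied from the listing, while the dictionary names, the `to_megabytes` helper, and the assumption that `K` means 1/1024 of an `M` are mine, not part of the project.

```python
# Sizes copied from the listing in this comment (suffix K or M as printed).
ORIGINAL_FOLDERS = {
    "Categories": "48K", "Cryptocurrencies": "4.6M", "Currencies": "1.2M",
    "ETFs": "19M", "Equities": "687M", "Funds": "108M",
    "Indices": "36M", "Moneymarkets": "632K",
}
CSV_FILES = {
    "cryptos.csv": "1.4M", "currencies.csv": "92K", "equities.csv": "68M",
    "etfs.csv": "8.3M", "funds.csv": "41M", "indices.csv": "5.0M",
    "moneymarkets.csv": "120K",
}


def to_megabytes(size: str) -> float:
    """Convert a size such as '48K' or '4.6M' to megabytes (assumes 1M = 1024K)."""
    value, unit = float(size[:-1]), size[-1]
    return value / 1024 if unit == "K" else value


before = sum(to_megabytes(size) for size in ORIGINAL_FOLDERS.values())
after = sum(to_megabytes(size) for size in CSV_FILES.values())

print(f"{before:.1f} MB -> {after:.1f} MB ({1 - after / before:.1%} smaller)")
```

Rounded to one decimal, the two sums match the 856.5 MB and 123.9 MB quoted in the comment.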

JerBouma commented 1 year ago

Did a little fixy-fix:

>>> import financedatabase as fd
>>> equities = fd.Equities()
>>> equities.options(selection='sector', country='united states')
array(['Healthcare', 'Basic Materials', 'Financial Services',
       'Industrials', 'Consumer Defensive', 'Real Estate',
       'Consumer Cyclical', 'Technology', 'Communication Services', nan,
       'Services', 'Utilities', 'Energy', 'Consumer Goods',
       'Industrial Goods', 'Financial', 'Conglomerates'], dtype=object)
>>> equities.options(selection='sector', country='United States')
array(['Healthcare', 'Basic Materials', 'Financial Services',
       'Industrials', 'Consumer Defensive', 'Real Estate',
       'Consumer Cyclical', 'Technology', 'Communication Services', nan,
       'Services', 'Utilities', 'Energy', 'Consumer Goods',
       'Industrial Goods', 'Financial', 'Conglomerates'], dtype=object)
>>> 

With:

if capitalize:
    country, sector, industry = country.title(), sector.title(), industry.title()

Because most of the items are always capitalized, I wanted to make sure it still works when people do not capitalize their input. I couldn't find a scenario where sector is capitalized but industry or country isn't, so all three are handled by this one check. It is controlled by an argument that people can set to False (True by default).
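For illustration, here is a minimal sketch of how that normalization could be wrapped so that filters left unset do not break the `.title()` call. The function name `normalize_filters` and the `None` defaults are assumptions of mine; only the `capitalize` flag and the three filter names come from the snippet above.

```python
from typing import Optional, Tuple


def normalize_filters(
    country: Optional[str] = None,
    sector: Optional[str] = None,
    industry: Optional[str] = None,
    capitalize: bool = True,
) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Title-case each provided filter; leave unset (None) filters untouched."""
    if not capitalize:
        return country, sector, industry
    country, sector, industry = (
        value.title() if isinstance(value, str) else value
        for value in (country, sector, industry)
    )
    return country, sector, industry


print(normalize_filters(country="united states", sector="financial services"))
```

Note that `str.title()` also lowercases every letter after the first in each word, so an all-caps value such as "REIT" would become "Reit". That is one reason to keep the flag switchable.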

colin99d commented 1 year ago

Looks good to me!

JerBouma commented 1 year ago

One last thing though: we need search to work for multiple queries. Sometimes you might want to delve deeper into your data than a single query allows, e.g. (random example):

>>> equities.search(query="tesla")
           symbol             short_name               long_name  ...  zipcode               website market_cap
127734     TL0.DE    TESLA INC. DL -,001             Tesla, Inc.  ...    94304  http://www.tesla.com   Mega Cap
127736      TL0.F    TESLA INC. DL -,001             Tesla, Inc.  ...    94304  http://www.tesla.com   Mega Cap
129255    TSLA.BA              TESLA INC             Tesla, Inc.  ...    94304  http://www.tesla.com   Mega Cap
129256    TSLA.MI                  TESLA             Tesla, Inc.  ...    94304  http://www.tesla.com   Mega Cap
129257    TSLA.MX              TESLA INC             Tesla, Inc.  ...    94304  http://www.tesla.com   Mega Cap
129258       TSLA            Tesla, Inc.             Tesla, Inc.  ...    94304  http://www.tesla.com   Mega Cap
129259    TSLA.VI              TESLA INC             Tesla, Inc.  ...    94304  http://www.tesla.com   Mega Cap
129260  TSLA34.SA        TESLA INC   DRN             Tesla, Inc.  ...    94304  http://www.tesla.com   Mega Cap
131468      TXLZF  TESLA EXPLORATION LTD  Tesla Exploration Ltd.  ...  T2E 4J7                   NaN   Nano Cap

[9 rows x 15 columns]
>>> equities.search(query="tesla").search("exploration", "long_name"
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jeroenbouma/opt/anaconda3/envs/findata/lib/python3.9/site-packages/pandas/core/generic.py", line 5902, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'search'

colin99d commented 1 year ago

Unfortunately, equities is an instance of the Equities class, while search returns a DataFrame object, so a second .search call cannot be chained onto the result. I think the best way to handle this is to send the search terms as kwargs. For example:

equities.search(summary="tesla", long_name="exploration")

Of course, this is not the only solution. I would love to know your thoughts!
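A rough sketch of what the kwargs-based search could look like is below. It is not the project's implementation: the `Equities` class here is a hypothetical stand-in that wraps a DataFrame, the `case_sensitive` parameter is my own addition, and the three sample rows (including the summaries) are invented toy data.

```python
import pandas as pd


class Equities:
    """Hypothetical stand-in for the class discussed in this thread."""

    def __init__(self, data: pd.DataFrame) -> None:
        self.data = data

    def search(self, case_sensitive: bool = False, **kwargs: str) -> pd.DataFrame:
        """Return rows where every named column contains its search term."""
        mask = pd.Series(True, index=self.data.index)
        for column, term in kwargs.items():
            if column not in self.data.columns:
                raise ValueError(f"Unknown column: {column!r}")
            # Plain substring match; rows with missing values never match.
            mask &= self.data[column].str.contains(
                term, case=case_sensitive, regex=False, na=False
            )
        return self.data[mask]


# Invented toy rows, only to exercise the method.
sample = pd.DataFrame(
    {
        "symbol": ["TSLA", "TXLZF", "AAPL"],
        "long_name": ["Tesla, Inc.", "Tesla Exploration Ltd.", "Apple Inc."],
        "summary": [
            "Tesla makes electric vehicles.",
            "Tesla Exploration is a services company.",
            "Apple makes consumer electronics.",
        ],
    }
)

equities = Equities(sample)
result = equities.search(summary="tesla", long_name="exploration")
print(result["symbol"].tolist())
```

Because all terms are combined into one boolean mask, any number of column filters can be passed in a single call, which avoids the need to chain `.search` on the returned DataFrame.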