alvarobartt / investpy

Financial Data Extraction from Investing.com with Python
https://investpy.readthedocs.io/
MIT License

Always include "instrument type" and "exchange" to solve problem with duplicates #154

Open markus080402 opened 4 years ago

markus080402 commented 4 years ago

Hi @alvarobartt ,

When fetching stocks with, for example, investpy.stocks.get_stocks(country=None), we get many duplicates of the 3-tuple ('symbol', 'country', 'isin').

The reason is that the same 3-tuple can be listed on multiple exchanges. For example, symbol 55O1, country germany and isin US03761U5020 exists on three different exchanges: Berlin, Stuttgart and Frankfurt.

Hence, investpy.stocks.get_stocks(country=None) returns three identical rows for the stock symbol 55O1, since neither the 'instrument type' nor the 'exchange' is included in the response:

https://www.investing.com/equities/apollo-investment?cid=6526

image

One way to solve this is to always include the instrument type ("stock" in the screenshot above) and the exchange, which in this case is Berlin, Frankfurt or Stuttgart.

This way I think we will always have a unique 5-tuple:

  1. symbol
  2. country
  3. isin
  4. instrument type
  5. exchange

The problem today with the 3-tuple

  1. symbol
  2. country
  3. isin

can be seen in the following example:

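>>> import investpy
>>> # Load every stock investpy knows about, across all countries
>>> all_stocks = investpy.stocks.get_stocks(country=None)
>>> len(all_stocks)
39952
>>> # Keep only the three identifying columns, in the order shown in the output
>>> df = all_stocks[['symbol', 'country', 'isin']]
>>> df_sorted = df.sort_values(['symbol', 'country', 'isin'])

>>> # Rows that repeat an earlier (symbol, country, isin) combination
>>> df_sorted[df_sorted.duplicated(keep='first')]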
      symbol  country          isin
10053   55O1  germany  US03761U5020
23398   63MO    india  INE111B01023
26187   6724    japan  JP3414750004
10608    6LA  germany  US5128161099
26937   7203    japan  JP3633400001
...      ...      ...           ...
23653   YESB    india  INE528G01027
23654    ZEE    india  INE256A01028
23993   ZENT    india  INE520A01027
24351   ZURI    india  INE217A01012
23655   ZYDS    india  INE768C01010

[776 rows x 3 columns]

>>> duplicates = df_sorted[df_sorted.duplicated(keep=False)]
>>> duplicates
      symbol  country          isin
9824    55O1  germany  US03761U5020
10053   55O1  germany  US03761U5020
22951   63MO    india  INE111B01023
23398   63MO    india  INE111B01023
25543   6724    japan  JP3414750004
...      ...      ...           ...
23993   ZENT    india  INE520A01027
23278   ZURI    india  INE217A01012
24351   ZURI    india  INE217A01012
23279   ZYDS    india  INE768C01010
23655   ZYDS    india  INE768C01010

[1547 rows x 3 columns]

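Hence, out of the 39952 stocks, 1547 rows share their 3-tuple with at least one other row, and 776 of those rows are redundant copies. That leaves 1547 - 776 = 771 distinct 3-tuples that occur more than once.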
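
The relationship between the two counts may be easier to see on a small table. The sketch below uses hand-written toy rows (not actual investpy output) in which one 3-tuple appears three times, one appears twice and one appears once. It only relies on pandas' DataFrame.duplicated with keep='first' and keep=False, as in the session above.

```python
import pandas as pd

# Toy rows, not real investpy output: one key listed three times,
# one listed twice, and one listed once.
rows = [
    ("55O1", "germany", "US03761U5020"),
    ("55O1", "germany", "US03761U5020"),
    ("55O1", "germany", "US03761U5020"),
    ("7203", "japan", "JP3633400001"),
    ("7203", "japan", "JP3633400001"),
    ("ZEE", "india", "INE256A01028"),
]
df = pd.DataFrame(rows, columns=["symbol", "country", "isin"])

# Rows that repeat an earlier row (the redundant copies).
redundant = int(df.duplicated(keep="first").sum())
# Every row whose key occurs more than once, first occurrence included.
involved = int(df.duplicated(keep=False).sum())
# Each repeated key contributes exactly one first occurrence,
# so the difference counts the distinct repeated keys.
repeated_keys = involved - redundant

print(redundant, involved, repeated_keys)  # prints: 3 5 2
```

Dropping the redundant rows would leave one row per key, which is the same as len(df) - redundant.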

However, if we include "instrument type" and "exchange", then I expect all 39952 stocks to have unique 5-tuples, which would solve the duplicate problem.
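
As a rough illustration of the idea, the sketch below builds a toy table for the 55O1 example with two extra columns. The column names instrument_type and exchange are assumptions made for this example; according to this issue, get_stocks() does not return them. The rows are hand-written, not investpy output.

```python
import pandas as pd

# Toy rows for the 55O1 example. "instrument_type" and "exchange" are
# assumed column names, not columns that investpy is known to provide.
df = pd.DataFrame(
    [
        ("55O1", "germany", "US03761U5020", "stock", "Berlin"),
        ("55O1", "germany", "US03761U5020", "stock", "Stuttgart"),
        ("55O1", "germany", "US03761U5020", "stock", "Frankfurt"),
    ],
    columns=["symbol", "country", "isin", "instrument_type", "exchange"],
)

key3 = ["symbol", "country", "isin"]
key5 = key3 + ["instrument_type", "exchange"]

# Count redundant rows under each candidate key.
dupes_3 = int(df.duplicated(subset=key3, keep="first").sum())
dupes_5 = int(df.duplicated(subset=key5, keep="first").sum())

print(dupes_3, dupes_5)  # prints: 2 0
```

Whether the 5-tuple is unique for the full data set would still need to be checked against the real data once those columns are available.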

Does this sound feasible?

Regards, Markus

markus080402 commented 4 years ago

Hi @alvarobartt

There is also the S/N, which could be included in the response from investpy.stocks.get_stocks(country=None):

image

Hence, instead of the 5-tuple (as described above), we would have a 6-tuple that uniquely identifies a stock:

  1. symbol
  2. country
  3. isin
  4. instrument type
  5. exchange
  6. S/N

Regards, Markus

markus080402 commented 4 years ago

Hi @alvarobartt ,

Just a heads-up that companies sometimes change their name and symbol. I think they are also required to change the isin in the process, but I am not sure whether they can reuse the old one.

I think all the following "get" functions could be amended to use the 5-tuple to uniquely identify a stock, both as input parameters and in the output. It could be good to support fewer than the five fields as input, for example only the stock symbol, but then the output may contain multiple matches, so each returned stock would need to be identified by its full 5-tuple.

investpy.stocks.get_stocks()
investpy.get_stock_recent_data()
investpy.get_stock_historical_data()
investpy.get_stock_company_profile()
investpy.get_stock_financial_summary()
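
The partial-key lookup described above could be sketched roughly as follows. StockKey, LISTINGS and find_stocks are hypothetical names invented for this example and are not part of investpy; the listings are toy data based on the 55O1 example in this thread.

```python
from typing import List, NamedTuple


class StockKey(NamedTuple):
    """Hypothetical container for the proposed 5-tuple."""
    symbol: str
    country: str
    isin: str
    instrument_type: str
    exchange: str


# Toy listings based on the 55O1 example; not read from investpy.
LISTINGS = [
    StockKey("55O1", "germany", "US03761U5020", "stock", "Berlin"),
    StockKey("55O1", "germany", "US03761U5020", "stock", "Stuttgart"),
    StockKey("55O1", "germany", "US03761U5020", "stock", "Frankfurt"),
]


def find_stocks(**criteria: str) -> List[StockKey]:
    """Return every listing whose fields match all given criteria."""
    unknown = set(criteria) - set(StockKey._fields)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return [
        key
        for key in LISTINGS
        if all(getattr(key, field) == value for field, value in criteria.items())
    ]


# A partial key may match several listings, each returned with its full key.
matches = find_stocks(symbol="55O1")
# Adding more fields narrows the result down to a single listing.
unique = find_stocks(symbol="55O1", exchange="Berlin")
```

Returning the full key for every match means a caller can feed any single result back in as an unambiguous input.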

Thanks and regards, Markus

alvarobartt commented 4 years ago

Hi @markus080402, thank you for the information! The stock_exchange parameter is already available when retrieving ETF data. It is optional: if you pass the name and country of an ETF that is listed on more than one stock exchange, the function returns the data from the default/main stock exchange and displays a warning to let you know that the information can also be retrieved from other stock exchanges.

I am planning to add the same feature to all the functions whose data may differ for the same financial product (such as stocks, as mentioned above). Combined with the input_type parameter, this will let you specify, for example, that you are passing the ISIN instead of the symbol.
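
A minimal sketch of that "default exchange plus warning" behaviour is shown below. It is not investpy's implementation: resolve_exchange and ETF_EXCHANGES are hypothetical, the data are placeholders, and treating the first list entry as the default exchange is an assumption made only for this example.

```python
import warnings

# Placeholder data: exchanges per (name, country). The first entry is
# treated as the default/main exchange, which is an assumption.
ETF_EXCHANGES = {
    ("example etf", "germany"): ["Exchange A", "Exchange B", "Exchange C"],
}


def resolve_exchange(name, country, stock_exchange=None):
    """Pick the exchange to query, warning when the choice is implicit."""
    exchanges = ETF_EXCHANGES[(name.lower(), country.lower())]
    if stock_exchange is not None:
        if stock_exchange not in exchanges:
            raise ValueError(f"{stock_exchange!r} is not available for {name!r}")
        return stock_exchange
    if len(exchanges) > 1:
        # Fall back to the default, but tell the user there are alternatives.
        warnings.warn(
            f"{name!r} is listed on {len(exchanges)} exchanges; "
            f"using default {exchanges[0]!r}. Others: {exchanges[1:]}",
            stacklevel=2,
        )
    return exchanges[0]


default = resolve_exchange("Example ETF", "Germany")  # emits a warning
explicit = resolve_exchange("Example ETF", "Germany", stock_exchange="Exchange B")
```

With this shape, existing calls that omit stock_exchange keep working, and the warning is the only signal that the result was chosen among several listings.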

Thank you! Stay tuned!

markus080402 commented 4 years ago

Hi @alvarobartt

Looking forward to the new functionality :)

As a user, I would like to be sure that I can always specify a unique stock and always get a unique stock back from the functions above, with no possibility of ambiguity.

The same symbol and isin might be introduced in several exchanges in the same country, and I don't want the legacy code to break because of this.

With that said, it is safest to always include the full 5-tuple when identifying a stock; leaving out, for example, the isin or the symbol might result in ambiguities.

Thanks and regards, Markus

markus080402 commented 4 years ago

Hi @alvarobartt

You probably know this, but different financial platforms (Yahoo Finance, Google Finance, et al.) sometimes use different symbols than investing.com. The isin should be the same across platforms, so for portability it is good to always include it.
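
To illustrate why the isin helps with portability, the sketch below lines up two symbol tables by isin. Both tables and the function symbols_by_isin are hypothetical, and every symbol and isin in them is a placeholder rather than a real instrument.

```python
# Hypothetical symbol -> isin tables for two platforms; all values are
# placeholders, not real symbols or ISINs.
platform_a = {"AAA": "XX0000000001", "BBB": "XX0000000002"}
platform_b = {"AAA.X": "XX0000000001", "CCC": "XX0000000003"}


def symbols_by_isin(*tables):
    """Map each isin to its symbol on every platform (None if missing)."""
    merged = {}
    for index, table in enumerate(tables):
        for symbol, isin in table.items():
            # One slot per platform, filled in as the tables are scanned.
            merged.setdefault(isin, [None] * len(tables))[index] = symbol
    return merged


aligned = symbols_by_isin(platform_a, platform_b)
```

Here the first placeholder instrument is matched across both platforms even though its symbol differs, which a join on symbol alone would miss.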

Thanks and regards, Markus

markus080402 commented 4 years ago

Hi @alvarobartt ,

Here is another observation: searching investing.com for, for example, LU0156801721 returns 8 different results, but stocks.csv only has 3.

image

Regards, Markus