JerBouma / FinanceDatabase

This is a database of 300.000+ symbols containing Equities, ETFs, Funds, Indices, Currencies, Cryptocurrencies and Money Markets.
https://www.jeroenbouma.com/projects/financedatabase
MIT License

industry field anomalies #6

Closed: mrx23dot closed this issue 3 years ago

mrx23dot commented 3 years ago

https://github.com/JerBouma/FinanceDatabase/raw/master/Database/Equities/Countries/United%20States/United%20States.json

The "industry" field contains the following anomalies:


"Banks\u2014Diversified",
"Banks\u2014Regional",
"Beverages\u2014Brewers",
"Beverages\u2014Non-Alcoholic",
"Beverages\u2014Wineries & Distilleries",

"Drug Manufacturers\u2014General",
"Drug Manufacturers\u2014Specialty & Generic",
"Insurance\u2014Diversified",
"Insurance\u2014Life",
"Insurance\u2014Property & Casualty",
"Insurance\u2014Reinsurance",
"Insurance\u2014Specialty",
"Real Estate\u2014Development",
"Real Estate\u2014Diversified",

These could also use some normalization: "Aerospace & Defense", "Aerospace/Defense - Major Diversified", "Aerospace/Defense Products & Services".

It's a lot faster to work with an offline database, cheers!

JerBouma commented 3 years ago

Ah, I fixed this within my search_products function. I will look into adjusting the JSON files as well!

This might not work, though, because '&' is usually not supported in filenames. I might change it to "and" in that case.

JerBouma commented 3 years ago

Fixed the issue. The problem was that \u2014 is the JSON escape for the em dash (U+2014), a special form of the dash ( - ) sign. This makes it very difficult to select that type of data. See the fix, for example: Banks - Diversified.
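For anyone working from an older copy of the JSON files, the same normalization can be sketched in Python. This is only an illustration, not the code used in the repository: the helper name `normalize_industry` and the sample record are assumptions made up for this example.

```python
import json

# U+2014 (em dash), written as an escape so it is easy to spot.
EM_DASH = "\u2014"


def normalize_industry(name: str) -> str:
    """Replace the em dash with a spaced ASCII hyphen (hypothetical helper)."""
    return name.replace(EM_DASH, " - ")


# Hypothetical sample record; the JSON text contains the literal escape \u2014.
raw = '{"industry": "Banks\\u2014Diversified"}'
record = json.loads(raw)  # json.loads decodes the escape to the em dash character
record["industry"] = normalize_industry(record["industry"])
print(record["industry"])  # Banks - Diversified
```

Matching on the decoded character rather than on the six-character escape text keeps the helper independent of how the file was serialized.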

I also updated strings like Aerospace/Defense Products & Services to Aerospace Defense Products & Services because the forward slash usually causes problems in filenames.

Please let me know if you find more weird Unicode characters!
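A rough sketch of that filename clean-up, assuming the slash is simply swapped for a space. The function name `filename_safe` and its `replace_ampersand` flag are hypothetical; the flag only illustrates the "and" idea floated earlier in this thread and is off by default, since the updated strings still contain '&'.

```python
def filename_safe(name: str, replace_ampersand: bool = False) -> str:
    """Make an industry name usable as a filename (hypothetical helper)."""
    # A forward slash is a path separator, so swap it for a space.
    cleaned = name.replace("/", " ")
    if replace_ampersand:
        # Optional: spell out '&' for filesystems or tools that reject it.
        cleaned = cleaned.replace("&", "and")
    # Collapse any repeated whitespace left behind by the replacements.
    return " ".join(cleaned.split())


print(filename_safe("Aerospace/Defense Products & Services"))
# Aerospace Defense Products & Services
```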