WallStreetAnalytics / wallstreetanalytics

An endeavor to create an analytics tool to democratize the information hedge funds are creating teams to collect.
813 stars · 30 forks

Step zero: Could we vote on a common set of goals? #16

Open notjoshjames opened 3 years ago

notjoshjames commented 3 years ago

I feel paralyzed by the staggering momentum we have here, and I've heard from a few in the Discord who feel similarly.

There's a lot of early discussion around (potentially) big decisions like choosing a language or framework. I wonder if that might be counterproductive.

One of the beautiful things about git and the internet as a whole is that everyone can contribute, regardless of language or medium. The consensus I see here is that we're tired of the status quo... perhaps our next consensus could be deciding on some core set of functionalities that would universally benefit the public?

The first thought I had is that more accessible data = better deep dives and discussions around securities and markets. I've watched a lot of people with great intentions spread inaccurate data (I'm thinking of short interest in particular) because a primary data source (like the SEC or FINRA) did not have a clear standard for reporting it. A data aggregator pulling from a number of public bodies, with a standardized schema, could save a lot of time for anyone who wants to model.

As this repository is clearly labeled "WSB Analytics," a subsequent goal seems to be broader signal consolidation and interpretation by scraping WSB (and r/options, r/investing, etc.), but I wonder if there could also be value in incorporating signals such as reddit user history to help weed out bad actors. This could have far-reaching benefits for the internet as a whole - could we start building an open-source library around clearly identifiable shill accounts?

Honestly, a good starting point might be for everyone to list 3 features they would like to see in an open-source financial market platform. The DTCC is a single, central point of failure for the entire system. It's also a private company, and that's fundamentally a danger.

I'd like to see:

  1. the ability to navigate large datasets from the SEC and FINRA in a standardized format (with params like last_updated)
  2. the ability to embed / validate these numbers with a signature
  3. the ability to combine other legal filings that may not be recorded by a financial regulatory body
  4. the ability to monitor financial activities by persons of interest
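To make point 1 a bit more concrete, here's a minimal sketch of what a standardized record schema with a `last_updated` field might look like. Every field name here is an illustrative assumption, not a settled design:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ShortInterestRecord:
    """One normalized short-interest observation, whatever the source."""
    ticker: str             # e.g. "GME"
    source: str             # e.g. "FINRA", "SEC" - the reporting body
    shares_short: int       # shares reported short
    settlement_date: datetime
    last_updated: datetime  # when the aggregator last refreshed this row

def latest(records):
    """Pick the freshest record for a ticker across all sources."""
    return max(records, key=lambda r: r.last_updated)
```

With two conflicting reports for the same ticker, `latest()` would let a modeler grab whichever source refreshed most recently instead of hand-reconciling them.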
DrewMcArthur commented 3 years ago

+1 for sure; i filed #10 but you’re totally right to call this step 0. for now, i think discussion around values and goals is important, and ideally we start creating sub-repos (ive used this word a few times but not actually sure it’s a real thing) for categories of stuff, like data mining, data processing, etc.

i’d really like to help figure out the best platform and tools to get ideas communicated and decided on, consensus reached etc, so that we can move forward with confidence that it’s the “right” direction, or at least the direction that the people as a whole want to go in.

i’m all for things being as decentralized as possible, power spread out, information transparent and accessible, all of that good stuff.

hackerncoder commented 3 years ago

Sub-repos do not exist (AFAIK); we would have to create separate repos. That's why I am pushing for an organization: people in the organization get cross-repo rights (e.g. write access), whereas (I believe) permissions would have to be granted manually for each repo under a user account.

That's why I propose step 0 is to organize ourselves.

athielen commented 3 years ago

I'd like to see:

  1. the ability to aggregate data from many different sources into an open format for time series and graph data storage
  2. the ability to overlay different sources of data to potentially correlate empirical market data with social media data (think drop in price overlaid with increase posting from known shill accounts)
  3. open source development on tools/scripts that could rival a data science team and a Bloomberg terminal
  4. the ability to take the entire project and run it on my homelab, a server in Sweden, AWS, or GCP
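Point 2 above (overlaying market data with social media data) boils down to correlating two time series. A minimal stdlib-only sketch, with made-up numbers purely for illustration:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient; no numpy so it runs anywhere."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical overlay: daily price change vs. posts from flagged accounts.
price_change = [-0.5, -2.1, -4.0, 1.2, 0.3]
shill_posts = [3, 11, 25, 2, 4]
```

For this toy data `pearson(price_change, shill_posts)` comes out strongly negative, i.e. price drops line up with posting spikes - exactly the kind of signal the overlay would surface visually.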
Bradfordly commented 3 years ago

This is really important!! My features are going to be more focused on maintainable design.

  1. De-coupled, containerized services to distribute the work of handling data pipelines, built behind interfaces that can integrate with cloud-native services from a variety of providers (AWS, Azure, etc.). This has the added benefit of greatly expanding who can contribute to this project; not all backend services need be written in the same language.
  2. Rigorous testing standards as part of code-review to ensure that we not only have confidence that our platform is working as expected, but that our data is reliable. >90% code coverage, behavior driven tests, e2e tests, and all of that jazz.

I am struggling to think of a third... will try and edit tomorrow with one.

qcasey commented 3 years ago

> De-coupled, containerized services to distribute the work of handling data pipelines that are built behind interfaces that can integrate with cloud native services from a variety of providers (AWS, Azure, etc).

It might go without saying but I'm going to say it anyway: easily self-hostable.

Containers solve this to a point, but I'd like to see cloud-free instances as a default. Just substituting S3 with MinIO, for example, would go a long way in making these analytics repeatable for anyone with a docker-compose file.
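As a rough sketch of that MinIO-for-S3 substitution, a compose fragment along these lines could give any self-hoster an S3-compatible object store. Service names and credentials here are placeholders, not a proposed config:

```yaml
services:
  object-store:
    image: minio/minio
    command: server /data
    environment:
      MINIO_ROOT_USER: example-user      # placeholder credentials only
      MINIO_ROOT_PASSWORD: example-pass
    ports:
      - "9000:9000"
    volumes:
      - store-data:/data

volumes:
  store-data:
```

Any pipeline service that talks S3 could then point its endpoint at `object-store:9000` instead of AWS.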

EDIT: Also completely agree with your second goal on code testing, very important

honggyu420 commented 3 years ago

> +1 for sure; i filed #10 but you’re totally right to call this step 0. for now, i think discussion around values and goals is important, and ideally we start creating sub-repos (ive used this word a few times but not actually sure it’s a real thing) for categories of stuff, like data mining, data processing, etc.
>
> i’d really like to help figure out the best platform and tools to get ideas communicated and decided on, consensus reached etc, so that we can move forward with confidence that it’s the “right” direction, or at least the direction that the people as a whole want to go in.
>
> i’m all for things being as decentralized as possible, power spread out, information transparent and accessible, all of that good stuff.

Agreed. It's fun to talk about tech stacks and architecture but I think a lot of people are jumping the gun here; the project needs to decide what it wants/needs to do from a high level before we talk about technology. Like, what kind of data needs to be collected and what the tools are going to do with these datasets.

dannyseymour2 commented 3 years ago

I'd say my "to the moon" goal would be an open source Bloomberg terminal. I like @notjoshjames's goals, and @athielen's as well, and would add

  1. A single pane of glass UX display on the frontend to organize all this info and make it accessible and easily usable
  2. Make this accessible to people without much financial knowledge. If the overarching goal is to democratize big financial data, we should make sure that even if someone can't calculate Black-Scholes, they can still make data-driven investment decisions in options.
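Since Black-Scholes came up, here's a minimal sketch of the kind of calculation a friendly frontend could hide from users: the standard closed-form price of a European call, using only the stdlib. This is a textbook formula, not a proposed module for the project:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(spot, strike, rate, vol, years):
    """Black-Scholes price of a European call option.

    spot/strike in dollars, rate and vol as annualized decimals,
    years as time to expiry.
    """
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (vol * sqrt(years))
    d2 = d1 - vol * sqrt(years)
    return spot * norm_cdf(d1) - strike * exp(-rate * years) * norm_cdf(d2)
```

For example, an at-the-money call (spot = strike = 100, 5% rate, 20% vol, 1 year) prices at roughly 10.45 - the UI would just show that number.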
porkbuffet commented 3 years ago

I think that data aggregation & analysis are good objectives, but there are also other directions we could explore (if the overall goal is to further democratize markets). One idea could be to create a decentralized framework for peer-to-peer real-time order sharing over TCP (via WebSockets). The imagined framework would have a mechanism to directly compensate individuals for their order flow. The deliverables I'm picturing are a smart contract & a non-profit org as a cloud host. This is just a rough idea, but hopefully it helps expand the discussion.
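To make the order-sharing idea slightly more concrete, here's a sketch of what one wire message might look like. Every field name here is a guess for illustration, not a spec:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class OrderMessage:
    """One order-flow event as it might travel peer to peer."""
    ticker: str
    side: str         # "buy" or "sell"
    quantity: int
    limit_price: float
    sender_id: str    # would be a public key in a real design

    def to_wire(self) -> str:
        """Serialize for a WebSocket text frame."""
        return json.dumps(asdict(self))

    @classmethod
    def from_wire(cls, raw: str) -> "OrderMessage":
        return cls(**json.loads(raw))
```

A JSON text frame is the simplest starting point; a real design would add signing (so peers can verify the sender before compensating them) and probably a binary encoding.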

liamweldon commented 3 years ago

I think data aggregation/sanitization/analysis from public datasets like SEC filings is important for being able to quickly analyze stock fundamentals, and would definitely want to see that available in open source. I think it would go a long way to truly democratizing finance if "public" data didn't have an entry fee of hundreds to thousands of dollars a year to access it in a usable format. However, if the goal of this repo is to provide a public answer to organizational social media data mining, I would like to see:

  1. Sentiment analysis of stocks trending on reddit/twitter/other social media, ideally filtering out bad actors.
  2. Graphing changes in sentiment of trending stocks over time, possibly overlaid with changes in the stock's price.
  3. Search functionality mapping a given ticker to its sentiment analysis.
Objective-Resolve commented 3 years ago

> I think data aggregation/sanitization/analysis from public datasets like SEC filings is important for being able to quickly analyze stock fundamentals, and would definitely want to see that available in open source. I think it would go a long way to truly democratizing finance if "public" data didn't have an entry fee of hundreds to thousands of dollars a year to access it in a usable format. However, if the goal of this repo is to provide a public answer to organizational social media data mining, I would like to see:
>
>   1. Sentiment analysis of stocks trending on reddit/twitter/other social media, ideally filtering out bad actors.
>   2. Graphing changes in sentiment of trending stocks over time, possibly overlaid with changes in the stock's price.
>   3. Search functionality mapping a given ticker to its sentiment analysis.

I absolutely agree with this, and feel we should be looking at a few different factors: sentiment (AFINN is my personal favorite, but we'll eventually want to build out our own dictionary due to slang usage); Flesch-Kincaid (FK) readability (a good general summary of length, word choice, and so on); categorization (DD, speculation, reporting (loss/gains), inquiry, and so on - we can expand that out); & reddit API data.

I've worked on a project in Python that does the sentiment and FK readability on a corpus, but I'm not sure how to scrape reddit to get that data. However, if we can get the data from the principal stock subs, then I can build out the sentiment and FK side, and start work on the categorization (which might be an ongoing WIP).
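For anyone unfamiliar with the two metrics, here's a stdlib-only sketch. The FK grade formula is the standard one (0.39 · words/sentence + 11.8 · syllables/word − 15.59); the syllable counter is a crude heuristic, and the mini-lexicon is a made-up stand-in for the real AFINN list:

```python
import re

def count_syllables(word):
    """Crude vowel-group heuristic; real FK tooling uses a pronunciation dictionary."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade level of a piece of text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Illustrative mini-lexicon; a real run would load the full AFINN word list.
LEXICON = {"moon": 3, "tendies": 2, "bagholder": -3, "drill": -2}

def sentiment(text):
    """Sum of lexicon scores for every known word in the text."""
    return sum(LEXICON.get(w, 0) for w in re.findall(r"[A-Za-z']+", text.lower()))
```

Both functions take raw post text, so they'd slot in directly after whatever reddit scraper the project settles on.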

dylanhitt commented 3 years ago

I'd like to see:

Ease of portability - whether it's for people hosting this on home labs for small groups of friends or people attempting to host it at scale on a major cloud provider. I think keeping in mind that most people who self-host won't have the budget to afford robust home labs will be important for democratized use.

This opinion is probably way too early, to be honest. I'm just way too excited not to express it.

Cheers