alphanome-ai / sec-ai

A comprehensive open-source toolkit for AI-powered analysis and interpretation of SEC EDGAR filings, providing valuable insights for investors, fintech developers, and researchers.
https://sec.alphanome.app
MIT License

Inspect structural similarity between SEC filings with common properties. #44

Closed · INF800 closed this issue 1 year ago

INF800 commented 1 year ago

👁️ Inspect structural similarity between SEC filings with common properties

📚 Background

We have CSV files of 10-Q, 10-K, and 8-K filings made between 2003 and 2023. Each filing is represented by a row in the CSV file with the following properties:

  • ticker: Ticker symbol for the filing
  • currency: Currency type for the filing
  • companyName: Company name under which the filing was made
  • location: Location associated with the filing
  • isUS: Whether the location is in the US or not
  • filedAt: Filing time string
  • filingYear: Filing year
  • market_cap_category: Company size category
  • exchange: Exchange on which the company is listed
  • isDelisted: Whether the company has been delisted from the stock exchange or not
  • category: Stock category
  • sicSector: Sector associated with the filing
  • sicIndustry: Industry associated with the filing
The URL to download each filing is available in the linkToFilingDetails column.

(The plots below show the distribution charts for the above properties.)

The filings are available in the following Google Drive links:

🔗 10-Q Filings: https://drive.google.com/file/d/1XUBDNS1D52VUGV1B0BsGuY-XdA8IN4E4
🔗 10-K Filings: https://drive.google.com/file/d/1cC1Dd_OR9DRHNQ4eWJ6-2SbwEaevhNl0
🔗 8-K Filings: https://drive.google.com/file/d/1HhdQ7dZWZqqAUMG3HfgWnjFdPPxhNokX
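
A minimal sketch of pulling these CSVs down and inspecting the property distributions with gdown and pandas; the file IDs come from the links above, while the local filenames and the selection of properties printed are just illustrative:

```python
# Sketch only: assumes `pip install gdown pandas`; local filenames are illustrative.
import gdown
import pandas as pd

FILE_IDS = {
    "10-Q": "1XUBDNS1D52VUGV1B0BsGuY-XdA8IN4E4",
    "10-K": "1cC1Dd_OR9DRHNQ4eWJ6-2SbwEaevhNl0",
    "8-K": "1HhdQ7dZWZqqAUMG3HfgWnjFdPPxhNokX",
}

for form_type, file_id in FILE_IDS.items():
    gdown.download(f"https://drive.google.com/uc?id={file_id}", f"{form_type}.csv", quiet=False)

# Quick look at how a few of the properties are distributed in the 10-Q set.
df_10q = pd.read_csv("10-Q.csv")
for prop in ["sicSector", "exchange", "market_cap_category", "isUS"]:
    print(df_10q[prop].value_counts(dropna=False), "\n")
```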

Additionally, unlike 10-Q and 10-K filings, there are multiple types of 8-K forms with different items. These are represented as columns named in the format Item:{item-id}. Their distribution is shown in the attached chart.

🦠 Problem

We need to figure out if different filing property values correspond to different structures of HTML filings.

For example, the figure below shows the different possible sicSector values for 10-Q filings. You need to figure out whether filings with different sicSector values are structurally dissimilar or not. One way to do this is to sample 10 HTML filings with sicSector=Mining and another 10 HTML filings with sicSector=Services and manually inspect whether they are similar. If they are dissimilar, sicSector is a valid property for drawing a representative sample set.

[Figure: distribution of sicSector values for 10-Q filings]

Do the same for currency, location, isUS, market_cap_category, exchange, filingYear, isDelisted, category, sicSector, and sicIndustry of the 10-K, 10-Q, and 8-K filings, and figure out whether their different values correspond to different HTML filing structures.
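
A minimal sketch of the manual-inspection workflow described above, assuming the 10-Q CSV has been saved locally as 10-Q.csv; the sample size of 10 matches the example, everything else is illustrative:

```python
# Sketch only: sample a handful of filing URLs per property value so they can be
# opened and compared by hand. Sample size and file paths are illustrative.
import pandas as pd

df = pd.read_csv("10-Q.csv")

PROPERTIES = [
    "currency", "location", "isUS", "market_cap_category", "exchange",
    "filingYear", "isDelisted", "category", "sicSector", "sicIndustry",
]
N_PER_VALUE = 10

samples = {}
for prop in PROPERTIES:
    # For each distinct value of the property, pick up to N_PER_VALUE filing URLs.
    samples[prop] = (
        df.groupby(prop, dropna=False)["linkToFilingDetails"]
        .apply(lambda urls: urls.sample(min(len(urls), N_PER_VALUE), random_state=0).tolist())
        .to_dict()
    )

# e.g. the 10 Mining and 10 Services 10-Q filings mentioned above:
print(samples["sicSector"].get("Mining"))
print(samples["sicSector"].get("Services"))
```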

🌟 Other Approaches

Automate this by using unsupervised clustering methods.

tumble-weed commented 1 year ago

Hi, I'd like to confirm my understanding of the issue. Taking a concrete example, it is to see whether separating out the records on the basis of sicSector leads to records having different "distributions". By "distributions" I mean, for example, that selecting sicSector=Mining affects the frequency and correlations of the remaining properties, including filingYear, category, etc.

If so, let us call the property whose values we are segmenting by the segmenting property (in this example sicSector), and the remaining ones the other properties. If we train a classifier to predict the segmenting property from a vector consisting of the other properties, then, in case the segmenting property indeed does separate out the data on the other properties, we should see high accuracy. (PS: the classifier will work on categorical data and will predict the class number of sicSector.)

Let me know if my understanding and approach seem sensible.
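
For illustration, a minimal sketch of the classifier idea proposed above, assuming the 10-Q CSV as input; the model choice and the ordinal encoding are assumptions, not a prescription:

```python
# Sketch only: can the other properties predict the segmenting property?
# Model choice, encoding, and train/test split are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

df = pd.read_csv("10-Q.csv")

SEGMENTING = "sicSector"
OTHER = ["currency", "location", "isUS", "market_cap_category",
         "exchange", "filingYear", "isDelisted", "category", "sicIndustry"]

X = OrdinalEncoder().fit_transform(df[OTHER].astype(str))
y = df[SEGMENTING].astype(str)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```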

INF800 commented 1 year ago

Hi @tumble-weed 👋, thank you for showing interest in this issue.

> Taking a concrete example, it is to see whether separating out the records on the basis of sicSector leads to records having different "distributions". By "distributions" I mean, for example, that selecting sicSector=Mining affects the frequency and correlations of the remaining properties, including filingYear, category, etc.

We are not interested in "distributions" in any way 😀

> ... let us call the property whose values we are segmenting by the segmenting property (in this example sicSector), and the remaining ones the other properties ... we train a classifier to predict the segmenting property from a vector consisting of the other properties

That is an interesting idea! You are very close in your understanding, but not entirely correct. Instead of looking at how the segmenting property affects all the other properties, we are only interested in how the segmenting property affects linkToFilingDetails.

linkToFilingDetails contains the URL of the HTML filing. The purpose of this issue is to find out whether these segmenting properties affect the HTML structure of the filings. By "HTML structure" I mean anything that may affect the parsing done by sec-parser. The easiest way to find out is to simply download the filing document and have a look at it.

Let me know if this clears your doubts!

kameleon-ad commented 1 year ago

Hi, @INF800. Nice to meet you here. As we discussed on Discord, I think the purpose of this issue is to find out how sector differences show up in the HTML structure.

But in terms of HTML structure, I think it could also mean the semantic structure of the HTML extracted by sec-parser. Of course, the source of the HTML is linkToFilingDetails from the CSV.

INF800 commented 1 year ago

Hi @kameleon-ad 👋,

> I think the purpose of this issue is to find out how sector differences show up in the HTML structure.

Yes, exactly.

> But in terms of HTML structure, I think it could also mean the semantic structure of the HTML extracted by sec-parser.

Thank you for this important point. By HTML structure I mean the inherent structure of the HTML filing available via linkToFilingDetails.

INF800 commented 1 year ago

@kameleon-ad @tumble-weed Also, I just crafted a starter notebook to help you with some other things, like the problems you may face while downloading the HTML files using Python. This happens because the SEC website only accepts requests with specific headers.

The notebook code will help you fetch HTML files for each property type and put them in the following folder structure:

└── 10-Q
    ├── exchange
    │   ├── BATS_n50
    │   ├── NASDAQ_n50
    │   ├── Not Available_n50
    │   ├── NYSEARCA_n50
    │   ├── NYSEMKT_n50
    │   ├── NYSE_n50
    │   └── OTC_n50
    ├── isDelisted
    │   ├── False_n50
    │   ├── Not Available_n50
    │   └── True_n50
    ├── isUS
    │   ├── False_n50
    │   ├── Not Available_n50
    │   └── True_n50
    ├── market_cap_category
    │   ├── Large ($10-200B)_n50
    │   ├── Medium ($2-10B)_n50
    │   ├── Mega (>$200B)_n50
    │   ├── Micro ($50-300M)_n50
    │   ├── Nano ($0-50M)_n50
    │   ├── Not Available_n50
    │   └── Small ($0.3-2B)_n50
    ...

The notebook can be opened via the "Open in Colab" link.
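
For reference, a minimal sketch of fetching a single filing with a descriptive User-Agent header (which SEC EDGAR expects); the exact header value and the folder naming below are assumptions modelled on the tree above, and may differ from what the notebook does:

```python
# Sketch only: SEC EDGAR rejects requests without a descriptive User-Agent.
# Header value and folder layout are illustrative.
import pathlib
import requests

HEADERS = {"User-Agent": "Your Name your.email@example.com"}

def save_filing(url: str, form_type: str, prop: str, value: str, n: int) -> pathlib.Path:
    """Download one HTML filing and store it under <form>/<property>/<value>_n<N>/."""
    out_dir = pathlib.Path(form_type) / prop / f"{value}_n{n}"
    out_dir.mkdir(parents=True, exist_ok=True)
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    out_path = out_dir / (url.rstrip("/").split("/")[-1] or "filing.html")
    out_path.write_text(response.text, encoding="utf-8")
    return out_path

# Example: save one sampled 10-Q filing for exchange=NASDAQ.
# save_filing(some_link_to_filing_details, "10-Q", "exchange", "NASDAQ", 50)
```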

Risdan224 commented 1 year ago

Hi, @INF800, nice to meet you

Maybe we just need to add some features to the CSV file containing information about the structure of the HTML filing. But the first thing to do is to define this "structural information of an HTML filing", for example: the number of table-of-contents items, the number of tables in the HTML, etc. (a sketch of such features is at the end of this comment).

Then, after we define the structural information of the HTML filing, we fill in the data in the CSV file. There are two ways to do this:

  1. As you said, do it manually, not for all HTML filings, but using samples (e.g. 20 documents for every unique value of the features currency, location, isUS, market_cap_category, exchange, filingYear, isDelisted, category, sicSector, sicIndustry)
  2. Use sec-parser to extract the structural information

Then we use clustering to group every HTML filing; from the clusters we obtain, we calculate the correlation of the cluster labels with every feature (currency, location, isUS, market_cap_category, exchange, filingYear, isDelisted, category, sicSector, sicIndustry), so we can see which features make a difference in the HTML filings.

I am sorry, it's just a rough idea, I haven't tried it yet.
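
A minimal sketch of the structural features mentioned above, computed from a downloaded HTML filing with BeautifulSoup; which tags to count is an assumption:

```python
# Sketch only: a few simple structural features per filing. The choice of
# features (tag counts, nesting depth) is illustrative, not prescriptive.
from bs4 import BeautifulSoup

def structural_features(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    def depth(tag) -> int:
        return len(list(tag.parents))

    return {
        "n_tables": len(soup.find_all("table")),
        "n_divs": len(soup.find_all("div")),
        "n_headings": len(soup.find_all(["h1", "h2", "h3", "h4"])),
        "n_links": len(soup.find_all("a")),
        "max_depth": max((depth(t) for t in soup.find_all(True)), default=0),
    }

# Example: features = structural_features(open("some_downloaded_filing.html").read())
```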

INF800 commented 1 year ago

Hello @Risdan224, thank you for showing interest in this issue! You are right, we can leverage unsupervised clustering methods for this task. Inspecting filings manually is not a viable option when dealing with hundreds of thousands of files; it is just a quick solution for now.

If you want to use clustering methods, you can create and experiment with your own feature sets. There are some interesting, though somewhat dated, projects like html-cluster (based on the page-compare project) that can help you with feature engineering. Please have a look at them and let me know what you think. I'd suggest starting with simple unsupervised methods first.
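
A minimal sketch of such a simple unsupervised baseline: a tag-frequency profile per filing (in the spirit of page-compare's structural comparison), clustered with k-means and cross-tabulated against a property; the vectorization and the number of clusters are assumptions:

```python
# Sketch only: cluster filings by their HTML tag-frequency profile and see
# whether the clusters line up with a property such as sicSector.
# Vectorization, scaling, and the number of clusters are illustrative.
from collections import Counter

import pandas as pd
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

def tag_profile(html: str) -> Counter:
    """Frequency of each HTML tag name in one filing."""
    soup = BeautifulSoup(html, "html.parser")
    return Counter(tag.name for tag in soup.find_all(True))

# `htmls` and `sectors` are assumed to be parallel lists built from the
# downloaded sample (HTML text and the sicSector value of each filing).
def cluster_and_compare(htmls, sectors, n_clusters=8) -> pd.DataFrame:
    X = DictVectorizer(sparse=False).fit_transform([tag_profile(h) for h in htmls])
    X = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    # Rows: sicSector values, columns: cluster ids. Clusters concentrating on a
    # few sectors would suggest sicSector corresponds to distinct HTML structures.
    return pd.crosstab(pd.Series(sectors, name="sicSector"), pd.Series(labels, name="cluster"))
```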

INF800 commented 1 year ago

Hello @tumble-weed @Risdan224 @kameleon-ad, please feel free to share your findings whenever you get a chance. I have been actively working on this over the past week and would like to discuss your perspectives.