Closed INF800 closed 1 year ago
Hi,
I'd like to confirm my understanding of the issue. To take a concrete example, the goal is to see whether separating the records by sicSector yields groups with different "distributions". By "distributions" I mean, for example, that selecting sicSector=mining affects the frequencies and correlations of the remaining properties, including filingYear, stockcategory, etc.
If so, let us call the property whose values we segment by the segmenting property (sicSector in this example) and the rest the other properties. If we train a classifier to predict the segmenting property from a vector of the other properties, and the segmenting property really does separate the data on those properties, we should see high accuracy. (PS: the classifier will operate on categorical data and will predict the class number of sicSector.)
Let me know if my understanding and approach seem sensible.
Hi @tumble-weed 👋, thank you for showing interest in this issue.
Taking a concrete example, It is to see if separating out the records on the basis of sicSector leads to records having different "distributions". by "distributions", i mean for example that selecting sicSector=mining effects the frequency and correlations of the remaining properties including filingYear, stockcategory etc.
We are not interested in "distributions" in any way 😀
... let us call the property whose values we are segmenting by, as the segmenting property (in this example sicSector), and the remaining as other properties ... we train a classifier to predict the segmenting property from a vector consisting of other properties
That is an interesting idea you got there! You are very close in your understanding but not entirely correct. Instead of seeing how the segmenting property affects all the other properties, we are only interested in seeing how the segmenting property affects linkToFilingDetails.
linkToFilingDetails contains the URL of the HTML filing. The purpose of this issue is to find out whether these segmenting properties affect the HTML structure of filings. By "HTML structure" I mean anything that may affect parsing by sec-parser. The easiest way to find out is to simply download the filing document and have a look at it.
Let me know if this clears your doubts!
Hi, @INF800. Nice to meet you here. As we discussed on Discord, I think the purpose of this issue is to find out how filings from different sectors differ in HTML structure.
But by HTML structure, I think we could mean the semantic structure of the HTML as extracted by sec-parser.
Of course, the source of the HTML is linkToFilingDetails from the CSV.
Hi @kameleon-ad 👋,
I think the purpose of this issues is to find how sector differences differ when it comes to HTML structure.
Yes, exactly.
But in terms of HTML structure, I think it can be the semantic structure of the HTML extracted by sec-parser.
Thank you for this important point. By HTML structure I mean the inherent structure of the HTML filing available via linkToFilingDetails.
@kameleon-ad @tumble-weed Also, I just put together a starter notebook to help with a few other things, such as the problems you may hit when downloading the HTML files with Python. These happen because the SEC website only accepts requests that carry specific headers.
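For anyone not using the notebook, here is a minimal sketch of a download request with an explicit header. It assumes the header in question is `User-Agent` carrying a contact string; the thread only says "specific headers", so treat that as an assumption. `USER_AGENT`, `build_sec_request` and `fetch_html` are hypothetical names, not taken from the notebook.

```python
import urllib.request

# Hypothetical contact string (assumption); replace with your own name and email.
USER_AGENT = "Example Research admin@example.com"


def build_sec_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Build a GET request that identifies the caller via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download one filing and return its HTML as text."""
    with urllib.request.urlopen(build_sec_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

`fetch_html` would be called with a value from the linkToFilingDetails column. If requests are still rejected, the notebook is the reference for the exact headers it sends.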
The notebook code will help you fetch HTML files for each property type and put them in the following folder structure.
└── 10-Q
    ├── exchange
    │   ├── BATS_n50
    │   ├── NASDAQ_n50
    │   ├── Not Available_n50
    │   ├── NYSEARCA_n50
    │   ├── NYSEMKT_n50
    │   ├── NYSE_n50
    │   └── OTC_n50
    ├── isDelisted
    │   ├── False_n50
    │   ├── Not Available_n50
    │   └── True_n50
    ├── isUS
    │   ├── False_n50
    │   ├── Not Available_n50
    │   └── True_n50
    ├── market_cap_category
    │   ├── Large ($10-200B)_n50
    │   ├── Medium ($2-10B)_n50
    │   ├── Mega (>$200B)_n50
    │   ├── Micro ($50-300M)_n50
    │   ├── Nano ($0-50M)_n50
    │   ├── Not Available_n50
    │   └── Small ($0.3-2B)_n50
    ...
The notebook can be found here
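As a rough sketch of how the folder layout above could be produced, the snippet below groups CSV rows by one property and builds the matching directory name. It assumes the `_n50` suffix denotes the sample size and that missing values are bucketed as "Not Available"; both are guesses from the folder names, and `sample_dir` and `sample_by_property` are hypothetical helpers rather than the notebook's code.

```python
import random
from collections import defaultdict
from pathlib import Path


def sample_dir(root: str, form_type: str, prop: str, value: str, n: int) -> Path:
    """Return the directory for one property value, e.g. 10-Q/exchange/NYSE_n50."""
    return Path(root) / form_type / prop / f"{value}_n{n}"


def sample_by_property(rows, prop, n=50, seed=0):
    """Group rows by `prop` and draw up to `n` rows per distinct value."""
    groups = defaultdict(list)
    for row in rows:
        # Assumption: missing or empty values go into a "Not Available" bucket.
        groups[row.get(prop) or "Not Available"].append(row)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {value: rng.sample(members, min(n, len(members)))
            for value, members in groups.items()}
```

Each sampled row's linkToFilingDetails would then be downloaded into the directory returned by `sample_dir`.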
Hi, @INF800, nice to meet you
Maybe we just need to add some features to the CSV file that describe the structure of each HTML filing. But the first step is to define that "structure information" for an HTML filing, for example the number of table-of-contents items, the number of tables in the HTML, etc.
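A minimal sketch of what such structure features could look like, using only the standard library's `html.parser`. The chosen features (table, row and link counts plus nesting depth) are illustrative assumptions, not an agreed feature set, and `StructureCounter` and `structure_features` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "col", "area", "base", "wbr"}


class StructureCounter(HTMLParser):
    """Count start tags and track the deepest nesting level seen."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def structure_features(html: str) -> dict:
    """Return a small dictionary of structural counts for one HTML document."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    return {
        "n_tables": parser.tag_counts["table"],
        "n_rows": parser.tag_counts["tr"],
        "n_links": parser.tag_counts["a"],
        "max_depth": parser.max_depth,
    }
```

The depth value is only approximate for filings with unclosed tags such as bare `<p>`, since the parser does not repair the markup.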
Once we have defined the structure information for an HTML filing, we fill in the data in the CSV file. There are two ways to do this:
Then we cluster the HTML filings and, using the cluster assigned to each filing, calculate the correlation between the cluster and every feature (currency, location, isUS, market_cap_category, exchange, filingYear, isDelisted, category, sicSector, sicIndustry), so we can tell which features make a difference to the HTML filing.
I am sorry it's just a raw idea, I haven't done it yet.
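One possible way to measure the "correlation" between cluster labels and a categorical property is Cramér's V. That statistic is my own choice here, not something the comment above specifies. A minimal standard-library sketch, with `cramers_v` as a hypothetical helper:

```python
import math
from collections import Counter


def cramers_v(labels_a, labels_b) -> float:
    """Cramér's V between two categorical label lists (0 = no association, 1 = perfect)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    k = min(len(count_a), len(count_b))
    if n == 0 or k < 2:
        return 0.0  # association is undefined with fewer than two categories
    chi2 = 0.0
    for a, na in count_a.items():
        for b, nb in count_b.items():
            expected = na * nb / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))
```

For example, `cramers_v(cluster_ids, sic_sectors)` close to 1 would suggest that clusters line up with sicSector, while a value near 0 would suggest the property does not matter for structure. Small samples inflate the statistic, so it is best read as a ranking aid rather than a test.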
Hello @Risdan224, thank you for showing interest in this issue! You are right, we can leverage unsupervised clustering methods for this task. Manual inspection is not a viable option when dealing with hundreds of thousands of files; it is just a quick solution for now.
If you want to use clustering methods, you can create and experiment with your own feature sets. There are some interesting, if somewhat dated, projects such as html-cluster, which is based on the page-compare project, that can help with feature engineering. Please have a look at them and let me know what you think. I'd suggest starting with simple unsupervised methods first.
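As a starting point that does not depend on the html-cluster or page-compare APIs, here is a standard-library sketch of a structural similarity score: it compares the sequences of opening tags of two documents with `difflib.SequenceMatcher`. This is a stand-in technique of my own choosing, and `TagSequence`, `tag_sequence` and `structural_similarity` are hypothetical names.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Record the order of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list:
    """Return the list of opening tag names in document order."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def structural_similarity(html_a: str, html_b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical opening-tag sequences."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b), autojunk=False)
    return matcher.ratio()
```

Sequence matching can be slow on very large filings, so it may be worth truncating or downsampling the tag sequences before comparing many pairs.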
Hello @tumble-weed @Risdan224 @kameleon-ad, Please feel free to share your findings whenever you get a chance. I was actively working on it this past week and would like to discuss your perspectives on this.
👁️ Inspect structural similarity between SEC filings with common properties
🦠 Problem
We need to figure out if different filing property values correspond to different structures of HTML filings.
For example, the figure below shows different possible sicSector values for 10-Q filings. You need to figure out if filings with different sicSector values are structurally dissimilar or not. One way to do this is to sample 10 HTML filings with sicSector=Mining and another 10 HTML filings with sicSector=Services and manually inspect if they are similar or not. If they are dissimilar, sicSector is a valid property for sampling a representative sample set.
Do the same for currency, location, isUS, market_cap_category, exchange, filingYear, isDelisted, category, sicSector, sicIndustry of 10-K, 10-Q and 8-K filings and figure out if their different values correspond to different structures of HTML filings.
🌟 Other Approaches
Automate this by using unsupervised clustering methods.
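Before reaching for a full clustering method, one simple automated check is to compare average similarity within property groups against average similarity between them. The sketch below is not a clustering algorithm and is not prescribed by this issue. It assumes each filing has already been reduced to a set of structural tokens (for example its set of tag names), and `jaccard` and `within_between` are hypothetical helpers.

```python
from itertools import combinations, product
from statistics import mean


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def within_between(group_a, group_b, similarity=jaccard):
    """Return (mean within-group similarity, mean between-group similarity)."""
    within_pairs = list(combinations(group_a, 2)) + list(combinations(group_b, 2))
    between_pairs = list(product(group_a, group_b))
    if not within_pairs or not between_pairs:
        raise ValueError("need at least one within-group and one between-group pair")
    within = mean(similarity(x, y) for x, y in within_pairs)
    between = mean(similarity(x, y) for x, y in between_pairs)
    return within, between
```

If the within-group value is clearly higher than the between-group value for, say, sicSector=Mining versus sicSector=Services, that hints that the property separates filings structurally. How large a gap counts as "clearly" is left open here.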