Closed vob2 closed 4 years ago
We've got a technical problem. Intellect Mergent can only search 500 DUNS numbers a time. We have a total of 1,224,395 DUNS in the data. Is there a way to automate the process so that I don't have to manually search and download 2449 times?
Hi Jie, one clarification: the data on DUNS that I shared included contract IDs so we can merge the financial information back into our sample. This means that one firm appears multiple times in the data - once for each contract it holds.
I checked and there are 44,786 unique DUNS in the sample. It is still quite large though. One way to deal with this issue can be to split the data into ~100 CSVs. I will try to write a short script for this and get back to you.
Thanks. 44,000 makes much more sense than 1.2 million!
Would the script you have in mind will automate the entire process from uploading duns number to downloading the data? If not, I will do it manually. 90 times is not a problem at all.
On Wed, May 27, 2020, 12:28 AM vibhuti6 notifications@github.com wrote:
Hi Jie, one clarification: the data on DUNS that I shared included contract IDs so we can merge the financial information back into our sample. This means that one firm appears multiple times in the data - once for each contract it holds.
I checked and there are 44,786 unique DUNS in the sample. It is still quite large though. One way to deal with this issue can be to split the data into ~100 CSVs. I will try to write a short script for this and get back to you.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/QuickPay-Operational-Performance/Data-and-code/issues/27#issuecomment-634420781, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOBDD5XZO33GVGBIQG34AGDRTSJG5ANCNFSM4NLM27OQ .
I am thinking of only automating the part for splitting into multiple CSVs. We could also automate the entire process using some web scraping tools, but I would not be able to test or debug that code because your VPN and account details are required to access the data. So I think it is better to go with the first approach. Thanks!
Great effort. A suggestion. Before doing the full process, how about downloading a subsample of information, to get an idea of overlap between two datasets and the amount of information about companies present. If only 1% of companies we have are represented in Mergent, we do not have to waste time completing the process.
@vob2 I did that yesterday for 500 DUNS and Mergent covers the majority of them. A problem is that it does not have the financials for many, which I suspect are the small private ones. But we can get the historical sales data on all companies. That's why I want to download the whole thing, match them, and see.
@vibhuti6 Could you send me the script you'd use to automate the process? What scrapping tool do you use? I will see if I can modify it and debug.
Hi Jie, I have used Python in the past to obtain data from the USASpending API. So I’m hoping a modified version of that code would work for this purpose. But that data was open access and didn’t require any password or VPN. I will try to alter that code for automating this process and get back to you. Thanks!
On May 27, 2020, at 8:12 AM, JNing0 notifications@github.com wrote:
@vob2https://github.com/vob2 I did that yesterday for 500 DUNS and Mergent covers the majority of them. A problem is that it does not have the financials for many, which I suspect are the small private ones. But we can get the historical sales data on all companies. That's why I want to download the whole thing, match them, and see.
@vibhuti6https://github.com/vibhuti6 Could you send me the script you'd use to automate the process? What scrapping tool do you use? I will see if I can modify it and debug.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/QuickPay-Operational-Performance/Data-and-code/issues/27#issuecomment-634729207, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ALMNL4PSTCTY7BNWU4CQRV3RTUUWJANCNFSM4NLM27OQ.
Hi Jie, I have posted a pseudo-code for automating this task here. But this code does not account for vpn or account passwords. Also, line 28 on the code assumes the output will only be a csv -- I cannot verify or fix this without looking at the response from the webpage. Hope this helps a bit. Thanks!
Here is the data from Mergent Intellect. The website does not have an API so I did it manually. @vibhuti6 I did not use the later files that you created. I used the very first file you posted with all DUNS numbers and split them to 90 files.
Thanks!
Hi all, here's a summary of what we have in the Intellect data. It looks like we have the number of employees and sales data for most firms. The rest is quite sparse -- this could still be a sizeable sample when we merge with the USASpending data if these firms hold bulk of the contracts. Code here. Thanks!
“Year 1” is the most recent year, while “year 2” means 2 years ago, etc… And the public companies follow use a fiscal year end.
number_of_firms = Number of firms for which the respective column information is not null
number_of_firms | |
---|---|
D-U-N-S@ Number | 36999 |
Company Name | 36999 |
Sales Volume | 2286 |
Employee All Sites | 0 |
Employee This Site | 0 |
2010 Sales Volume | 28430 |
2010 Employee All Sites | 28430 |
2010 Employee This Site | 28430 |
2011 Sales Volume | 29201 |
2011 Employee All Sites | 29201 |
2011 Employee This Site | 29201 |
2012 Sales Volume | 29802 |
2012 Employee All Sites | 29802 |
2012 Employee This Site | 29802 |
2013 Sales Volume | 30188 |
2013 Employee All Sites | 30188 |
2013 Employee This Site | 30188 |
2014 Sales Volume | 30539 |
2014 Employee All Sites | 30539 |
2014 Employee This Site | 30539 |
2015 Sales Volume | 30812 |
2015 Employee All Sites | 30812 |
2015 Employee This Site | 30812 |
2016 Sales Volume | 31191 |
2016 Employee All Sites | 31191 |
2016 Employee This Site | 31191 |
2017 Sales Volume | 31216 |
2017 Employee All Sites | 31216 |
2017 Employee This Site | 31216 |
2018 Sales Volume | 31039 |
2018 Employee All Sites | 31039 |
2018 Employee This Site | 31039 |
Year5 Total Assets | 367 |
Year4 Total Assets | 379 |
Year3 Total Assets | 382 |
Year2 Total Assets | 382 |
Year1 Total Assets | 382 |
Year5 Current Assets | 363 |
Year4 Current Assets | 374 |
Year3 Current Assets | 377 |
Year2 Current Assets | 377 |
Year1 Current Assets | 377 |
Year5 Total Liabilities | 367 |
Year4 Total Liabilities | 379 |
Year3 Total Liabilities | 382 |
Year2 Total Liabilities | 382 |
Year1 Total Liabilities | 382 |
Year5 Long Term Debt | 259 |
Year4 Long Term Debt | 263 |
Year3 Long Term Debt | 268 |
Year2 Long Term Debt | 272 |
Year1 Long Term Debt | 263 |
Year5 Current Liabilities | 363 |
Year4 Current Liabilities | 375 |
Year3 Current Liabilities | 377 |
Year2 Current Liabilities | 377 |
Year1 Current Liabilities | 377 |
Year5 Total Revenue | 373 |
Year4 Total Revenue | 379 |
Year3 Total Revenue | 382 |
Year2 Total Revenue | 381 |
Year1 Total Revenue | 382 |
Year5 EBITDA | 369 |
Year4 EBITDA | 377 |
Year3 EBITDA | 377 |
Year2 EBITDA | 377 |
Year1 EBITDA | 376 |
Year5 Operating Income | 369 |
Year4 Operating Income | 375 |
Year3 Operating Income | 377 |
Year2 Operating Income | 376 |
Year1 Operating Income | 377 |
Year5 Net Income | 373 |
Year4 Net Income | 381 |
Year3 Net Income | 382 |
Year2 Net Income | 382 |
Year1 Net Income | 382 |
Year5 Cash from Operations | 371 |
Year4 Cash from Operations | 379 |
Year3 Cash from Operations | 379 |
Year2 Cash from Operations | 379 |
Year1 Cash from Operations | 378 |
Year5 Capital Expenditures | 367 |
Year4 Capital Expenditures | 368 |
Year3 Capital Expenditures | 368 |
Year2 Capital Expenditures | 368 |
Year1 Capital Expenditures | 365 |
Year5 Cash & Cash Equivalents, Beginning of Year | 368 |
Year4 Cash & Cash Equivalents, Beginning of Year | 374 |
Year3 Cash & Cash Equivalents, Beginning of Year | 375 |
Year2 Cash & Cash Equivalents, Beginning of Year | 378 |
Year1 Cash & Cash Equivalents, Beginning of Year | 377 |
Year5 Cash & Cash Equivalents, End of Year | 368 |
Year4 Cash & Cash Equivalents, End of Year | 373 |
Year3 Cash & Cash Equivalents, End of Year | 377 |
Year2 Cash & Cash Equivalents, End of Year | 377 |
Year1 Cash & Cash Equivalents, End of Year | 377 |
Year5 Operating Margin | 369 |
Year4 Operating Margin | 375 |
Year3 Operating Margin | 377 |
Year2 Operating Margin | 376 |
Year1 Operating Margin | 377 |
Year5 Profit Margin (TTM) | 373 |
Year4 Profit Margin (TTM) | 379 |
Year3 Profit Margin (TTM) | 382 |
Year2 Profit Margin (TTM) | 381 |
Year1 Profit Margin (TTM) | 382 |
Year5 Quick Ratio | 335 |
Year4 Quick Ratio | 345 |
Year3 Quick Ratio | 347 |
Year2 Quick Ratio | 343 |
Year1 Quick Ratio | 342 |
Year5 Current Ratio | 363 |
Year4 Current Ratio | 374 |
Year3 Current Ratio | 377 |
Year2 Current Ratio | 377 |
Year1 Current Ratio | 377 |
Year5 Net Current Assets as % of Total Assets | 363 |
Year4 Net Current Assets as % of Total Assets | 374 |
Year3 Net Current Assets as % of Total Assets | 377 |
Year2 Net Current Assets as % of Total Assets | 377 |
Year1 Net Current Assets as % of Total Assets | 377 |
Year5 Free Cash Flow per Share | 108 |
Year4 Free Cash Flow per Share | 108 |
Year3 Free Cash Flow per Share | 108 |
Year2 Free Cash Flow per Share | 109 |
Year1 Free Cash Flow per Share | 109 |
Year5 Revenue to Assets | 367 |
Year4 Revenue to Assets | 377 |
Year3 Revenue to Assets | 382 |
Year2 Revenue to Assets | 381 |
Year1 Revenue to Assets | 382 |
Thanks, Vibhuti! Yes, it is worth to check the intersection with USA data contracts. But I would guess we lose quite a few contracts from the sample, and probably from the small firms. Empirical research is hard.
Thanks, Vlad! You are right. I just checked -- there are 9,026 contracts for which we have all of the financial information and only 240 of these are awarded to small businesses.
But there are 810,780 contracts for firms with sales and employee information; 648,994 of which are given to small businesses. At this point, however, it is not entirely clear to me what we can do with the sales and employment information alone. I will think about it.
matched government award data with Compustat. The process is described in Appendix A. It is semi-automated, but I suspect very time consuming and there are few firms that are left in the sample.
Still, something to keep in mind.
Does not seem easily doable. Closing for now
For public firms the financial are from COMPUSTAT (all B-schools have access to that through WRDS https://wrds-www.wharton.upenn.edu/) For private firms perhaps Intellect Mergent dataset that CWRU has would work. I am not sure.
Mergent Online (https://www.mergentonline.com/login.php) is another dataset to try. CWRU has that.
We want all standard variables:
Leverage Inventory | Inventory | Inventory/Total Assets = INVT/AT | Compustat Inventory Cash and marketable secutirities | Cash and Marketable Securities/ Total Assets = CHE/AT | Compustat Accounts receivable | Accounts Receivable/ Total Assets = RECT/AT | Compustat Accounts payable | Accounts Payable/Total Assets = AP/AT | Compustat R&D | R&D/Assets = XRD/AT | Compustat Tobin's Q | Tobin’s Q = ((CSHO*PRCC)+(DLTT+DLC))/AT | Compustat Operating leverage | SG&A/Assets = XSGA/AT | Compustat Investment grade | Indicator that takes a value of one for investment grade long term issue rating or a value of zero otherwise | Compustat Customer sales % | Percentage of sales contributed by the customer | Compustat Expected default frequency | Calculated based on the distance to default from Bhararth and Shumway (2008) | Compustat Trade credit | Calculated as supplier sales to a customer ∗ supplier accounts receivable/total supplier sales | Compustat