ahirner / TabulaRazr-OS

Extract tabular data and semantically discover it with ease! (OS)
GNU Affero General Public License v3.0
21 stars 3 forks source link

Make XIRR calculation working for more cases #1

Open ahirner opened 8 years ago

ahirner commented 8 years ago

This includes three steps in the process:

1) Identifying the right tables with better heuristics. One simple idea is to use more and/or better filter terms. For now it's just merely one term, e.g. for the funds table [here].(http://tabularazr.eastus.cloudapp.azure.com:7081/static/filters/funds.json). A more sophisticated approach is to use semantic search instead of fuzzy keyword search only.

2) Identifying the right right key-value pairs in the funds table. E.g. the funds table is now only searching for four terms: 'Underwriter Discount', 'Issue Premium', 'Principal Amount', 'Costs of Issuance' @joffemd: Can you come up with a list of synonyms and alternative calculations paths for examples that are most important to the project? In this example, it seems to me the table is correctly identified and thus it's about the 2nd step.

joffemd commented 8 years ago

I think the problem may be with handling a number in parenthesis. If you see Underwriter's Discount in parenthesis (which means negative), just take the absolute value,

Here are some synonyms:

Costs of Issuance - Cost of issuance, issuance costs, issuance expenses Underwriter's Discount - Underwriter Fee, Underwriting Fee, Underwriting Fees and Expenses Issue Premium - New Issue Premium Principal Amount - Principal, Bonds

It would be great to support wildcards because Costs of Issuance is often followed by an asterisk or integer referencing a footnote at the end of the table.

ahirner commented 8 years ago

Thanks @joffemd. Since the string matching already ranks the best hits with partial matching and editing distance, such minor permutations should be of no concern. This is the salient function for that: https://github.com/ahirner/TabulaRazr-OS/blob/xirr-specific/backend.py#L20

Thus, I think its a missing value of "Issue Premium", which is optional to factor into the calculation afaik. Is that correct? Maybe you could jot down the exact calculation for this one case to make sure.

joffemd commented 8 years ago

New Issue Premium is the opposite of New Issue Discount. You would use it in the calculation the same way, but just change the sign.

ahirner commented 8 years ago

@joffemd after a series of improvements, generalizations, preferring tables coming with same confidence from keyword search and more lenient rules of reckoning time-series and (year only instead of full date) the aforementioned document calculates with a plausible IRR. Please confirm: http://tabularazr.eastus.cloudapp.azure.com:7081/calculate_xirr/muni_bonds/ER942906-ER737178-ER1138804.pdf.txt

joffemd commented 8 years ago

@ahirner the results for this case look good. Would it be possible to make two quick cosmetic changes? (1) For the IRR result, move the % sign to the right of the value (change %4.13 to 4.13%), and (2) format all the cashflows with two decimal places and thousands separators (as commas).

ahirner commented 8 years ago

Done. I leave this issue open for further improvements.

Next step with greatest impact is to leverage synonyms for line items in the funds table as mentioned above. This Flint bond is an example where vocab is rather deviant.

joffemd commented 8 years ago

How difficult would it be to extract all the phrases found in Sources and Uses tables across a large number of bond documents and then count the number of occurrences. I could then classify the ones that are appear most frequently.

ahirner commented 8 years ago

Not at all, project wide filtering is already implemented as an endpoint. For the current online repository and the Sources and Uses filter this looks like this: http://tabularazr.eastus.cloudapp.azure.com:7081/filter_tables/muni_bonds?filter=funds

I would suggest a more convenient output for manual annotation such as an .xls file. Also, a batch upload script (curl POST) should be written for, lets say 1000 samples. This can be done within half a day.

joffemd commented 8 years ago

Thanks, Alex. There are 14,000 Official Statement PDFs in this folder: https://www.dropbox.com/sh/o3v8snzktn9eyb8/AAB1ELlc9vo6Mp9TkA_3P6uLa?dl=0

From: Alexander Hirner [mailto:notifications@github.com] Sent: Sunday, February 28, 2016 10:26 AM To: ahirner/TabulaRazr-OS TabulaRazr-OS@noreply.github.com Cc: Marc Joffe marc@publicsectorcredit.org Subject: Re: [TabulaRazr-OS] Make XIRR calculation working for more cases (#1)

Not at all, project wide filtering is already implemented as an endpoint. For the current online repository and the Sources and Uses filter this looks like this: http://tabularazr.eastus.cloudapp.azure.com:7081/filter_tables/muni_bonds?filter=funds

I would suggest a more convenient output for manual annotation such as an .xls file. Also, a batch upload script (curl POST) should be written for, lets say 1000 samples. This can be done within half a day.

— Reply to this email directly or view it on GitHubhttps://github.com/ahirner/TabulaRazr-OS/issues/1#issuecomment-189917604.